A modern approach to monitoring applications with a set of “off-the-shelf” open source tools.
[Nir Dotan is a system architect at Amdocs, and one of those people I enjoy working with. Here’s what he has to say about monitoring large scale systems.]
If you have a software application running in production, you surely want to monitor it, in order to make sure that it is doing what it’s supposed to. Because if it isn’t, then someone is probably losing money, and it may very well be your operation.
It seems that monitoring entered the IT world from the direction of its next-door neighbor, the network family. These two families may be moving in together, because it seems that their real-estate agent, Mr. Virtual Machine, sold them the same house (or should I say leased?), but that’s a different topic altogether.
Back in the kingdom of network elements, where SNMP and PING are king and queen, it is often a complex and expensive operation to set up effective monitoring, especially if it’s a large network with lots of proprietary equipment.
In the IT world, things are actually getting easier and cheaper to monitor, despite or maybe thanks to the industry’s migration to highly distributed architectures and the spread of cloud.
If you have basic development skills and just a little bit of Linux background, you can set up a pretty effective monitoring system with relatively little effort and no license costs (FOSS), and most importantly, you might enjoy the journey. I certainly did, or actually, I still am.
Monitor applications with business and user experience in mind
The applications provide services to our customers. The first thing we want to make sure of is that they are available to their users. We also need to verify that response times are reasonable, that error rates are low, and that backlogs are not growing in our asynchronous queues; if they grow too rapidly, we may not be able to catch up. Once again, the enemy is not a CPU or a disk. The enemy is poor customer experience, which must be avoided.
Having said this, of course we do want to monitor our OS and HW. Applications cannot serve customers well if they don’t have sufficient resources.
Choosing the right tools
There are many FOSS monitoring tools out there. The first problem with many of them is that they simply do not scale, so look out for that.
The second problem is that, apparently, there’s no single tool that does everything that is needed and excels at it. In my opinion, at least two different tools, if not more, are required.
So what are the centerpieces of the required functionality?
- Collect metrics from your applications
- Aggregate metrics collected from different instances into summary level metrics
- Persist metrics to a longer term storage
- Perform trend calculations and render trend graphs to a UI
- Display service statuses clearly in a UI dashboard
- Alert by email or text upon status changes.
My pick
For collection and aggregation, I would go with Etsy’s StatsD. As a matter of fact, I would take a close look at anything that Etsy is using or developing in the DevOps area; they are quite a leader in this industry and have many devoted followers.
You can use a StatsD client from your application code in most programming languages, and send your metrics to the StatsD aggregator server as UDP messages, with minimal impact on your application’s performance. Sending a metric is as easy as this (in PHP):
StatsD::increment("grue.dinners");
Be sure to read all about it on Etsy’s blog: http://codeascraft.com/2011/02/15/measure-anything-measure-everything/
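To make the "UDP message" part concrete, here is a minimal Python sketch of what a StatsD client does under the hood for a counter. It is not Etsy's client library: the function names `format_counter` and `send_metric` are made up for illustration, and the host and port are assumptions (8125 is the port StatsD listens on by default, but your setup may differ).

```python
import socket


def format_counter(name: str, value: int = 1) -> bytes:
    """Encode a counter in the StatsD line format: <name>:<value>|c"""
    return f"{name}:{value}|c".encode("ascii")


def send_metric(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one UDP datagram and return immediately.

    Nothing is read back, so the application never waits on the
    monitoring system. Host and port are assumptions for this sketch.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

Calling `send_metric(format_counter("grue.dinners"))` would be the rough equivalent of the PHP one-liner above. Because UDP is fire-and-forget, a metric can be lost if the aggregator is down, which is usually an acceptable trade for keeping the application fast.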
Now that we have metrics, we can finally do something interesting and get to the central piece of this post: Graphite, which integrates natively with StatsD.
Graphite is a highly scalable product which specializes in efficient storage, trend calculations and graph presentation in its own webapp UI. In addition, it exposes a simple yet very robust graph rendering API, which you can access from other systems.
Graphite is great and very popular, and there’s a ton of information and tips that you can find online. The best thing about Graphite, in my opinion, is that it is very simple and intuitive to use. I could easily write 10 more pages about Graphite, but instead I’ll refer you to the best resources that I’ve come across, in addition to the reference documentation.
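As a rough illustration of how another system might call the rendering API, the following Python sketch builds a URL for Graphite's `/render` endpoint using its `target`, `from` and `format` parameters. The helper name `render_url`, the host `graphite.example.com` and the metric path `stats.grue.dinners` are assumptions for the example; the actual path depends on how your StatsD and Graphite instances are configured.

```python
from urllib.parse import urlencode


def render_url(base_url: str, target: str, window: str = "-1h", fmt: str = "json") -> str:
    """Build a Graphite render URL for one target over a relative time window.

    base_url and target are placeholders chosen for this sketch.
    """
    query = urlencode({"target": target, "from": window, "format": fmt})
    return f"{base_url.rstrip('/')}/render?{query}"


# Hypothetical host and metric path, for illustration only.
url = render_url("http://graphite.example.com", "stats.grue.dinners")
```

With `format=json` the endpoint returns the raw datapoints instead of a PNG image, which is the form that dashboards and alerting checks typically consume.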
Jason Dixon has valuable tips and insights, which he blogs about. I also recommend following Jason’s installation instructions webcast. Once you’re up to speed and really want to understand how the product works, read this article written by Chris Davis, the master himself who’s behind Graphite.
While Graphite is highly regarded as a great product, it falls short in 2 areas:
UI
The Graphite UI does not look good. It lacks the trendy dashboard widgets concept, and does not make use of modern JS charting libraries that people have come to expect.
That’s not a very big problem, because there are many “pretty faces” that you can add on top of Graphite. Here are my favorites:
- The relatively new Grafana looks great and supports UI-based dashboard configuration. You’ll be up and running in 10 minutes. This is my first pick for a Graphite-dedicated dashboard.
- Dashing is more suitable as a general purpose dashboard. Use it if you also want to present data that is not pulled from Graphite. With this one, you will not have a Graphite graph widget up and running in 10 minutes.
- Seyren may be a solution that solves both this problem and the next one.
Alerting
Graphite lacks alerting capabilities. For some people, alerting is the most important aspect of monitoring. While I favor the approach of taking it easy on the alerts, you do need to be able to alert in critical situations. I must admit that I did not find a perfect FOSS solution to this. The most common solution is Nagios or one of its forks. Be warned that they are all licensed under various types of GPL, which, at least in my organization, is hard to get past the legal department.
If you do use Nagios just for status and alerting, there’s no real need to start monitoring all your servers, which will quickly lead you into Nagios’ configuration nightmare. All you have to do is monitor the Graphite HTTP API, and for this I recommend going with Jason Dixon’s version of the check_graphite plugin. Read all about it in his blog post.
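To show the idea behind such a check, here is a small Python sketch of the decision logic only. It is not Jason Dixon's plugin, and the function names and thresholds are invented for illustration. It assumes the JSON shape that Graphite's render API returns with `format=json` (a list of series, each with `datapoints` as `[value, timestamp]` pairs) and maps the latest value to the standard Nagios plugin exit codes.

```python
import json
from typing import Optional, Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def latest_value(render_json: str) -> Optional[float]:
    """Return the most recent non-null value of the first series, if any."""
    series = json.loads(render_json)
    if not series:
        return None
    # Graphite pads recent intervals with nulls, so walk backwards.
    for value, _timestamp in reversed(series[0]["datapoints"]):
        if value is not None:
            return float(value)
    return None


def check(render_json: str, warn: float, crit: float) -> Tuple[int, str]:
    """Compare the latest value against thresholds (higher is worse)."""
    value = latest_value(render_json)
    if value is None:
        return UNKNOWN, "UNKNOWN: no datapoints"
    if value >= crit:
        return CRITICAL, f"CRITICAL: {value}"
    if value >= warn:
        return WARNING, f"WARNING: {value}"
    return OK, f"OK: {value}"
```

A real plugin would fetch the JSON over HTTP, print the message, and exit with the returned code. Keeping the threshold logic in a pure function like this makes it easy to test without a running Graphite server.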