Monitoring XL Release and XL Deploy with open source tools
If you have experience running operations for any kind of application, you know how important application monitoring is. You do not want to be surprised by a sudden crash because your application had been running out of hardware resources for weeks. Even if you have insight into the hardware resources used, you want to be able to drill down into application usage and see what behaviour is causing the problems. Common questions we hear from our customers are: How many users are active on XL Release or XL Deploy? What exactly are they doing and when are they doing it? How does this behaviour relate to spikes in resource usage? Is our server setup capable of handling the number of releases we are currently doing? And how about next month when we scale up?
At XebiaLabs we run our own products in production to help us deliver our software to our customers. Like our customers, we have monitoring needs for our own tools. In this post I would like to show you how we've set up monitoring for our products and what is possible with the information our tools expose today.
The search for information
In the previous section we presented a couple of questions that monitoring could help answer. If you want to set up monitoring, you typically start by collecting information. Keeping the above questions in mind, let's have a look at which information is available today and where to get it.
- Operating system metrics. Every operating system exposes many high-level hardware resource metrics. Think of system load, CPU usage, memory usage, threads, disk usage, network I/O, etc. All these metrics say something about the applications running on the machine as well. If we can relate this information to application behaviour, we can determine the limits of the system.
- Log files. Log files expose a lot of information about the actual usage of our applications, but also about their erroneous behaviour. Our tools provide 3 types of log files out of the box:
  - Application log. This log file contains runtime information about the application and error messages from internal components.
  - HTTP access log. This log file is mostly useful for determining user behaviour. It also contains information about errors and request duration.
  - Audit log. The audit log contains information about events in the system.
- JMX. JMX is a technology available on the Java platform. It exposes valuable runtime information about the JVM and allows applications running on the JVM to publish information about themselves. When JMX is enabled, XLR and XLD expose a variety of MBeans. The information found here is twofold:
  - JVM. Generic information about the heap (memory) and garbage collection, JVM threads, JIT compilation, etc.
  - MBeans. Our applications expose MBeans that provide information about internal technical components like thread pools and JDBC connection pools. This information can be used to figure out whether the application is correctly configured. There are also MBeans exposing metrics about API usage and request duration, which can be used to understand how our applications are used.
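
To make the access log item above more concrete, here is a minimal sketch of how request counts, active users and server errors per minute could be derived from such a log. The line layout, the sample paths and the names `LINE_RE`, `summarise` and `SAMPLE` are assumptions made for illustration; they are not the documented log format of XL Release or XL Deploy, so the pattern would need to be adapted to whatever layout your installation writes.

```python
import re
from collections import defaultdict
from datetime import datetime

# Hypothetical access-log layout (an assumption, not the product's documented format):
# host - user [timestamp] "METHOD path protocol" status bytes duration_ms
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ (?P<duration_ms>\d+)$'
)


def summarise(lines):
    """Return {minute: {"requests": n, "users": n, "errors": n}} for parsable lines."""
    requests = defaultdict(int)
    errors = defaultdict(int)
    users = defaultdict(set)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        ts = datetime.strptime(match["ts"], "%d/%b/%Y:%H:%M:%S %z")
        minute = ts.strftime("%Y-%m-%d %H:%M")
        requests[minute] += 1
        if match["user"] != "-":
            users[minute].add(match["user"])  # distinct users seen in this minute
        if int(match["status"]) >= 500:
            errors[minute] += 1  # count server-side errors only
    return {
        minute: {
            "requests": requests[minute],
            "users": len(users[minute]),
            "errors": errors[minute],
        }
        for minute in sorted(requests)
    }


# Made-up sample lines; the API paths are placeholders.
SAMPLE = [
    '10.0.0.1 - alice [10/Oct/2016:13:55:36 +0200] "GET /api/v1/releases HTTP/1.1" 200 512 42',
    '10.0.0.2 - bob [10/Oct/2016:13:55:50 +0200] "POST /api/v1/releases HTTP/1.1" 500 0 1300',
    '10.0.0.1 - alice [10/Oct/2016:13:56:01 +0200] "GET /api/v1/tasks HTTP/1.1" 200 128 17',
]

if __name__ == "__main__":
    for minute, stats in summarise(SAMPLE).items():
        print(minute, stats)
```

Putting these per-minute numbers next to CPU or memory figures for the same minutes is the kind of correlation the questions at the start of this post ask for.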
The approach we took is to start collecting all the information from the sources we've discussed in the previous section. We collect it in 2 different databases, and we've built a single dashboard capable of aggregating the data from all sources and displaying it in a single overview. We've selected the following tools for our monitoring stack:
- CollectD. This general-purpose tool is a common way to collect operating system metrics on Linux systems. We also use it to read out JMX data and send it to our InfluxDB database.
- logstash-logback-encoder. We've used this Logback extension to push our application, access and audit log events directly to Logstash.
- Logstash. Logstash is a service that can parse and manipulate log streams and store them in Elasticsearch.
- Elasticsearch. This powerful search and aggregation engine powers the backend of our dashboard. The data from our log files is stored here.
- InfluxDB. This is another search and aggregation database specialised in storing time series data. The data from the OS metrics and from JMX is stored here.
- Grafana. We've chosen Grafana as our dashboard platform because it supports multiple data sources like Elasticsearch and InfluxDB. It's a very powerful tool specialised in visualising time series data.
We've selected open source tools because they are accessible. If you have bigger demands or need an easier setup, there are great commercial solutions available, such as AppDynamics, Dynatrace or New Relic.
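
In the stack described above, CollectD takes care of delivering metrics to InfluxDB. To give an idea of what a single time series point looks like once it gets there, the sketch below formats one in InfluxDB's text-based line protocol (measurement, tags, fields, timestamp). The function `to_line` and the measurement, tag and field names are hypothetical, and the sketch handles numeric fields only.

```python
def _escape(text, chars):
    """Backslash-escape the characters that are special in this part of a line."""
    text = str(text)
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def to_line(measurement, tags, fields, timestamp_ns):
    """Format one point as: measurement,tag=value field=value timestamp."""
    if not fields:
        raise ValueError("a point needs at least one field")
    # Measurement names escape commas and spaces; tags and field keys also escape '='.
    name = _escape(measurement, ", ")
    tag_part = "".join(
        f",{_escape(key, ',= ')}={_escape(value, ',= ')}"
        for key, value in sorted(tags.items())
    )
    # Numeric fields only in this sketch; every value is written as a float.
    field_part = ",".join(
        f"{_escape(key, ',= ')}={float(value)!r}"
        for key, value in sorted(fields.items())
    )
    return f"{name}{tag_part} {field_part} {int(timestamp_ns)}"


# Hypothetical measurement, tag and field names, chosen for illustration only.
line = to_line(
    "jvm_memory",
    {"host": "xlr-prod", "area": "heap"},
    {"used_bytes": 512.0},
    1476100536000000000,
)

if __name__ == "__main__":
    print(line)
```

Tags such as the host name are what make it possible to filter and group series in a dashboard, which is why they are kept separate from the measured values.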
Application monitoring dashboard
With everything in place, let's have a look at the first results. We've built a dashboard consisting of 4 parts. Let's have a look at the dashboard for our XL Release production instance.
This dashboard is meant to give an overview of the "health" of XL Release. The first graph shows HTTP request activity (green) and active users (yellow) over time. The middle graph shows request duration over time; the lines show the 50th, 75th and 99th percentile. The graph at the bottom shows a combination of HTTP and log file error counts. Having all information on a single page and plotted on the same time axis makes it easy to spot links between events. Notice also the blue and red vertical bars in all graphs. These are annotations marking events. Red bars mark server startups (with version information in a pop-up), and blue bars mark releases starting. These bars are shown in all other graphs too.
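
For readers less familiar with percentile lines, the following minimal sketch shows what the 50th, 75th and 99th percentile of a set of request durations mean, using the simple nearest-rank method. This is an illustration only: the dashboard relies on the aggregations of its data sources, which may compute percentiles differently, and the sample durations are made up.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted samples
    return ordered[rank - 1]


# Made-up request durations in milliseconds, including one slow outlier.
durations_ms = [12, 15, 18, 20, 22, 25, 30, 45, 60, 1500]
summary = {p: percentile(durations_ms, p) for p in (50, 75, 99)}

if __name__ == "__main__":
    print(summary)  # {50: 22, 75: 45, 99: 1500}
```

The single slow request barely moves the 50th percentile but dominates the 99th, which is why plotting several percentiles side by side says more than an average would.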
Technical implementation details
Everything described in this post is done with information that is available today and exposed by our products out of the box. For our monitoring stack we've used open source technology, and everything is done with pure configuration; no extensions were needed. Setting up monitoring for our products will hopefully make operations easier for you. But it can also serve a different purpose. If you ever run into performance issues, being able to provide us with this type of information is very helpful when we help troubleshoot your problems. Using our own and our customers' experiences, we are constantly looking for ways to improve our products so that they expose the information needed for monitoring. And some day we may want to provide an in-product solution for monitoring.