In this post, we’ll cover how to plan your application performance monitoring program.  An effective program allows you to do the following:

  • Get insight into how your applications are performing on your users’ devices.  Locate problem areas to address.
  • Measure the performance of your applications on your back-end infrastructure.  Identify bottlenecks for optimization.
  • Get alerted quickly if performance degrades, allowing for fast time to resolution.

All of these items ultimately contribute to a better user experience by delivering a problem-free, responsive application.  The principles here primarily apply to your production systems, but they can also be used to monitor the performance of your development or test environments.  They are relevant whether you are hosting in a dedicated data center or in the cloud.

Key Business Metrics

First and foremost, you should have defined a set of metrics that reflect the business performance of your product.  These are generally the actions a user takes as they engage with your product, whether on the web or in a mobile app.  Examples of user actions are registrations, logins, messages sent, product SKU views, photos uploaded, searches performed, items added to a shopping cart, sales, etc.  These metrics are usually the type of interactions that a product manager would care about.  If you are asking why the Engineering team would want to track these, it’s because they are the metrics that determine whether your business will still be open in a year, and they are what gets discussed at Exec team meetings.  They provide a clear view of whether the product is performing as expected.

The reason you want to track as many of these metrics as possible is that together they form an overall view of the product’s performance.  The actions selected shouldn’t just be at the end of the sales funnel, like making a payment.  They should capture every step in a set of user interactions.  If you track just the final outcomes, like sales, you might not be able to figure out why sales have suddenly dropped.  For example, sales could drop if users can’t register on the site, indicating an issue with your registration system (maybe the Facebook API is down, blocking registration with Facebook credentials).  However, sales might also drop if your payment provider is offline (while registration counts remain in line).  Tracking as many business metrics as possible allows you to quickly make these distinctions.

These metrics need to be collected and graphed.  Don’t just rely on the summary BI dashboards that the product managers examine.  That data is usually aggregated hourly or daily, which is too delayed to identify a problem in real time.  You should collect these business metrics continuously (ideally per minute).  The key to graphing granularity is that you should be able to see a very distinct point in time where the metric changed.  This aids immensely in troubleshooting, as you can associate the change with a planned event, like a code release or a scheduled maintenance.  Your graphs of these metrics can be generated in the same system you use for operational metrics (I have used Ganglia in the past for this).  Just find out from the product managers or analysts which fields in your transactional database are being used as the source for each metric in their report.  Your graphing system can query the same fields.
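
To make this concrete, here is a minimal sketch of such a per-minute collector.  The `registrations` table, its `created_at` column, and the Graphite host are hypothetical placeholders; swap in whatever table and field the product managers’ reports actually use, and whatever graphing backend you run (the same value could be fed to Ganglia instead).

```python
#!/usr/bin/env python3
"""Per-minute business metric collector -- a minimal sketch, run from cron."""
import socket
import sqlite3
import time

DB_PATH = "app.db"                            # stand-in for your transactional database
GRAPHITE_HOST = ("graphite.internal", 2003)   # hypothetical Graphite plaintext listener

def registrations_last_minute(conn):
    # Count rows created in the last 60 seconds (hypothetical schema).
    cutoff = time.time() - 60
    row = conn.execute(
        "SELECT COUNT(*) FROM registrations WHERE created_at >= ?", (cutoff,)
    ).fetchone()
    return row[0]

def send_to_graphite(name, value):
    # Graphite's plaintext protocol: "<path> <value> <timestamp>\n"
    msg = f"business.{name} {value} {int(time.time())}\n"
    with socket.create_connection(GRAPHITE_HOST, timeout=5) as sock:
        sock.sendall(msg.encode())

if __name__ == "__main__":
    with sqlite3.connect(DB_PATH) as conn:
        send_to_graphite("registrations_per_minute", registrations_last_minute(conn))
```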

Client Application Monitoring

Next, you should have a view into the performance of your applications from the client’s perspective.  This is usually accomplished by simulating the client application through a set of emulators.  These emulators can represent different browsers or mobile device runtimes, which are then loaded with scripted transactions that mimic the path a user would take through your application.  With these emulators, the load times and responsiveness of the application are measured and recorded.  This approach to measuring client application performance is called synthetic monitoring.

The simplest way to get a view of this kind of data is to run a tool within your own browser.  An example is Chrome Developer Tools, which provides an easy way to view web content loading on demand from your location.  In early development, this is a quick way to start benchmarking and optimizing your content load times.

On a live product with many users, you want to gather this kind of performance data continuously from many geographically dispersed points.  This is where synthetic monitoring vendors can help.  Two examples are Dynatrace and New Relic.  These services let you set up scripts of common user interactions or paths through your application.  You can then program the paths to be repeated by the emulator at some interval, say every 5 minutes.  You can also choose geographic locations across the globe where you want the emulators to run.  Good vendors will have more than 100 nodes to choose from.  However, in my experience you can glean enough insight from 5-10 nodes.  Depending on the distribution of your audience, you would want at least two nodes in the US, then one on each continent.  While the temptation would be to max out the node count and test frequency, most vendors charge based on the number of measurements, so more nodes and higher frequency mean more cost.
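
If you want to prototype this kind of scripted check yourself before committing to a vendor, here is a minimal sketch using Selenium.  The URLs and user path are hypothetical placeholders; a real vendor setup layers geographic nodes, waterfall capture, and scheduling on top of this same idea.

```python
"""A home-grown synthetic check -- a sketch, assuming Selenium and chromedriver are installed."""
import time
from selenium import webdriver

# Hypothetical user path through the application.
USER_PATH = [
    "https://www.example.com/",
    "https://www.example.com/login",
    "https://www.example.com/search?q=shoes",
]

driver = webdriver.Chrome()  # a headless browser works for scheduled runs
try:
    for url in USER_PATH:
        start = time.time()
        driver.get(url)  # blocks until the page load event fires
        # The browser's Navigation Timing API reports its own load time in ms.
        load_ms = driver.execute_script(
            "return performance.timing.loadEventEnd - performance.timing.navigationStart"
        )
        print(f"{url}: wall={time.time() - start:.2f}s browser={load_ms}ms")
finally:
    driver.quit()
```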

Once the data is collected, these services provide aggregate reports showing overall load/traverse times for your user experiences, with the ability to drill into any one of them to troubleshoot problem areas.  This is usually accomplished through interactive reports with a rich UI.  You can also set up monitors to alert you if a particular event or sequence exceeds a maximum expected time.  The idea is to notify you if your application is down or severely degraded.  It may take some time to tune your monitors, though, so expect some false alerts at first.

If the vendor solutions don’t fully meet your needs or you want to track a more customized action, you can also accomplish this kind of monitoring by embedding code into your client applications that makes callbacks to a central server.  The callbacks are usually made over HTTP and carry a payload of data indicating the user event, the time, and other metadata associated with the interaction.  The central server then parses this payload and writes it to some sort of log file.  These log files can be streamed to your data analytics system (like Hadoop), where they can be processed and aggregated for reports.  If you don’t want to set up your own callback collection server and Hadoop integration, you can send callbacks to a third-party analytics system, like Google Analytics.
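
As an illustration of the collection side, here is a minimal sketch of such a callback endpoint, assuming clients POST a small JSON payload.  The log path is a hypothetical placeholder, and a production service would add batching, authentication, and a real web server in front.

```python
"""A minimal callback-collection endpoint -- a sketch, not production code."""
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

LOG_PATH = "client_events.log"  # hypothetical destination for the event stream

class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            event = json.loads(self.rfile.read(length))
        except json.JSONDecodeError:
            self.send_response(400)
            self.end_headers()
            return
        # One JSON object per line keeps downstream log processing trivial.
        with open(LOG_PATH, "a") as log:
            log.write(json.dumps(event) + "\n")
        self.send_response(204)  # no response body; keep the callback cheap
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), CallbackHandler).serve_forever()
```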

Application Servers

Monitoring your application servers is sometimes referred to as APM, or Application Performance Management.  This type of monitoring is focused on gathering two things: the transactions running on your app servers and the time it takes to complete each transaction.  Transactions here are usually defined as an HTTP request and its subsequent response.  The transactions are then aggregated by type, usually the name of the script called to process the request, and response time is averaged.  This data is graphed, allowing an engineer to view server performance by transaction type.  Similar to client application monitoring, this approach also allows the viewer to drill into an individual transaction and see processing times by component.  Components can be code execution by function, database calls, or other external dependencies.

Since this type of data is available in your app server’s logs, you could roll your own simple solution.  I have seen teams write basic Unix scripts that tail an app server’s logs, parse out the individual script names from URLs, retrieve the run times, and then aggregate them in a simple graphing tool.  This approach can get you off the ground, but it will require a lot of maintenance and can become unwieldy quickly.
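
For illustration, here is a rough Python equivalent of those scripts.  It assumes a hypothetical access log format where each line ends with the response time in milliseconds (Apache’s %D or nginx’s $request_time can be configured to provide a timing field like this).

```python
"""A sketch of the roll-your-own APM approach: average response time per URL path."""
import re
import sys
from collections import defaultdict

# Matches lines like: ... "GET /checkout/pay.php?id=1 HTTP/1.1" 200 512 ... 123
LINE_RE = re.compile(r'"[A-Z]+ (?P<path>[^ ?"]+)[^"]*".* (?P<ms>\d+)$')

totals = defaultdict(lambda: [0, 0])  # path -> [request count, total ms]

for line in sys.stdin:  # e.g. python apm_agg.py < access.log
    m = LINE_RE.search(line)
    if not m:
        continue
    stats = totals[m.group("path")]
    stats[0] += 1
    stats[1] += int(m.group("ms"))

for path, (count, total_ms) in sorted(totals.items()):
    print(f"{path}: {count} reqs, avg {total_ms / count:.1f}ms")
```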

I have used two different vendors for server-side APM: New Relic and AppDynamics.  New Relic is a cloud-based service with code plug-ins for most server types, and it is easy to install and configure.  They provide very rich, intuitive reporting interfaces.  You can view average response times aggregated for the entire application and also broken out by transaction type.  For each transaction, you can further expand the run times into individual components.  This allows you to quickly identify bottlenecks, whether in code execution, database access, or external services.  The data and graph updates occur in near real time.  My team always kept the New Relic dashboard in view during deployments and other planned maintenance so we could quickly see if the event had an impact on application performance.  This tool was invaluable in determining whether a release was stable or needed to be rolled back.  AppDynamics provides a similar set of features.  It does require a bit more configuration than New Relic, as we ran the collector/aggregator software installed on our own servers in our data center.  However, we were also able to get a very useful view in AppDynamics that showed a visual map of all back-end servers (app, database, file system, etc.) with the links between them.  This allowed for quick diagnosis of application performance issues that were tied to a dependency on a particular degraded system.

In our configuration at Zoosk, we used New Relic to monitor performance on our front-end app servers, which were a combination of Apache and node.js.  With AppDynamics, we monitored the performance of the services behind the front-end app servers, which ran on Tomcat and nginx.  Here are two screenshots from New Relic and AppDynamics, showing examples of the different views you can get.

[Screenshot: New Relic performance dashboard]

[Screenshot: AppDynamics performance dashboard]

Other System Monitoring

You should also have performance monitors in place for the other major components of your back-end infrastructure.  In some cases this measurement can be performed with the same APM tools; in others it requires individual solutions, depending on the system.  I have measured the performance of back-end systems in the past using Ganglia for data gathering and graphing.  It has a client for most major Unix-based systems.  The client runs on each system component, gathering data and forwarding it to a central Ganglia server.  The Ganglia server aggregates the data and then uses RRDtool for data storage and visualization.  There are other solutions for this as well, and most cloud providers offer a performance monitoring service integrated into their stack (CloudWatch on AWS).
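
As a small example of feeding a custom metric into this pipeline, here is a sketch that shells out to Ganglia’s gmetric tool.  It assumes a local gmond agent is running; the metric name and its source are hypothetical.

```python
"""Pushing a custom metric into Ganglia via the gmetric CLI -- a minimal sketch."""
import subprocess

def gmetric(name, value, units):
    # gmetric ships with Ganglia; --type must match the value being sent.
    subprocess.run(
        ["gmetric", "--name", name, "--value", str(value),
         "--type", "uint32", "--units", units],
        check=True,
    )

# Example: report an outbound mail queue depth (hypothetical source).
gmetric("mail_queue_depth", 42, "messages")
```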

Here is a list of some example back-end systems that can be measured.  Basic hardware and OS-level metrics should be collected on all servers: CPU utilization, load, network throughput, and disk usage.

  • Messaging systems.  If you are using message queues to pass work items between systems, you want to track how many messages are in each queue awaiting processing (see the sketch after this list).  This gives you a sense of throughput.  If the number of messages in a queue continues to increase, it can indicate a capacity issue or an error condition.
  • Databases.  You definitely want to collect and graph common database processing activities.  This includes all query types: selects, updates, inserts, and deletes.  A view of slow queries also helps identify when a particular query needs optimization or when overall database performance is starting to degrade.  There are many more metrics associated with each individual data storage system that you can collect and graph.
  • Email servers.  If you run your own mail servers for sending customer emails, you will want to graph the outgoing email queues.  This is particularly insightful if split by ISP (Gmail, Yahoo, Hotmail, etc.).  Major email ISPs will sometimes throttle incoming mail and you will see this reflected in the outbound queue for that ISP. This will alert you that email to a particular customer set may be delayed.
  • Network equipment.  Most network hardware vendors allow you to pull basic performance data from their equipment.  Where available, it is useful to graph network throughput and the performance of hardware subsystems, like CPU.  This applies to your switches and routers.  As an example, graphing traffic on a 10Gb switch can show whether you are approaching saturation.
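
Here is the queue-depth sketch referenced in the messaging bullet above.  It assumes the work queues are Redis lists with hypothetical names and that the redis-py client is installed; a RabbitMQ deployment would query its management API instead.  The printed values can be shipped to whatever graphing system you use (Ganglia, Graphite, CloudWatch).

```python
"""Queue-depth check for message queues backed by Redis lists -- a sketch."""
import redis

QUEUES = ["email_outbound", "image_resize", "search_index"]  # hypothetical names

r = redis.Redis(host="queue.internal", port=6379)  # hypothetical host

for name in QUEUES:
    depth = r.llen(name)  # number of messages awaiting processing
    print(f"queue.{name}.depth {depth}")
    # A steadily growing depth suggests a capacity issue or a stuck consumer.
```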

This is just a subset of the types of systems that could be running in your back end.  The basic idea is to collect and graph a set of metrics relevant to measuring the performance of each system.  If performance issues manifest further up in the application stack, you can then trace them back to a particular back-end system that may be overloaded.

Alerting

Beyond graphing, you should have a system in place that monitors these same health metrics and generates a warning when a metric crosses a certain threshold.  A traditional solution for this is Nagios.  Nagios can monitor basic host health checks and network availability.  It also has a suite of plug-ins for most open source servers that allows for service checks relevant to each server’s function.  Your Ops personnel can set thresholds for generating an alert and decide how that alert is sent.  Alerts can be organized into levels, like warning and critical.  Alerts should be sent to your Ops personnel via SMS and email.  They can also be injected into your team collaboration software, like Slack or IRC.  There are helpful vendor services for managing the distribution of these alerts to an on-call rotation of Ops personnel; the most popular of these is PagerDuty.
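
To show how service checks plug in, here is a minimal sketch of a Nagios-style plugin.  The plugin contract is standard (print one status line and exit 0, 1, or 2 for OK, WARNING, or CRITICAL), but the metric source and thresholds here are hypothetical placeholders.

```python
#!/usr/bin/env python3
"""A minimal Nagios-style check plugin -- a sketch."""
import sys

WARN, CRIT = 100, 500  # thresholds your Ops team tunes over time

def read_metric():
    # Placeholder: a real check would read a queue depth, load average, etc.
    return 42

value = read_metric()
if value >= CRIT:
    print(f"CRITICAL - value is {value} (>= {CRIT})")
    sys.exit(2)   # Nagios treats exit code 2 as CRITICAL
elif value >= WARN:
    print(f"WARNING - value is {value} (>= {WARN})")
    sys.exit(1)   # exit code 1 is WARNING
print(f"OK - value is {value}")
sys.exit(0)       # exit code 0 is OK
```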