Are You Using the Right Analytics to Keep Your Applications Running Smoothly?

Aligning the right metrics to the right use case allows for timelier reporting and reduces the risk your application could fail.

Analytics are essential to the successful operation of every modern SaaS application. Effectively managing a SaaS application requires continuous tracking of its performance, what’s going on inside the application, and whether or not it’s accomplishing its goals.

However, there is a wide variety of analytics that need to be monitored and tracked to successfully run an application. The purpose, value, accuracy, and reliability of those analytics vary greatly depending on how they are measured, how they are used, and who makes use of them.

There are essentially three different classes of analytics, each with radically different use cases:

Class A analytics

Class A analytics are metrics that are essential to running your application. Without these analytics, your application could fail in real time. These metrics are used to evaluate the operation of the application and dynamically make any needed adjustments to keep the application functioning.

The analytics are part of a feedback loop that constantly monitors and improves the operational environment of the application.

A prime example of Class A analytics is the set of metrics used for autoscaling. These metrics are used to dynamically change the size of your infrastructure to meet current or expected demand as the load on the application fluctuates.

A well-known example is the AWS Auto Scaling cloud service. This service will automatically monitor specific Amazon CloudWatch metrics, looking for triggers and thresholds. If a specific metric reaches specific criteria, AWS Auto Scaling will add or remove Amazon EC2 instances from an application, automatically adjusting the resources that are used to operate the application. It will add instances when additional resources are needed, and remove those instances when the metrics indicate the resources are no longer needed.

AWS Auto Scaling allows you to create a service, composed of any number of EC2 instances, and automatically add or subtract servers based on traffic and load requirements. When traffic is lower, fewer instances will be used. When traffic is higher, more instances will be used.

As an example, AWS Auto Scaling might use a CloudWatch metric that measures the average CPU load of all the instances being used for a service. Once the CPU load goes above a certain threshold, AWS Auto Scaling will add an additional server to the service pool.

Note that if those Amazon CloudWatch metrics are unavailable or inaccurate, the algorithm cannot function. Either too many instances will be added to the service, which wastes money, or too few will be added, which could result in the application browning out or failing outright.
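To make the feedback loop concrete, here is a minimal sketch of a threshold-based scaling decision driven by average CPU load. It is not how AWS Auto Scaling is implemented; the function name, the 70 and 30 percent thresholds, and the fleet limits are all hypothetical values chosen for illustration.

```python
from typing import Optional, Sequence


def desired_instance_count(
    cpu_samples: Sequence[float],
    current_instances: int,
    scale_out_above: float = 70.0,  # hypothetical threshold, percent CPU
    scale_in_below: float = 30.0,   # hypothetical threshold, percent CPU
    min_instances: int = 1,         # hypothetical fleet limits
    max_instances: int = 10,
) -> Optional[int]:
    """Return the new fleet size, or None when no decision can be made."""
    if not cpu_samples:
        # No metric data: the feedback loop is blind and cannot decide.
        return None
    average_cpu = sum(cpu_samples) / len(cpu_samples)
    if average_cpu > scale_out_above:
        # Load is high: add one instance, up to the fleet maximum.
        return min(current_instances + 1, max_instances)
    if average_cpu < scale_in_below:
        # Load is low: remove one instance, down to the fleet minimum.
        return max(current_instances - 1, min_instances)
    return current_instances
```

The None return is the point of the sketch: when the metric is missing, the function has nothing to base a decision on, and an inaccurate metric would be worse still because it would return a confident but wrong fleet size.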

Clearly, these metrics are truly essential. The very operation of the application is jeopardized if they are not available and correct. As such, they are considered Class A metrics.

AWS Elastic Load Balancing is another great example. AWS automatically adjusts the size and number of instances necessary to operate the traffic load balancing service for a particular use case, depending on the current amount of traffic going to each load balancer. As traffic increases, the load balancer is moved automatically to larger instances or more instances. As traffic decreases, the load balancer is moved automatically to smaller instances or fewer instances. All of this is automatic, based on internal algorithms making use of specific CloudWatch metrics. If those metrics are not available or they are incorrect, the load balancer won’t size appropriately, and the ability of the load balancer to handle the traffic load could suffer.

Class B analytics

Class B analytics are metrics that are not business-critical, but are used as early indicators of impending problems, or are used to solve problems when they arise. Class B analytics can be important for preventing or recovering from system outages.

Class B metrics typically give insights into the internal operation of the application or service, or they give insights into the infrastructure that is operating the application or service. These insights can be used proactively or reactively to improve the operation of the application or service.

Proactively, Class B metrics can be monitored for trends that indicate an application or service might be misbehaving. Based on those trends, the metrics can be used to trigger alerts to indicate that the operations team must examine the system to see what might be wrong.
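As an illustration of proactive use, the sketch below compares recent latency samples against the earlier history and flags a sharp rise. The function name, the window size, and the 1.5x ratio are assumptions made for this example; real monitoring products use their own, usually more sophisticated, detection methods.

```python
from statistics import mean
from typing import Sequence


def latency_trend_alert(
    latencies_ms: Sequence[float],
    window: int = 5,                  # hypothetical number of recent samples
    max_increase_ratio: float = 1.5,  # hypothetical alerting ratio
) -> bool:
    """Return True when the mean of the last `window` samples exceeds
    the mean of all earlier samples by more than `max_increase_ratio`."""
    if len(latencies_ms) < 2 * window:
        return False  # not enough history to compare against
    baseline = mean(latencies_ms[:-window])
    recent = mean(latencies_ms[-window:])
    return baseline > 0 and recent > baseline * max_increase_ratio
```

A True result here would page a person rather than change the system, which is what keeps this in Class B territory: if the check is wrong or missing, the application keeps running.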

Reactively, during a system failure or performance reduction, Class B metrics can be examined historically to determine what might have caused the failure or the performance issue, in order to determine a solution to the problem. These metrics are often used during site failure events, and afterward during postmortem examinations.

During a failure event, Class B metrics are used to quickly determine what went wrong and how to fix the problem. Afterward, they are used to improve the Mean Time to Detection (MTTD), the average time it takes to find a problem during an outage, and the Mean Time to Repair (MTTR), the average time it takes to fix a problem during an outage. Reducing both is a critical goal for high-performance SaaS applications.
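As a rough sketch of how these two averages might be computed from incident records, consider the following. The Incident structure and its field names are hypothetical, and the example assumes repair time is measured from detection to resolution; some teams measure it from the start of the outage instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence, Tuple


@dataclass
class Incident:
    """Hypothetical incident record with three timestamps."""
    started: datetime
    detected: datetime
    resolved: datetime


def mttd_and_mttr(incidents: Sequence[Incident]) -> Tuple[timedelta, timedelta]:
    """Return (mean time to detection, mean time to repair)."""
    if not incidents:
        raise ValueError("at least one incident is required")
    # Sum the per-incident durations, then divide by the incident count.
    detect_total = sum((i.detected - i.started for i in incidents), timedelta())
    repair_total = sum((i.resolved - i.detected for i in incidents), timedelta())
    return detect_total / len(incidents), repair_total / len(incidents)
```

For two incidents detected after 10 and 20 minutes and repaired 30 and 60 minutes after detection, this yields an MTTD of 15 minutes and an MTTR of 45 minutes.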

Yet these metrics are not as critical as Class A metrics. If a Class A metric fails, your application could fail. If a Class B metric fails, your application won’t fail. However, if your application has an issue, it might take longer to find and fix the problem if your Class B metrics aren’t functioning correctly.

There are many examples of Class B metrics, and there are many companies focused on generating these metrics, such as AppDynamics, Datadog, and Dynatrace. Class B metrics can also include logging and other metrics from companies such as Elastic and Splunk.

Class C analytics

Class C analytics involve metrics that are used for offline application analysis and longer-term planning purposes. Class C analytics are often used to determine the strategy and product direction of an application.

These metrics may be examined in real time, as Class A and Class B metrics are, or they may be issued and examined periodically, such as weekly, monthly, or quarterly.

Class C metrics are used for business analysis, such as analyzing customer traffic patterns, time on site, referring sites, and bounce rates. They can be used for sales reports and sales funnels. They can be used for financial reports and auditing purposes.

Some shops test new application features or new wording for their websites by showing two or more different versions of the feature to customers, and analyzing metrics to see which one performs better. This is called A/B testing, and the metrics used are Class C metrics.
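One common way to judge which version performs better is to compare conversion rates with a two-proportion z-test. The sketch below is a minimal example of that technique, not a description of any particular vendor's method; the function and parameter names are hypothetical.

```python
import math


def ab_test_z_score(
    conversions_a: int,
    visitors_a: int,
    conversions_b: int,
    visitors_b: int,
) -> float:
    """Return the two-proportion z-score for version B versus version A.

    A positive value means B converted at a higher rate than A.
    """
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled conversion rate under the assumption that A and B do not differ.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_error = math.sqrt(
        pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b)
    )
    if std_error == 0:
        return 0.0  # nobody converted, or everybody did: no measurable difference
    return (rate_b - rate_a) / std_error
```

An absolute z-score above roughly 1.96 corresponds to the conventional 5 percent two-sided significance level. A calculation like this can run as an occasional batch job, which fits the offline nature of Class C metrics.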

There are many companies that provide Class C metrics, but by far the best-known Class C metrics provider is Google Analytics.

Not all analytics are created equal

Different metrics have different consumers, and who consumes a metric depends on the class it belongs to:

  • Class A metrics are mostly consumed by automated systems and are used internally by systems and processes. They are used to dynamically and automatically update critical operational resources in order to keep a system healthy and scaled appropriately.
  • Class B metrics are mostly consumed by operations and support teams, along with development teams, as part of the incident response process. They can provide immediate assistance to teams in identifying and fixing problems, and generally help in preventing problems before they occur.
  • Class C metrics are mostly consumed by business planners, product managers, and corporate executives. They are used to drive longer-term business decisions, business modeling, product design, and feature prioritization.

Additionally, and perhaps most importantly, systems that collect and process analytics have different priorities within your application. Problems collecting Class A metrics are mission-critical problems. A failure of a Class A metric could result in automated infrastructure tools doing the wrong thing and ultimately result in brownouts or blackouts.

By contrast, problems collecting Class C metrics are not necessarily cause for alarm, and addressing a Class C issue could be postponed for hours, days, or even longer.

Be very careful when deciding how to use a metric; using a metric for the wrong purpose can be disastrous. For example, don’t use a Class B metric, such as “application latency,” to dynamically and automatically allocate system resources, such as scaling your server fleet up and down. Why? Because using Class B metrics in mission-critical use cases such as this introduces unnecessary risk into your application.

Let’s say you are receiving metrics from an application performance monitoring company, which are typically classified as Class B metrics. Using their reported “application latency” to determine fleet scaling would leave you open to potential problems. If your application performance monitoring company had an outage, you would not be able to scale your fleet correctly, which could cause an outage of your own. This means that your application performance monitoring company is now a mission-critical component of your application, where before it may have just been a useful and valuable tool for diagnosing problems.

As another example, don’t rely on a Class C metric, such as “shopping cart abandon rate,” as the primary way of identifying an operations availability problem in your cart service. The metric is too far away from the problem, and would not give you a timely indication of a problem in need of resolution. A report that “sales are down this week due to an increase in cart abandons” is too little and too late to assist you in debugging earlier cart service problems.

Using the right metric for the right purpose will increase the usefulness of your analytics, allow timely reporting, and reduce risk to your application and business.

More articles from Lee Atchison:

Image by Reto Scheiwiller from Pixabay.