Keeping applications up and running is essential to successfully running a modern business. Follow these key steps to reduce your risk of downtime and keep customers happy.
As a software application scales, it invariably becomes more complex. And with that increase in complexity comes the increased risk of problems that could potentially impact the application’s availability.
Take, for example, the case of a well-known monitoring company that suffered from serious availability problems while it was growing from a small to a midsize company. Its traffic was increasing dramatically, but its infrastructure couldn’t keep up. Worse yet, the company didn’t always know when it was having problems, nor did it have the ability to anticipate them.
How do you avoid availability problems in your application? And how do you mature your application as you scale so that you can meet your customers’ growing demand?
It’s not easy.
Improving application availability is not about writing the correct code; it’s more about improving the operational processes, procedures, and even the culture of your organization in order to instill the practices necessary to maintain availability.
The good news is that there are five concrete steps all companies can take to improve application availability and reduce the risk of operational problems.
Step 1. Understand your risks
Many people do not realize how much risk is inherent in their applications. Much of this risk takes the form of technical debt in the code, but some of it stems from deliberate decisions about how the system should operate, decisions whose consequences turned out to be unexpected or “unknown.”
Former U.S. Secretary of Defense Donald Rumsfeld famously once stated that there are “known knowns” and there are “known unknowns,” but that the problems to be concerned about are the “unknown unknowns”—those problems that we don’t even know we don’t know about.
Risk management is about removing the unknowns and making them into knowns. In the case of modern applications, risk management is about identifying areas of concern, labeling them, quantifying them, and prioritizing them, and then addressing the risks that have the highest impact on our business.
To do this, each development team for each service in your application should create and maintain a risk matrix. A risk matrix is a spreadsheet that contains a list of as many issues and potential issues as possible. It’s developed by everyone with a stake in the service brainstorming together to identify as many potential risks as possible. Then each risk is assigned two numbers, which correspond to the level of:
- Severity: How serious a problem would it be for our business if this were to happen?
- Likelihood: How likely is this to occur?
A risk can have a high severity, but a low likelihood, meaning that it isn’t likely to happen, but if it does, the impact would be significant. It can have a high likelihood, but a low severity, which means the risk is more than likely to occur but won’t be a serious problem.
The most concerning risks are the ones that have a high likelihood and a high severity. They pose very serious problems to our business and are likely to happen. These are the highest impact risks.
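As a rough illustration, a risk matrix can be modeled as a list of entries that each carry a severity and a likelihood. The sketch below is only one way to do it: the 1-to-5 scales, the choice to multiply the two numbers into a single score, and the sample risks are all assumptions made for the example, not part of any standard.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    severity: int    # assumed scale: 1 (minor) to 5 (critical)
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)

    @property
    def score(self) -> int:
        # Multiplying the two values is one common convention, not the only one.
        return self.severity * self.likelihood


def prioritize(risks: list[Risk]) -> list[Risk]:
    """Return the risks ordered from highest to lowest combined score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Hypothetical entries, for illustration only.
risk_matrix = [
    Risk("Primary database has no failover", severity=5, likelihood=2),
    Risk("Log disk fills up on busy days", severity=2, likelihood=5),
    Risk("Checkout service times out under peak load", severity=4, likelihood=4),
]

for risk in prioritize(risk_matrix):
    print(f"{risk.score:>2}  {risk.name}")
```

The high-severity, high-likelihood entry sorts to the top, which is the ordering a team would use to decide what to work on first.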
A risk matrix provides a model for each team to prioritize their operational workload to understand what is important to work on and what is not important. Done correctly and consistently, it can be used to prioritize risks across teams and allow management to allocate resources to the greatest issues.
Risk matrices give visibility and prioritization to technical debt and pending problems. They are a great communications tool between development teams and management.
Effective use of risk matrices will help reduce availability issues in your application.
Step 2. Watch your software
Understanding what your software and your operational infrastructure are doing at any given time is critical to maintaining high availability. Application and infrastructure analytics can give you insight into how your application is performing, allowing you to tune and optimize your operational environment, detect and resolve live operational problems, and understand who is using your software and how they are using it.
Used and set up properly, analytics can give early indications of pending availability problems, allowing you to fix an application or operational issue before it becomes an availability problem.
There are many free and paid systems and services that provide application and infrastructure metrics and analytics. All of them have advantages and disadvantages. Free systems are valuable for those who want to build and maintain their own systems, and even customize them to fit their particular needs. Paid systems can offer a more hands-off experience, but often require a significant financial investment. More modern paid systems even offer AI systems that analyze your application performance for you, and give you early indicators of problems that you may not even notice among the depths of data available.
A full system to analyze your software provides the ability to:
- Monitor your system continuously to assess how it is working.
- Examine changes in performance around deployments, to see if a deployment may have introduced a problem, or to verify a problem has been resolved.
- Inform you via notifications when anomalies of various sizes or shapes are detected, allowing you to look at deeper data to determine what might have gone wrong.
- Assist you in resolving an ongoing incident, using data that can help understand why a particular problem is occurring.
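To make the notification idea concrete, here is a minimal sketch of one simple way to flag an anomaly: compare a new measurement against a recent baseline and raise a flag when it deviates by more than a few standard deviations. The function name, the threshold of 3, and the sample response times are assumptions for illustration; real monitoring products use far more sophisticated methods.

```python
from statistics import mean, stdev


def is_anomalous(baseline: list[float], sample: float, threshold: float = 3.0) -> bool:
    """Flag a sample lying more than `threshold` standard deviations from the baseline mean."""
    if len(baseline) < 2:
        return False  # not enough history to judge
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return sample != mu  # flat baseline: any change is a deviation
    return abs(sample - mu) / sigma > threshold


# Hypothetical response times (milliseconds) collected before a deployment.
before_deploy = [120.0, 118.0, 125.0, 122.0, 119.0, 121.0]

print(is_anomalous(before_deploy, 123.0))  # within normal variation
print(is_anomalous(before_deploy, 310.0))  # far outside the baseline
```

Run against measurements taken just after a deployment, a check like this gives an early hint that the deployment may have introduced a problem.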
Analytics are also a great way to monitor service-level agreements (SLAs). This includes both public SLAs (those visible to customers) and internal SLAs (those that describe commitments between and among internal services). Analytics are a great tool for inter-team communications.
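As a small example of SLA monitoring, the sketch below computes a request-based availability percentage and compares it with a target. Measuring availability as successful requests divided by total requests is an assumption here (other definitions, such as time-based uptime, are equally valid), and the request counts and targets are made up.

```python
def availability_percent(successful_requests: int, total_requests: int) -> float:
    """Percentage of requests that succeeded in the measurement window."""
    if total_requests == 0:
        return 100.0  # no traffic means no failed requests
    return 100.0 * successful_requests / total_requests


def meets_sla(successful_requests: int, total_requests: int, target_percent: float) -> bool:
    """True if the measured availability is at or above the SLA target."""
    return availability_percent(successful_requests, total_requests) >= target_percent


# Hypothetical month of traffic: 1,000,000 requests, 588 of which failed.
ok, total = 999_412, 1_000_000
print(f"{availability_percent(ok, total):.4f}%")
print(meets_sla(ok, total, target_percent=99.9))   # a looser internal target
print(meets_sla(ok, total, target_percent=99.95))  # a stricter public target
```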
Step 3. Reduce your technical debt
Once you have analytics in place and have identified your technical debt and other problems via your risk matrix and other tools, you need to evaluate and reduce your highest-impact problems. Knowing what your problems are is great, but it doesn’t help if you don’t work on reducing those problems.
If you have a high-severity, high-likelihood risk on your matrix that is driving availability issues, it must be fixed. But fixing it doesn’t necessarily mean rewriting to remove the risk. You can resolve the availability issue by reducing either the severity or the likelihood of the risk.
In other words, if you can’t easily remove an issue that’s causing you problems, then either make the issue happen less often—so that it’s not a frequent source of concern—or reduce the impact of the problem when it does occur by reducing the severity. Either way, the end result is that the problem is no longer a major driver. It may still be a recognized risk, but the reduced frequency or reduced impact makes it no longer a critical concern.
Having a regular focus on technical debt helps keep availability in line. But be careful you aren’t looking for perfection. Your goal should never be to remove all technical debt, and hence remove all risk. Unless you are building the control software for an airplane, rocket, or similar system, you need to balance effort with the impact of the problem. Focusing on reducing technical debt too far may indicate that you are spending too much time focusing on “perfecting” software at the cost of some other business opportunity.
Step 4. Automate recovery as much as possible
When an incident does occur, how long it takes to recover can have a huge impact on your overall application availability. It’s important to recover fast, but it’s also important to correctly diagnose the problem and take steps to ensure it doesn’t occur again.
When an availability incident happens, the response generally involves the following steps:
- You notice that a problem is occurring (either you detect the problem, or a customer reports the problem).
- You analyze what’s causing the problem.
- You roll out a remediation to reduce or eliminate the problem.
- You implement a permanent fix if necessary.
- You hold a post mortem on the episode.
This same sequence occurs every time there is an incident, and the process takes time. The time between when the problem occurs, or when it is first noticed, and when a remediation is put in place to remove the problem is the time to repair; averaged across incidents, it is called the mean time to repair (MTTR). The longer your MTTR, the lower your availability. Because humans are involved in diagnosing and fixing the problem, your MTTR can be quite long, which hurts customer satisfaction.
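The relationship between repair time and availability can be worked through with a little arithmetic. The sketch below assumes each incident is recorded as a (noticed, remediated) pair of timestamps and that the service is fully unavailable for that whole interval; both are simplifications, and the incident data is invented.

```python
from datetime import datetime, timedelta

Incident = tuple[datetime, datetime]  # (noticed, remediated)


def mean_time_to_repair(incidents: list[Incident]) -> timedelta:
    """Average time from noticing a problem to remediating it."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


def availability(period: timedelta, incidents: list[Incident]) -> float:
    """Fraction of the period that was not spent in an incident."""
    downtime = sum((end - start for start, end in incidents), timedelta())
    return 1 - downtime / period


# Hypothetical incidents in a 30-day window: 45, 90, and 15 minutes long.
incidents = [
    (datetime(2024, 3, 2, 14, 0), datetime(2024, 3, 2, 14, 45)),
    (datetime(2024, 3, 11, 3, 30), datetime(2024, 3, 11, 5, 0)),
    (datetime(2024, 3, 25, 9, 15), datetime(2024, 3, 25, 9, 30)),
]

print(mean_time_to_repair(incidents))                       # 0:50:00
print(f"{availability(timedelta(days=30), incidents):.4%}")  # 99.6528%
```

Halving the repair time of each incident in this example would halve the downtime, which is why shortening MTTR pays off directly in availability.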
However, some types of problems are known in advance, and the process to fix them can be quick and automated. By automating the repair of these types of problems, you can dramatically improve your MTTR.
A classic example of an automatable repair is when a computer instance goes offline. This can happen due to a software problem, a network problem, or another cause. But monitoring software can detect when the instance stops responding, and the instance can be immediately rebooted. Or, in the cloud, the instance can be terminated and replaced with a new instance. This can occur automatically. Because a human doesn’t have to be involved, your MTTR for this class of problem can be reduced, which can improve your availability markedly.
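A minimal sketch of that kind of automated repair might look like the following. It runs one pass of a watchdog over a fleet of instances and replaces any instance that has failed several consecutive health checks. The health check and the replacement action are passed in as plain functions because they depend entirely on your environment; the names `recover_unhealthy`, `fake_health_check`, and `fake_replace`, and the three-failure threshold, are all hypothetical.

```python
from typing import Callable


def recover_unhealthy(
    instances: list[str],
    is_healthy: Callable[[str], bool],
    replace: Callable[[str], str],
    failures: dict[str, int],
    max_failures: int = 3,
) -> list[str]:
    """Run one watchdog pass and return the updated list of instances."""
    updated = []
    for instance in instances:
        if is_healthy(instance):
            failures[instance] = 0  # reset the consecutive-failure count
            updated.append(instance)
            continue
        failures[instance] = failures.get(instance, 0) + 1
        if failures[instance] >= max_failures:
            # Too many failed checks in a row: swap in a fresh instance.
            failures.pop(instance)
            updated.append(replace(instance))
        else:
            updated.append(instance)
    return updated


# Stand-ins for a real health check and a real provisioning call.
down = {"web-2"}

def fake_health_check(instance: str) -> bool:
    return instance not in down

def fake_replace(instance: str) -> str:
    return instance + "-new"

fleet = ["web-1", "web-2"]
failure_counts: dict[str, int] = {}
for _ in range(3):  # three consecutive monitoring passes
    fleet = recover_unhealthy(fleet, fake_health_check, fake_replace, failure_counts)
print(fleet)  # ['web-1', 'web-2-new']
```

Requiring several consecutive failures before acting is a design choice that avoids replacing an instance over a single dropped health check.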
Step 5. Try to break things regularly
The best way to keep your application operating is to try and break it regularly.
Yes, that’s right. You heard me correctly.
The operators of the biggest applications in the world test their resilience to problems by trying to break their applications on a regular basis.
The idea is this: Your software will fail. But do you want it to fail in the middle of the night or at a critical time operationally? Or would you rather have it fail at a more opportune time, with your engineers looking on and ready to detect and fix the problem more quickly?
In either case, you gain valuable experience on how your application operates. In the first case, you provide a bad experience and potentially long-lasting damage to your customers while you try to figure out what’s wrong with the application. In the second case, you know what caused the problem (you caused it) and you can quickly fix it. Your learnings are the same, but the costs of the lessons are far less.
There are two common ways to accomplish this production operation testing. The first is via an exercise called a game day, which is a scheduled time when you inject specific failures into your operational infrastructure in order to see how the problem manifests and how quickly you can detect and fix it. A common game day test scenario, for example, is to bring down an entire data center to see if your application can fail over to a backup data center.
The second common method of production operation testing is called chaos testing. Chaos testing involves having a software system operating that, randomly and unpredictably, breaks parts of your system on a regular basis. This might involve crashing a server, breaking a network link, or taking a load balancer offline. Chaos testing is a great way to test automated recovery mechanisms and prove the safety and efficacy of your recovery processes.
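In outline, a chaos testing tool boils down to a loop that occasionally picks a target at random and breaks it. The sketch below shows a single round of such a loop. It is a toy under stated assumptions: the function `chaos_round`, the 20 percent default probability, and the failure actions (which here only record what they would have done) are hypothetical and not taken from any real chaos testing product.

```python
import random
from typing import Callable, Optional


def chaos_round(
    targets: dict[str, Callable[[], None]],
    rng: random.Random,
    probability: float = 0.2,
) -> Optional[str]:
    """With the given probability, break one randomly chosen target.

    Returns the name of the target that was broken, or None if nothing was.
    """
    if not targets or rng.random() >= probability:
        return None
    name = rng.choice(sorted(targets))
    targets[name]()  # run the failure action for the chosen target
    return name


# Stand-in failure actions that only log; real ones would affect infrastructure.
broken: list[str] = []
targets = {
    "app-server": lambda: broken.append("crashed app-server"),
    "network-link": lambda: broken.append("cut network-link"),
    "load-balancer": lambda: broken.append("took load-balancer offline"),
}

rng = random.Random()  # pass a seeded Random for repeatable experiments
for _ in range(10):
    victim = chaos_round(targets, rng)
    if victim is not None:
        print(f"injected failure into {victim}")
```

Pairing a loop like this with automated recovery lets you confirm, during working hours, that the recovery mechanisms actually restore the broken component.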
In either case, the goal is to identify problems in a controlled manner, learn from the errors, and improve the quality of your application to be able to self-repair from these failures. The twin goals of both approaches are to improve your operational reliability and application availability.
Improve processes, improve availability
Improving application availability is not about striving for perfection or eliminating every risk. It is much more about improving your operational processes: working to reduce the severity and likelihood of problems, closely monitoring applications and infrastructure, keeping technical debt in check, automating recovery mechanisms, and regularly putting those recovery mechanisms to the test. Follow these steps, and your application availability will be markedly improved, your customers will be happier, and those happier customers will mean more business for your company.