What Model Airplanes Teach Us About Avoiding Application Failures
I learned to fly radio-controlled airplanes when I was a kid, and one of the most important rules I remember was “Always keep your airplane at least ‘two mistakes’ high.” When you are learning to fly a model airplane, especially when you begin to attempt acrobatics, you learn this lesson quickly because mistakes equal altitude. You make a mistake, you lose altitude. As you can imagine, losing too much altitude makes for a very bad day for your airplane. So what does this have to do with avoiding application failures?
Keeping your plane at least “two mistakes” high means staying high enough that you can recover from a second mistake made while you are still recovering from the first. Imagine you’re flying your plane and you make a mistake, so you lose altitude. While you are trying to recover, you have to perform a number of tricky maneuvers, such as leveling the plane out, slowing it down, and turning it into the wind. These are critical tasks you need to perform to save your plane. What happens if you make a mistake while performing these tasks? You need to make sure you are still high enough so that this second mistake doesn’t result in a crash.
This same rule of thumb applies when building highly available, high-scale web applications.
A fatal mistake made in a high-stress moment
Say your application has a problem and your website goes down in the middle of the night. After getting paged, you find yourself in a war room with the impacted developers, product owners, and other team members, trying to figure out what to do. You try one thing, then another, then another, desperately trying to fix the problem that caused the application failure.
This is a high-stress situation, one in which it’s easy for people to make mistakes, including potentially catastrophic mistakes.
I was once in one of these war rooms when an engineer suddenly put their head down on the table and moaned, “Oh no …” You see, this engineer had just typed a command that was designed to fix a problem—but instead of typing the correct command, the engineer typed a command that caused a major failure of a critical database, making the entire situation much worse.
It was at that moment that our struggling “model airplane”—our entire company website—was in serious trouble and headed for a crash.
Keeping your applications “two mistakes” high
So how can you make sure you’re keeping “two mistakes” high when you’re running a modern digital application? To help avoid damaging application failures, start by putting processes, rules, and procedures in place for critical problem scenarios. They should be designed to improve the situation without introducing even worse problems.
- Don’t allow just one person to execute commands. During critical downtime responses, don’t allow a single engineer to execute any commands on any production system. Overly stressed engineers can easily make simple mistakes that can lead to even bigger problems. Instead, require all commands to be reviewed by at least one other engineer before they hit “enter” to execute the command. This simple two-step process can help your team avoid making catastrophic mistakes.
- Design a playbook and use it. Create standard processes and procedures (often called “playbooks” or “runbooks”) for solving common problem scenarios. Using these playbooks during critical periods gives everyone clear steps to follow and reduces the likelihood of your team making additional mistakes.
- Avoid cascading or “double dependent” problems. Double dependent problems are problems that combine to make the situation worse than any one of them would be alone. It’s like leaving your garage door opener in your car in the driveway overnight, then forgetting to lock your car. Either mistake by itself isn’t a big problem, but when they occur together, you are inviting big trouble. These combinations can be a challenge to find, but they are critical to resolve once you locate them, since they can quickly turn small problems into large ones.
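To make the first rule above concrete, here is a minimal sketch of a two-person review gate, written in Python. It is only an illustration under stated assumptions, not a description of any real incident tooling: the names ProposedCommand, ApprovalError, and fake_runner are hypothetical, and the “runner” simply returns a string instead of touching a production system.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class ApprovalError(Exception):
    """Raised when a command is approved or run in violation of the review rule."""


@dataclass
class ProposedCommand:
    """A command one engineer wants to run during an incident (hypothetical type)."""
    command: str
    author: str
    approved_by: Optional[str] = None

    def approve(self, reviewer: str) -> None:
        # The reviewer must be someone other than the person who typed the command.
        if reviewer == self.author:
            raise ApprovalError("author cannot approve their own command")
        self.approved_by = reviewer

    def execute(self, runner: Callable[[str], str]) -> str:
        # Refuse to run anything that a second engineer has not reviewed.
        if self.approved_by is None:
            raise ApprovalError("command has not been reviewed")
        return runner(self.command)


def fake_runner(command: str) -> str:
    # Stand-in for real execution; an assumption made for this sketch.
    return f"ran: {command}"


if __name__ == "__main__":
    cmd = ProposedCommand("restart db-replica-2", author="alice")
    cmd.approve("bob")
    print(cmd.execute(fake_runner))  # prints "ran: restart db-replica-2"
```

The point of the design is that the check lives in the tool rather than in anyone’s memory: a stressed engineer cannot skip the review step, because an unreviewed command simply refuses to run.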
More tips on avoiding application failures
These are just a few ideas. Making sure to keep “two mistakes” high is just one of the lessons on high availability that I discuss in my book, Architecting for Scale, published by O’Reilly Media. If you are building highly scaled web applications and are dealing with availability challenges, this book can help you build the processes and procedures you need to keep your application—and your business—flying high.
More articles from Lee Atchison: