What do model airplanes have to do with avoiding application failures?

This is Tech Tapas Tuesday, a “little bit of tech”.

Today on Modern Digital Business.

{{useful-links-research-links}}

{{about-lee}}

{{architecting-for-scale-ad}}

{{signup-dont-miss-out}}

Transcript
Lee:

What do model airplanes have to do with avoiding application failures?

Lee:

It's Tech Tapas Tuesday, let's go.

Lee:

I learned to fly radio controlled airplanes when I was a kid.

Lee:

And one of the most important rules

Lee:

I remember was: always keep your airplane at least two mistakes high.

Lee:

You see, when you're learning to fly a model airplane, especially when

Lee:

you begin to attempt acrobatics, you learn this lesson quickly

Lee:

because mistakes equal altitude.

Lee:

You make a mistake, you lose altitude.

Lee:

As you can imagine, losing too much altitude makes for a very

Lee:

bad day for you and your airplane.

Lee:

So what does this have to do with avoiding application failures?

Lee:

Well, keeping your plane at least two mistakes high means staying high

Lee:

enough so that you can recover from two mistakes made at the same time.

Lee:

Imagine you're flying your plane and you make a mistake.

Lee:

You lose altitude.

Lee:

While you're trying to recover from the mistake, you have to do a

Lee:

number of tricky maneuvers, such as trying to level the plane out, slow

Lee:

it down and turn it into the wind.

Lee:

These are critical tasks you need to perform to save your plane.

Lee:

What happens if you make a mistake while you're performing those tasks?

Lee:

You need to make sure you are still high enough so that the second

Lee:

mistake doesn't result in a crash.

Lee:

The same rule of thumb applies when building highly available

Lee:

high-scale web applications.

Lee:

Say your application has a problem and your website goes

Lee:

down in the middle of the night.

Lee:

After getting paged, you find yourself in a war room with the impacted

Lee:

developers, product owners, and other team members, trying to figure out what to do.

Lee:

You try one thing, then another, then another, desperately trying

Lee:

to fix the problem that caused the application failure in the first place.

Lee:

This is a high-stress situation,

Lee:

one in which it's easy for people to make mistakes, including

Lee:

potentially catastrophic mistakes.

Lee:

I was once in one of these war rooms when an engineer suddenly put their head

Lee:

down on the table and moaned, "Oh no!"

Lee:

You see, the engineer had just typed a command that was designed to fix

Lee:

a problem, but instead of typing the correct command, the engineer typed a

Lee:

command that caused a major failure of a critical database, making the

Lee:

entire situation substantially worse.

Lee:

It was at that moment that our struggling model airplane, our entire company's

Lee:

application, our entire reason for existence as a company, was in serious

Lee:

trouble and headed for a crash.

Lee:

So how can you make sure you're keeping two mistakes high when you're

Lee:

running a modern digital application?

Lee:

To help avoid damaging application failures, you can start by making sure

Lee:

you have processes, rules, and procedures to use during critical problem scenarios

Lee:

that are designed to help the situation without introducing even worse problems.

Lee:

For example:

Lee:

First, during critical downtime responses, don't allow a single lone engineer to

Lee:

execute commands on any production system.

Lee:

Overly stressed engineers can make simple mistakes that can lead to

Lee:

even bigger problems. Instead, require that all commands are reviewed by at

Lee:

least one other engineer before they submit the command to be executed.

Lee:

This simple two-step process can help your team avoid making catastrophic mistakes.
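
The two-step review rule described here could be enforced in tooling rather than left to memory. The following is a minimal sketch, not a description of any tool mentioned in the episode: the `ProposedCommand` class, the `execute` function, and the `runner` callable are hypothetical names introduced only for illustration. The idea is that a command cannot run until someone other than its author has approved it.

```python
from dataclasses import dataclass, field


@dataclass
class ProposedCommand:
    """A production command awaiting review (hypothetical structure)."""
    command: str
    author: str
    approvers: set = field(default_factory=set)

    def approve(self, reviewer: str) -> None:
        # The author cannot act as their own second pair of eyes.
        if reviewer == self.author:
            raise ValueError("author cannot approve their own command")
        self.approvers.add(reviewer)

    def can_execute(self) -> bool:
        # At least one engineer other than the author must have approved.
        return len(self.approvers) >= 1


def execute(proposal: ProposedCommand, runner):
    """Run the command via `runner` only after a second engineer approved it."""
    if not proposal.can_execute():
        raise PermissionError("command needs review by a second engineer")
    return runner(proposal.command)
```

In use, one engineer would create the `ProposedCommand`, a second would call `approve`, and only then would `execute` hand the command string to whatever actually runs it. Keeping the check in code means a stressed engineer gets an error instead of a damaged database.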

Lee:

Second, create standard processes and procedures for solving

Lee:

various common problem scenarios.

Lee:

Often these are called playbooks or runbooks. Make sure to use these

Lee:

playbooks during critical periods.

Lee:

This gives everyone clear steps to follow and reduces the likelihood of your team

Lee:

making additional mistakes.
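
One way to make a playbook easy to follow under stress is to store it as an ordered list of steps that tooling can walk through one at a time. This is only an illustrative sketch: the `Step` class, the `RUNBOOKS` table, the `next_step` helper, and the "database-failover" scenario with its steps are all made-up examples, not content from the episode.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    """One action in a runbook (hypothetical structure)."""
    description: str
    requires_review: bool = False  # flag risky steps for a second engineer


# Example scenario and steps are invented purely for illustration.
RUNBOOKS = {
    "database-failover": [
        Step("Confirm the primary is unreachable from two hosts"),
        Step("Promote the replica", requires_review=True),
        Step("Verify the application reconnects"),
    ],
}


def next_step(scenario: str, completed: int) -> Optional[Step]:
    """Return the next step to perform, or None when the runbook is done."""
    steps = RUNBOOKS[scenario]
    if completed >= len(steps):
        return None
    return steps[completed]
```

Calling `next_step("database-failover", 1)` returns the "Promote the replica" step, which is flagged as needing review. Presenting one step at a time keeps the team on the agreed path instead of improvising.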

Lee:

And finally, look for and avoid cascading or double dependent problems.

Lee:

A double dependent problem is a set of problems that combine to make the

Lee:

situation worse than any of the individual single problems were themselves.

Lee:

It's like leaving your garage door opener in your car in the driveway overnight,

Lee:

then forgetting to lock your car door.

Lee:

Either of those two mistakes by itself isn't a big problem.

Lee:

But when they occur together, you're inviting big trouble.

Lee:

Finding double dependent problems can be a challenge, but when you do locate

Lee:

them, they're critical to resolve since they can cause small problems