Learning from Your Incident Response to Improve Availability



No matter how smoothly your services normally run, outages can happen to the best of us. The truth is that occasional incidents are unavoidable. Dealing with those incidents is both an art and a science, and there are many products, systems, and procedures that can help you build incident response processes that reduce the impact of incidents when they do happen to your application.

But what about after the incident? What then? Once an incident is finished, it’s just as important to follow up on why the problem occurred, along with how the incident was managed.

This is what we refer to as the postmortem phase of an incident.

The value of a postmortem 

The postmortem is a critical but often overlooked step in improving the availability of your modern digital application.

After all, the worst type of incident is one that has happened before and was entirely avoidable. It’s hard enough to maintain high availability when unknown things happen. It’s impossible to maintain high availability when the same problems keep happening over and over again.

We must learn from our failures. This is why the postmortem is so important.

 

But what’s the best way to handle a postmortem? Having seen a fair number of them handled at the different companies where I’ve worked, I can tell you that there are as many ways to conduct an incident postmortem as there are postmortems needing to be done.

In other words, there really are no good, reliable, or defensible best practices for doing a postmortem.

Startup Jeli brings new approach

This is why I was so excited to first hear about Jeli.io, a startup that strives to solve the problem of developing best practices for postmortems by creating a richer post-incident analysis process.

How is the company planning on doing this? Simple: by using data relationships. Jeli’s secret sauce is data and data relations. It brings in data about the incident from a variety of sources (Slack, GitHub, incident management tools) and allows an investigator to look through this data and create correlations and relationships between the pieces of data. Ultimately, it creates a timeline of actions that occurred during the incident, along with an actionable, related dataset that describes the incident and how it was handled.
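To make the timeline idea concrete, here is a minimal sketch of merging events from several sources into one time-ordered list. This is not Jeli’s API or implementation; the `Event` type, the `build_timeline` function, the source names, and the sample events are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """One observed action during an incident (hypothetical shape)."""
    timestamp: datetime
    source: str   # e.g. "slack", "github", "pager"
    summary: str


def build_timeline(*sources):
    """Merge per-source event lists into a single list ordered by time."""
    merged = [event for events in sources for event in events]
    return sorted(merged, key=lambda e: e.timestamp)


def ts(hour, minute):
    """Helper for the made-up sample data below."""
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


# Made-up sample events, one list per source.
slack = [Event(ts(10, 5), "slack", "on-call asks about elevated error rate")]
github = [
    Event(ts(9, 50), "github", "config change merged"),
    Event(ts(10, 30), "github", "revert merged"),
]
pager = [Event(ts(10, 2), "pager", "alert fired")]

timeline = build_timeline(slack, github, pager)
for e in timeline:
    print(e.timestamp.strftime("%H:%M"), e.source, e.summary)
```

Even in a toy version like this, putting the sources side by side shows the order of events (a change merged, an alert fired, a discussion started, a revert landed), which is the kind of relationship an investigator would then examine more closely.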

This data is very useful in analyzing what caused a particular incident and how to resolve the issues presented. This by itself is valuable. But the real value that I see from Jeli is that this data can then be used for trend analysis in order to help determine root causes, along with unseen (and normally undetectable) faults in either the application or the incident management processes.
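As a rough illustration of what trend analysis across incidents might look like, the sketch below counts how often each contributing factor shows up across a set of postmortems. The incident records, the factor labels, and the `recurring_factors` function are hypothetical; how Jeli actually models or analyzes this data is not described here.

```python
from collections import Counter

# Made-up postmortem records; each lists the contributing factors
# an investigator tagged for that incident.
incidents = [
    {"id": "INC-1", "factors": ["config change", "missing alert"]},
    {"id": "INC-2", "factors": ["config change"]},
    {"id": "INC-3", "factors": ["expired certificate", "missing alert"]},
    {"id": "INC-4", "factors": ["config change", "unclear runbook"]},
]


def recurring_factors(incidents, min_count=2):
    """Return (factor, count) pairs seen in at least min_count incidents."""
    # set() ensures a factor is counted once per incident.
    counts = Counter(f for inc in incidents for f in set(inc["factors"]))
    return [(f, n) for f, n in counts.most_common() if n >= min_count]


recurring = recurring_factors(incidents)
for factor, count in recurring:
    print(f"{factor}: {count} incidents")
```

A factor that keeps reappearing is exactly the "same problem happening over and over again" that a single postmortem, read in isolation, tends to miss.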

The future of incident response

What really excites me is thinking about where all this could eventually lead. Imagine applying machine learning to the dataset, and what that type of analysis could discover, both for a particular company and for the industry as a whole.

Imagine, for instance, being able to determine the most critical aspects of an incident response by examining the mood and sentiment of the participants involved, as inferred by an ML analysis of discussions on Slack and other communication platforms. Imagine being able to properly train and retrain incident responders by examining where in the process they are most effective, and where they are least effective, in bringing incidents to a conclusion.

We need a lot more ML training data before we can do this sort of analysis, but I believe Jeli is off to a good start at gathering this data, while also giving companies more immediate help with their current post-incident evaluations.

Jeli is just starting out, but I believe this is the beginning of a new way of thinking about incident management, and a start down the path of standardized data analysis for incident management. The future value of this is exciting. According to Jeli, its goal is to make the entire post-incident analysis process richer. Definitely a company to watch!

More articles by Lee Atchison:

 

Image by Thomas Breher from Pixabay.