Kurt is a software engineer focused on making products that work for people. He also loves Aikido, writing graphics code, playing piano, cooking and reading every book he can get his hands on.
A Journey to Get Things Done And Understand Why
Anyone, even a slow-moving company, breaks things from time to time. If you really foul up, you can break something integral that is live to customers. Some would call this a Severity One, or a Sev1.
This post is not about how to deal with the problem, but how to deal with the aftermath of a problem. Companies deal with problems according to their culture. In times of stress, those at the top are the arbiters of not just how to deal with problems but also how to course-correct, so people don’t just run into another brick wall further down the road.
The most important thing that I’ve found is this:
The outage/severe bug/leak was caused by the current situation.
The situation in which the incident occurs is made up of a large number of factors, which can include:
Much of this insight can be revealed just by leadership introspecting, but at some point you will have to look outward. At that point, it is a good idea to search for answers with egoless and blameless questions.
Make your inquiry about what went wrong, not who went wrong.
Two things I like to keep in mind when trying to understand what happened:
Once you know what’s wrong, the first instinct can be to talk about behavior in a negative light. Try to avoid this.
Frame discussions about the incident as a situation we can improve and push developers to brainstorm ways to make the system better.
Gather developers into a postmortem meeting and let them know what happened. Frame it in the passive voice to avoid assigning blame, since those involved have already either blamed themselves or mentally shifted that blame. Let people throw out ideas (but keep the discussion from digressing). Gather the highest-leverage ideas (impact / time) and create tech debt tasks to be accomplished.
It is important to recognize that these typically break into two categories: preventative and reactive.
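As a rough illustration of the impact / time idea, here is a minimal sketch that ranks postmortem ideas by leverage. The `Idea` class, the 1 to 10 impact scale, the effort-in-days unit, and the sample ideas are all hypothetical, chosen only for this example; use whatever estimates your team already trusts.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    """A postmortem suggestion with rough estimates (hypothetical fields)."""
    name: str
    impact: float     # estimated benefit on an assumed 1-10 scale
    time_days: float  # estimated effort in days (assumed unit)

    @property
    def leverage(self) -> float:
        # Leverage as described in the post: impact divided by time.
        return self.impact / self.time_days


def rank_by_leverage(ideas):
    """Return ideas sorted from highest to lowest leverage."""
    return sorted(ideas, key=lambda idea: idea.leverage, reverse=True)


# Made-up ideas and estimates, purely for illustration.
ideas = [
    Idea("Add pre-deploy smoke test", impact=8, time_days=2),
    Idea("Rewrite deploy pipeline", impact=9, time_days=15),
    Idea("Alert on failed form submissions", impact=6, time_days=1),
]

ranked = rank_by_leverage(ideas)
for idea in ranked:
    print(f"{idea.name}: leverage {idea.leverage:.1f}")
```

The numbers are guesses by nature, so treat the ranking as a conversation starter in the meeting rather than a verdict.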
Both have their trade-offs.
Preventative measures safeguard developers from making poor decisions and short-circuit the process when they do. Examples include well-known methods like unit testing, linting, and build validations.
Pro: A mistake prevented is one you don’t have to fix.
Con: They tend to pad the development process, especially if, for example, the developer isn’t notified in a quick and obvious way that their unit tests, linting, or deployment failed.
Reactive measures allow for quick and graceful recovery after errors have been made. This could be messages from a log aggregator about outstanding errors, emails from AWS about a network partition, or data integrity messaging in Slack from an internal bot.
Pro: Corrective action, especially if cited in the message, can be swift and non-disruptive. If you break a form in the UI and fix it 5 minutes later, it is very unlikely to have hurt users’ trust in the platform.
Con: If these measures become too numerous or too frequent, they almost completely lose their value. At that point, tribal knowledge is the only thing that allows us to know what is actionable.
Choose a combination of preventative and reactive measures that will both prevent and help you recover from similar incidents in the future, and prioritize them.
We all know that throughput on features is important, especially if you’re optimizing for market fit. That being said, what you don’t prioritize sends messages as loud as the tasks that you do prioritize.
Even if it’s only 20% or 10% of a sprint, make sure that your teams know that the feedback they provided you is being turned into actionable tasks and results. Just patting yourself on the back for responding to and analyzing an incident is pretty easy. Getting recovery measures implemented, finding ways to track that they work, and recognizing the people who came up with them is hard.
Finally, review what you did to contribute to the situation that caused the incident.
Reflect on what you did and did not do (inaction is just as important a cause as action) to contribute to it, and think of ways to improve for the future. Don’t apologize for what happened, but do tell the incident stakeholders what you can do to improve the situation with regard to the cause of the incident.
Your mileage may vary on whether and how you present this information.
Remember, the key to all of this is to not let incident reaction be an emotionally driven affair. That’s what happens when we don’t have a clear and concise strategy for dealing with these situations.