When Things Go Wrong

Sometimes things go wrong at work. Bad things happen. And our jobs demand that we endeavour to avoid such things happening again.

In my job, system outages are bad things. And when they do happen, we apply techniques like Five Whys to get to a root cause. Usually not a single cause, but many causes. It is usually a mistake to assume that there is only a single cause triggering the problem - the world is a complex place. Any decent investigation should be able to come up with a number of different recommendations addressing various underlying causes.

I came across this article Why You’ve Never Been In A Plane Crash, which made me think. Trust me: The article is worth reading. (I hope you'll return to this page afterwards.)

The article highlights one very important aspect of post mortems (the technology world's equivalent to "crash investigations"): The attitude.

The attitude is a reflection of the purpose of the investigation: the purpose is to learn. Not to blame, assign liability or culpability. No finger pointing allowed. Only by learning the underlying causes (I deliberately use the plural here) can we come up with ways to prevent bad stuff from happening again. Or at least: ways to reduce the risk.

To effectively do that, we need people to be open about mistakes - even about their own mistakes.

I was very impressed by the controller in the accident in the article: Once she came to the realisation that she had made a huge mistake, she had the personal integrity to help the investigation - despite knowing that her mistake had undoubtedly killed people (the final death toll, which was not known at the time, came to 35 lives).

Although I cannot be sure, I think the reputation of crash investigations helped here: If the investigations had a track record of assigning blame (leading to punishments), then she might have been less likely to come forward so readily. People with less integrity might even have attempted to cover things up or hamper the investigation in other ways. Or to run away. And the world would have been a less safe place as a result.

In other words: we learned something from the crash because the investigation did not try to assign blame or point fingers. Yes: we might have learned anyway, but it would have taken longer.

The same applies in the IT world. Especially where it may be far easier for people in central positions in a forever-understaffed-and-overworked department to hide things. We do not keep records the way air traffic controllers do - things are scattered about and (with the right privileges) highly placed people may be able to delete log files, emails, chat history and more.

To make things worse, every company is different: to understand a failure, you need a good understanding of the systems that company uses. Which invariably includes the software they write for themselves. So using "outside investigators" is a pipe dream. Unless you have deep pockets full of money which you somehow failed to spend on good people & processes instead.

It doesn't matter that (generally) lives are not at risk in my line of work. We still want to get to the underlying causes and address them.

So I take pride in deep investigations: They are learning opportunities.

When Things Go Wrong Recursively

There is a nice meta level to this which I find especially curious:

  • Processes sometimes go wrong (like air traffic control). Sod's Law and Murphy's Law apply to everything.

  • We know we want to learn from failures, so we investigate and come up with recommendations.

  • The investigation itself is a process. Thus it is prone to failure. Sod's Law and Murphy's Law come into effect again!

  • So the investigation process was changed to mitigate one failure mode (people not being open) by changing the attitude. Which undoubtedly resulted in improved investigations & recommendations.

So: we are working on improving the processes by which we improve the processes.

But they are prone to failure. So we have to work on improving the processes by which we improve the processes by which we improve the processes.

But they are prone to failure. So we have to work on improving the processes by which we improve the processes by which we improve the processes by which we improve the processes.

Repeat ad nauseam. You can see where this is going: Thanks to Sod's Law & Murphy's Law applying at every level, the recursion is unbounded.

I can't help but think this infinite recursion is itself a sign of something-gone-wrong, and we can learn something from that... There's something there I cannot quite put my finger on...

I'll keep thinking. And cursing the recursion.