Observations on Blame Cultures and the S3 Outage

One would think this was scripted, the way it happened, but I can assure you that was not the case.  I recently finished reading a book (a really good book on blame cultures; I highly suggest picking up a copy:  Here).  The day after I finished it, my tech social media feeds were aflame with mentions of problems with AWS (specifically with the S3 service in the US-East-1 region).  Much has been said about the need for proper application architecture using cloud building blocks, and much reflection on whether the cost of that resiliency is justified when weighed against a significant outage.  I fully expect there is plenty of discussion happening within organizations about these very factors.

I found myself not necessarily focused on the incident itself.  I was more interested, strangely enough, in any sort of public post-mortem that would be brought forth.  Having read many DevOps books recently, I’m no stranger to the concept of a public post-mortem, but I can guess that for many private organizations this could seem like a foreign concept.  When an incident occurs, many in the organization just want it to go away.  There’s an extreme negative connotation associated with incidents and incident management in many organizations.  To me, post-mortems offer great insight into how an organization treats blame.

Recently, I’ve been doing quite a bit of research into how organizations, specifically IT organizations, deal with blame.  Now, in Amazon’s case, they’ve listed “human error” as a contributing cause of the outage.  What comes after that in the post-mortem goes to show how Amazon handles blame internally.  The two quotes below, taken from the post-mortem (available here:  https://aws.amazon.com/message/41926/), are telling:

At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.

We are making several changes as a result of this operational event. While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future. We are also auditing our other operational tools to ensure we have similar safety checks.

I’ve put my key terminology from this event in bold.  Notice that outside of one mention of the authorized S3 team member, every other mention has to do with the tools used to perform the action or with the process that would have helped prevent the issue.  In this case, the root cause is NOT the operator who entered the command; it was the process that led to the input and the associated actions the system took based on the runbook operation.
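
The remediation in that second quote is worth lingering on: every fix is aimed at the tooling, not the person.  As a rough illustration of the kind of safeguards AWS describes (this is purely my own sketch, with made-up names and numbers, not Amazon’s actual tooling), a capacity-removal command could enforce a minimum-capacity floor and throttle how quickly servers are pulled:

```python
# Hypothetical sketch of a "guarded" capacity-removal tool, loosely inspired by
# the safeguards described in the AWS post-mortem. The names, numbers, and
# structure here are my own invention, purely for illustration.

import time


class CapacityError(Exception):
    """Raised when a removal request would violate a safety check."""


def remove_capacity(active_servers, to_remove, min_required,
                    batch_size=2, pause_seconds=30):
    """Remove servers from a subsystem without violating its capacity floor.

    active_servers -- list of server IDs currently in service (mutated in place)
    to_remove      -- server IDs the operator asked to remove
    min_required   -- minimum number of servers the subsystem needs to function
    batch_size     -- servers removed per batch (keeps removal slow)
    pause_seconds  -- delay between batches so problems surface early
    """
    # Safeguard #1: refuse any request that would take the subsystem below
    # its minimum required capacity, no matter what the operator typed.
    remaining = len(active_servers) - len(to_remove)
    if remaining < min_required:
        raise CapacityError(
            f"Refusing removal: only {remaining} servers would remain, "
            f"but the subsystem requires at least {min_required}"
        )

    removed = []
    for i in range(0, len(to_remove), batch_size):
        batch = to_remove[i:i + batch_size]
        for server in batch:
            active_servers.remove(server)
            removed.append(server)
        # Safeguard #2: remove capacity slowly, in small batches, rather than
        # all at once, so an operator (or a monitor) can catch a mistake.
        if i + batch_size < len(to_remove):
            time.sleep(pause_seconds)
    return removed
```

The specific numbers don’t matter; the point is that a fat-fingered input gets absorbed by the tool’s safety checks instead of being handed back to the operator as a firing offense.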

So, why the long-winded breakdown of the S3 post-mortem?  It got me thinking about all the organizations I’ve worked for in the past, and it made me realize that when it comes to any sort of employment change, especially one that involves on-call duty or primary ownership of production systems, I’ve got a perfect question to ask of a potential employer: ask about their last major internal incident.  While you might not get a full post-mortem, especially if the organization doesn’t believe in the benefit of such a document, key information about the incident and how it was handled should become immediately apparent.  If the incident was attributed to human error, ask how the operator who performed the action was treated.

Unfortunately, in many IT organizations the prevailing thought is that a root cause can easily be established, the operator was simply incapable of performing their role, and immediate termination is a typical reaction to the event.  If not immediate termination, you can rest assured that the organization will forever assign a hidden asterisk to your HR file and the incident will always be held against you.  Either way, this sort of thinking ends up causing more harm to the organization in the long term.  Sure, you think you’ve removed the “bad apple” from the mix, but there will be collateral damage in the ranks of those who still must deal with the imperfect technical systems that need their “care and feeding” to function optimally.

Honestly, if this is the sort of response you get from a potential employer, I would end the interview right there and have no further discussions with that organization.  Based on their response to the incident, you can easily see that:

  • The organization has no real sense of, or appreciation for, the fact that the technical systems IT staff work with on a day-to-day basis are extremely complex
  • Those systems are also designed such that updating or changing them is considered a mandatory operational requirement
  • When change occurs, you can never guarantee the desired outcome 100% of the time. Failure is inevitable.  All you can do is mitigate the damage failure can do to the system in question.
  • Reacting to the incident by pinning the entire root cause on the operator is a knee-jerk reaction and prevents you from ever getting to the real root cause(s) of your incident
  • Levying a punishment of termination on the operator in question causes a ripple effect through the rest of the staff. The staff are now less likely to accurately report incident information, out of fear (for their employment, of being branded a “bad apple”).  This obscures root causes, which ultimately leads to more failures in the system.

Are you sure you want to work for an organization that prides itself on “just enough analysis” and breeds a culture of self-interest and self-preservation?  No, me neither.  Culture matters, both to an organization and to those seeking opportunities within it.  It’s best to figure out what the culture really is before you realize you’ve made a major mistake working for an organization that loves to play the name/blame/shame game.

