Over the last 10 years, I have dealt with numerous site outages at highly trafficked, consumer-facing sites like cnet.com, cbsnews.com and zoosk.com. A site outage can be a very stressful, unpleasant experience, particularly when you are in the thick of investigating the issue. Unfortunately, as a VP of Engineering, outages come with the territory. In this post, I’ll offer some guidelines on how to deal with an outage and what to do afterwards.
Maintain your Perspective
If you don’t experience a site outage or service degradation at some point, then you probably aren’t shipping fast enough. For a long time, Facebook espoused the slogan “Move fast and break things”. They eventually updated that slogan to promote more stability. The point is that unless you stop making production changes entirely and turn off all site traffic, there is always a risk of a site outage. At the same time, your company’s executives and employees have entrusted you and your team to minimize disruptions to your company’s revenue stream. So, we should take outages seriously.
An outage is not going to be the end of your career, and I have never heard of a VP Eng getting fired over a single outage. I have actually found that outages (once they are fixed) can be great learning experiences and opportunities for your team to improve your infrastructure, processes and organizational structure. So, first and foremost, remain calm. In the heat of battle, be very careful about how you act and what you say. Your team will look to you to see how to react, and any negative actions will be magnified many times over. Maintaining a composed demeanor will go a long way.
What is the Scope?
First, you need to understand the nature of the outage. Here are some questions to consider:
- How did you hear about it? Hopefully, you have a monitoring system in place that is continuously testing your service for basic availability checks (Pingdom, Dynatrace, etc.). If the reports of an outage are coming from customers or employees, it’s important to understand where they are located, how they are accessing the service and what behavior they are observing. If the report is from a single customer or employee, it could be an isolated issue.
- Has the outage been verified? Along the lines of the first question, have you been able to verify the outage report? I have seen cases where a site monitoring service actually had an issue and missed a check. Once, multiple employees reported that our site was down, only to discover that the office lost internet connectivity. Granted, these are obvious examples, but it’s important to ensure your outage reports aren’t false before mobilizing the team.
- How extensive is the outage? Once the outage has been verified, you want to determine how extensive it is. Are all users experiencing it, or only a subset? If a subset, are they segmented by geographic location, user type (paid versus unpaid), device type (mobile versus web), etc.? Does the outage affect all features of an app, or only a few? Your monitoring system should also be able to help isolate this.
- How long has it been happening? Your monitoring system should record the time of the first failure. If the outage report is coming from users, when did they first experience the issue?
- What general systems are involved? Does this seem like a network connection issue, are your web servers all overloaded, or did a database go offline? Your monitoring tools should give some indication of what systems may be causing the issue. This can sometimes be difficult to isolate initially, but try to get some clarity.
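To make the first two questions concrete, here is a minimal sketch of the kind of availability probe a monitoring service runs every minute or so. The URL, timeout and latency threshold are hypothetical, and the `opener` parameter just makes the probe easy to exercise without a live endpoint:

```python
import time
import urllib.request
import urllib.error

def check_endpoint(url, timeout=5, opener=urllib.request.urlopen):
    """Run a single availability probe against `url`.

    Returns (healthy, detail). "Healthy" here means an HTTP 200 within
    a hypothetical 2-second latency budget; tune both to your service.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            elapsed = time.monotonic() - start
            healthy = resp.status == 200 and elapsed < 2.0
            return healthy, f"status={resp.status} latency={elapsed:.2f}s"
    except (urllib.error.URLError, OSError) as exc:
        # Connection refused, DNS failure, timeout, etc.
        return False, f"error={exc}"
```

Running a probe like this from several geographic vantage points (which is what Pingdom and similar services do for you) also helps answer the verification question: one failing location suggests a network issue, all locations failing suggests the service itself.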
Assemble the Team
Once you have an understanding of the scope of the issue, you can determine who needs to be involved to fix it. You want a knowledgeable representative from every area that seems to be related. At this point, you don’t have a root cause. Your monitoring system is hopefully indicating which subsystem(s) are affected, but that is not a guarantee the issue lies with that system exclusively. For example, if your monitoring system reports that a database is overloaded, you would want to involve the DBA. However, they may quickly discover that the database is overloaded because a recent code change is sending too many queries to the database. So, the DBA won’t be able to fix the issue, and a developer needs to be contacted.
Once you have an idea of who needs to be involved in issue resolution, make sure they are notified. The on-call Ops person would likely have already been paged. You may also have an escalation tree in place, which will indicate who from other teams should be available. Work through that list and bring those representatives online.
For simplicity, I’ll refer to the individuals involved in the issue investigation as the issue resolution team. You should designate one person as the issue resolution manager. This is usually a leader who has a holistic understanding of your systems. They function as the quarterback, determining tests to run, forming a plan and sharing updates across the issue resolution team. The issue resolution manager can be the VP Engineering, but the role is usually better performed by another manager on the team, perhaps from DevOps or engineering. As VP Eng, you need to communicate status to the rest of the organization, ensure the right people are involved and coordinate activities with other departments. You also provide cover for the issue resolution team so that they can remain focused; otherwise, anxious execs may ping them directly with questions or requests for updates.
Keep in mind that if you are experiencing an outage, then time is of the essence. I usually err on the side of including more people rather than fewer on the issue resolution team. The key at this point of the outage resolution process is finding and fixing the root cause. Having more people gathering data, sharing observations and providing theories is usually helpful. I try to assemble a core team, with a representative from each subsystem that may be impacted – like development, sys admin, network and data. If you host in the cloud, this may just be someone from DevOps and the related software engineers. I also try to give other people who might be pulled into the investigation a heads-up – something like “We are having a site issue. We are still gathering data, but may need your assistance to resolve. Are you near a computer, and can you monitor the situation?”
Once you have everyone online who can investigate/address the issue, you need to establish a communication channel. The mechanism for doing this should be established beforehand. At minimum, there should be a group chat channel open for sharing information and posting updates. This can be on IRC or a collaboration tool like Slack. I have also found a telephone conference call to be very efficient. I realize engineers dislike this, but most people can talk faster than they can type. Usually, the phone teleconference is used for coordinating tasks, providing updates and making plans. The digital channel is used for posting technical information, log output, links to graphs, commands run, etc.
Once the outage resolution team has a plan and is executing on their investigation, you should communicate the outage to the rest of the organization. I usually employ email distribution groups for the teams that need to be aware. I would include at minimum – Exec team, customer service, product, marketing, legal and engineering. Of course, if your company is small, you can just email the “all” employee list. You can post in company collaboration channels as well. I try to keep the communications short during the issue investigation process. Key points are scope of the outage, timeframe, teams involved, latest status of the investigation and planned time for next update. These communications are important so that everyone who needs to know is aware of the outage. Also, proactive communications will prevent your team from being distracted by folks looking for an update.
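To keep those status updates short and consistent under pressure, it helps to have a template ready before the outage. A throwaway sketch of one; the field names are mine, not any standard:

```python
def outage_update(scope, started_at, teams, status, next_update):
    """Render the key points of an outage status update as one short message.

    The fields mirror the points above: scope of the outage, timeframe,
    teams involved, latest status, and planned time for the next update.
    """
    return (
        f"[Outage update] Scope: {scope}. "
        f"Started: {started_at}. "
        f"Teams engaged: {', '.join(teams)}. "
        f"Status: {status}. "
        f"Next update: {next_update}."
    )
```

The output can be pasted into an email to the distribution groups or posted to a collaboration channel, so everyone gets the same facts in the same shape each time.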
Finally, try to keep the issue resolution channel clear and focused on issue investigation. Sometimes other department heads may find your outage communication channel and use that as a means of gathering more information about the outage or an expected time to resolution. This can distract your investigation team, as they try to be helpful. Politely intercede in these cases, and take the conversation offline.
As the issue investigation proceeds, there is a possibility that the team runs out of ideas to discover the root cause of the issue. It is the issue resolution manager’s job to keep the team moving forward, directing more data collection where needed and postulating causes. If the team runs out of ideas, and the outage is dragging on, here are some additional suggestions you can make:
- Was there a change recently that would explain this behavior? In over half the outages I have managed, the root cause could be traced to a code change or planned infrastructure maintenance that occurred within the last 24 hours.
- Are there tests we can run to better understand the behavior? Since the site is down, there isn’t much harm in making other changes to test out a hypothesis. What additional data points might help provide a clearer picture of why the outage is occurring?
- Is there a way to restore some functionality to your apps, by circumventing the system causing the outage, or removing a dependency on it? If the system causing the issue provides content for just a portion of your apps, can you temporarily remove/turn off this content in order to get the rest of the functionality back online? The investigation would continue, but this approach might take some pressure off the team.
- Have we examined all the data sources we have? Sometimes the root cause of the issue is available in a server log that hasn’t been examined. Determine the timestamp when the issue started and then review all server logs for any events that occurred at that same time.
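The last suggestion is easy to mechanize. Here is a rough sketch that pulls every log line within a few minutes of the incident start; it assumes each line begins with an ISO-8601 timestamp, which you would adapt to your own log format:

```python
from datetime import datetime, timedelta

def events_near(log_lines, incident_start, window_minutes=5):
    """Return log lines whose timestamp falls within +/- window_minutes
    of the incident start.

    Assumes a hypothetical format where an ISO-8601 timestamp leads
    each line; adjust the parsing for your own logs.
    """
    window = timedelta(minutes=window_minutes)
    hits = []
    for line in log_lines:
        try:
            ts = datetime.fromisoformat(line.split(" ", 1)[0])
        except ValueError:
            continue  # skip lines without a leading timestamp
        if abs(ts - incident_start) <= window:
            hits.append(line)
    return hits
```

Run something like this across every server log you have, not just the ones for the suspect subsystem; the telling event is often in a log nobody thought to open.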
Issue Resolution and Verification
If the issue is resolved and service functionality is restored, do you have a clear explanation of root cause? This is very important, as not having a good understanding of root cause means the issue could happen again. I have experienced cases where service restored itself and the team dispersed, only to have the outage repeat an hour later. On occasion, the conditions that explain the root cause disappear once service is restored. This makes it more difficult to identify the cause. Having service restored also makes it more difficult to keep tired team members focused on finding the root cause. But, as the leader, you need to remain on point.
You should also verify with any third parties, customers or employees who were experiencing an issue during the outage that they agree service has been restored. I have seen occasions where the issue isn’t fully resolved, or some additional change needed to be applied (like clearing a cache) before all customer issues are fully addressed.
Once the issue is resolved and you have a clear explanation of root cause, you can dismiss the issue resolution team. You should then send a final communication to the broader organization. I usually send a quick email first, stating that the issue has been resolved and that a longer explanation will follow. In that longer follow-up, include a full summary of the issue, its time frame and impact. Detail the root cause and how it was fixed. Be very objective in this broad email and don’t assign any blame. Also, thank the issue resolution team for their efforts.
Finally, a day or two after the issue occurred, make time for a full post mortem. I like the philosophy and format shared by Etsy, called a Blameless Post Mortem. Basically, you want to emphasize that mistakes happen and the key is to focus on what improvements can be made to prevent the issue from happening again in the future. Be objective – not judgmental, treat everyone with respect and use the event as a learning opportunity.
For the post mortem, you should book a room with a large white board and sufficient seating. You should invite all the core members of the engineering team who were involved in the outage, and also ancillary teams like product, customer service, etc. You can even invite members of the Exec team to sit in – this helps promote openness and transparency between departments. Assign a facilitator for the meeting (this can be you or the issue resolution manager). The facilitator will guide the group through the post mortem process, capture items on the white board and publish a summary after the meeting. It is also helpful to have someone put together a sequence of events and timeline before the meeting and share that with all participants. In the post mortem, you should review the following information:
- Briefly review the sequence of events and timeline, allowing anyone to edit or add an item.
- Create two columns on the white board. One column is for Issues and the other is for Improvements.
- Make a list of the system or process failures that contributed to the outage in the Issues column. Be objective with these and don’t use people’s names (write “A database table was dropped accidentally”, not “John dropped a database table accidentally”). These items are usually linked to the root cause.
- Once all the Issues have been delineated, make a second list, in the Improvements column, of things that could be done to prevent those issues from recurring. These are usually process updates, but can also be new or different changes to infrastructure. Try to be exhaustive with this second list. The team may also suggest additional improvements that aren’t directly related to the outage, and those are worth capturing too.
Try to assign an owner for each of the Improvement items. Get these added to your sprint planning backlog. After the meeting, the facilitator should publish the output of the post mortem and distribute to all participants.