A Survivor’s Guide to Cloud Outages
When it comes to hosting cloud services, there is one ever-present worry: will the cloud go down? It is bound to happen at some point, although it should be a very rare occurrence. In the current high-tech world of service uptime agreements that range from 99.9% all the way up to 99.999%, your reaction to a cloud outage is critical. It can mean the difference between a graceful recovery and highly upset customers. No one can guarantee that an outage will never occur, so let’s discuss how to handle the situation so that your team and your company minimize the impact.
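To put those percentages in perspective, here is a minimal sketch of the arithmetic behind an uptime agreement. It assumes the agreement is measured over a 365-day year. Real agreements often measure per month and may exclude scheduled maintenance, so treat the function name and the measurement period as illustrative assumptions rather than the terms of any particular contract.

```python
# Minutes in a 365-day year (an assumption; your agreement may use a
# monthly window or a different definition of the measurement period).
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(uptime_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime budget, in minutes, for a given uptime target."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_minutes * (1 - uptime_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {budget:.1f} minutes of downtime per year")
```

Under these assumptions, 99.9% leaves a little under nine hours of downtime per year, while 99.999% leaves only a few minutes. That is why the speed and quality of your reaction matter so much.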
The most important thing to remember is to remain calm. The battle for control over the situation has begun, and great leaders are great because they can remain calm in highly stressful situations. If you lose your cool, the next thing you will lose is the ability to make rational decisions. Human beings are not effective at making decisions while in a state of panic. As a leader, remaining calm sends a strong message to the team: you are ready to handle the situation head-on, and in the end everything will get resolved. Calm confidence is key. You will need to communicate effectively with your team, as well as with stakeholders outside of your department, while trying to resolve the situation, and remaining calm keeps that communication clear. This is absolutely critical when you have a service agreement that allows for only a short amount of downtime. Remaining cool and collected could mean the difference between a short blip and a prolonged nightmare.
Next, use this outage as an opportunity to build team morale and camaraderie. Make sure everyone on the team understands the situation and is able to assist in a way that highlights their strengths. You will need your key players to step up and make quick decisions. You will also need them to communicate clearly, so that you know exactly what the options are and can make the appropriate decisions to mitigate risks. Once you figure out the plan of action, look for any other members of your team who have the necessary skills to help in the recovery. This allows for a strong recovery but, most importantly, gives the team an opportunity to rise to the occasion. When a failure occurs, it’s important that you do not blame the team. As a leader, the failure is your responsibility. Conversely, when the team rises to the occasion and solves the problem, let them take the credit. While morale will suffer a little during the initial onslaught, a team that rises to the occasion and tackles the challenge together will be far stronger going forward.
Once the systems are back up and restored to full operational status, there is still one step that is often ignored. It is easy to just go about the rest of the day, business as usual. However, you need to reflect on what happened and assess how you and your team performed. What went well during the recovery? What did not go so well? Maybe the infrastructure needs some additional alerting. Maybe some existing alerts proved ineffective. Maybe the alerts were appropriate but the right people were not seeing them. Create an outage report card and fill it out after each outage. The questions on the report card should not target engineering alone, either. How did support handle the customer calls? What about sales? Most importantly, you need to document exactly what went wrong and how it was fixed. This will undoubtedly come in handy if the same type of outage occurs again. Take the answers to these questions and find ways to improve your process so the next cloud outage is handled even better, or prevented entirely.
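If your team prefers to keep the report card in a structured form, one possible sketch follows. The class name, field names, and the `unanswered` helper are all hypothetical, chosen only to mirror the questions above. Adapt them to whatever your own process actually asks.

```python
from dataclasses import dataclass, field, fields
from datetime import date
from typing import List


@dataclass
class OutageReportCard:
    """Hypothetical report card; every field name here is illustrative."""
    outage_date: date
    root_cause: str = ""          # what exactly went wrong
    fix_applied: str = ""         # how it was fixed
    went_well: List[str] = field(default_factory=list)
    went_poorly: List[str] = field(default_factory=list)
    alerting_changes: List[str] = field(default_factory=list)
    support_notes: str = ""       # how support handled customer calls
    sales_notes: str = ""         # how sales was affected or responded

    def unanswered(self) -> List[str]:
        """Return the names of fields that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


if __name__ == "__main__":
    card = OutageReportCard(
        outage_date=date(2024, 1, 15),  # placeholder date for illustration
        root_cause="Example only: expired TLS certificate",
        went_well=["Team assembled quickly"],
    )
    print("Still to fill in:", ", ".join(card.unanswered()))
```

Listing the empty fields is a simple way to make sure the card is actually completed before everyone returns to business as usual.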