On Tuesday, June 8, 2021, there was a massive internet outage that brought down a significant number of websites and applications. Like many such outages, this one was caused by a relatively small web player, Fastly. Fastly provides cloud services and local caching for major portions of the internet. When it went down, the impact was felt throughout the internet.
As an application scales, it also becomes more complex. More scale and more complexity mean higher risk of a problem that could impact availability.
A well-known monitoring company suffered from serious availability problems while it was growing from a small to a midsize company. Its traffic was increasing dramatically, and its infrastructure couldn’t keep up. Worse yet, it didn’t always know when it was having a problem, and it certainly didn’t know when to expect the problems.
How do you avoid availability problems in your application? How do you mature your application as you scale so that you can meet your customers’ growing demand?
It’s not easy.
Improving availability is not about writing the correct code. Improving application availability is much more about improving the operational processes, procedures, and culture of your organisation in order to instill the practices necessary to maintain availability.
There are five steps that any company can take to improve its application availability and reduce its risk of an operational problem.
Step 1. Know your risks
Many people do not realise how much risk is inherent in their applications. Much of this risk takes the form of technical debt in the code, but some of it stems from deliberate decisions about how the system should operate, decisions whose consequences are not fully understood.
Donald Rumsfeld, the former United States Secretary of Defense, famously said that there are “known knowns” and there are “known unknowns,” but that the problems to be concerned about are the “unknown unknowns”: the problems that we don’t know that we don’t know about.
Risk management is about turning unknowns into knowns. In the case of modern applications, risk management is about identifying areas of concern, labeling them, quantifying them, and prioritising them, and then addressing the risks that have the highest impact on our business.
To do this, each development team for each service in your application should create and maintain a risk matrix. A risk matrix is a spreadsheet that lists as many issues and potential issues as possible. It’s a brainstorm by everyone with a stake in the service to identify as many risks as they can. Each risk is then assigned two numbers:
- A severity, which specifies how serious a problem it would be for our business if this risk were to happen.
- A likelihood, which specifies how likely this risk is to occur.
A risk can have a high severity, but a low likelihood, meaning that it isn’t likely to happen, but if it does, the impact would be significant. It can have a high likelihood, but a low severity, which means the risk is more than likely to occur but won’t be a serious problem.
The most concerning risks are the ones that have a high likelihood and a high severity. They pose very serious problems to our business and are likely to happen. These are the highest impact risks.
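As a rough illustration, the risk matrix described above can be sketched in a few lines of Python. The 1-to-5 scales, the idea of multiplying severity by likelihood to get a single score, and the sample entries are all assumptions made for this sketch; a real matrix would typically live in a spreadsheet and use whatever scales your teams agree on.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    severity: int    # assumed scale: 1 (minor) to 5 (business-critical)
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)

    @property
    def score(self) -> int:
        # Multiplying the two numbers is one common convention, not the only one.
        return self.severity * self.likelihood


def prioritise(risks):
    """Return the risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Hypothetical entries, for illustration only.
risk_matrix = [
    Risk("Single database with no replica", severity=5, likelihood=2),
    Risk("Log disk fills up under load", severity=2, likelihood=5),
    Risk("Unbounded retry storm on payment API", severity=5, likelihood=4),
]

for risk in prioritise(risk_matrix):
    print(f"{risk.score:>2}  {risk.name}")
```

The high-severity, high-likelihood entry sorts to the top, which is the ordering a team would use to decide what to work on first.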
The risk matrix gives each team a model for prioritising its operational workload, so that the team understands what is important to work on and what is not. Done correctly and consistently, it can be used to prioritise risks across teams and allow management to allocate resources to the greatest issues.
Risk matrices give visibility and prioritisation to technical debt and pending problems. They are a great communications tool between development teams and management.
Effective use of risk matrices will help reduce availability issues in your application.
Step 2. Watch your software
Understanding what your software and your operational infrastructure are doing at any given time is critical to maintaining high availability. Application and infrastructure analytics can give you insight into how your application is performing, allowing you to tune and optimise your operational environment, detect and resolve live operational problems, and understand who is using your software and how they are using it.
Used and set up properly, analytics can give early indications of pending availability problems, allowing you to fix an application or operational issue before it becomes an availability problem.
There are many free and paid systems and services that provide application and infrastructure metrics and analytics. All of them have advantages and disadvantages. Free systems are valuable for those who want to build and maintain their own systems, and even customise them to fit their particular needs. Paid systems can offer a more hands-off experience, but often require a significant financial investment. Some newer paid services also include AI-based analysis that examines your application performance for you and surfaces early indicators of problems that you might not notice in the depths of the available data.
A full system to analyse your software provides the ability to:
- Monitor your system continuously to know how it is working.
- Examine changes in performance around deployments, to see if a deployment may have introduced a problem, or to verify a problem has been resolved.
- Inform you via notifications when anomalies of various sizes or shapes are detected, allowing you to look at deeper data to determine what might have gone wrong.
- Assist you in resolving an ongoing incident, using data that can help understand why a particular problem is occurring.
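To make the notification idea in the list above concrete, here is a minimal sketch of one simple way to flag an anomaly: compare each new sample with the mean and standard deviation of the samples just before it. The function name, the window size, the threshold of three standard deviations, and the latency figures are all assumptions for illustration; commercial and open source monitoring tools use far more sophisticated methods.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return the indices of samples that differ from the mean of the
    preceding `window` samples by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change at all as an anomaly.
            if samples[i] != mu:
                anomalies.append(i)
        elif abs(samples[i] - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies


# Hypothetical response times in milliseconds, with one obvious spike.
latencies_ms = [101, 99, 100, 102, 98, 100, 101, 450, 100, 99]
print(find_anomalies(latencies_ms))  # [7]
```

In practice, the indices returned here would trigger a notification rather than a print statement.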
Analytics are also a great way to monitor service-level agreements (SLAs). This includes both public SLAs (those visible to customers) and internal SLAs (those that describe commitments between and among internal services). Analytics are a great tool for inter-team communications.
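As a small worked example of what monitoring an SLA involves, the sketch below converts an availability target into a downtime allowance and compares it with measured downtime. The 30-day period, the 99.9% target, and the 60 minutes of downtime are assumed values chosen for illustration, and the function names are hypothetical.

```python
def allowed_downtime_minutes(sla_percent, period_days=30):
    """Minutes of downtime permitted in the period by an availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


def measured_availability(downtime_minutes, period_days=30):
    """Availability, as a percentage, given the observed downtime."""
    total_minutes = period_days * 24 * 60
    return 100 * (total_minutes - downtime_minutes) / total_minutes


target = 99.9  # assumed SLA target, in percent
budget = allowed_downtime_minutes(target)   # roughly 43.2 minutes in 30 days
actual = measured_availability(60)          # assumed 60 minutes of downtime

print(f"Downtime budget: {budget:.1f} minutes")
print(f"Measured availability: {actual:.3f}%")
print("SLA met" if actual >= target else "SLA missed")
```

With these numbers, an hour of downtime in a month exceeds the allowance of a 99.9% target, so the SLA is missed.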
Step 3. Reduce your technical debt
Once you have analytics in place and you have identified your technical debt and other problems via your risk matrix and other tools, you need to evaluate and reduce your highest-impact problems. Knowing what your problems are is great, but it doesn’t help if you don’t work on reducing those problems.
If you have a high-severity, high-likelihood risk on your matrix that is driving availability issues, it must be fixed. But fixing it doesn’t necessarily mean rewriting to remove the risk. You can resolve the availability issue by reducing either the severity or the likelihood of the risk.
In other words, if you can’t easily remove an issue that’s causing you problems, then either make the issue happen less often—so that it’s not a frequent source of concern—or reduce the impact of the problem when it does occur by reducing the severity. Either way, the end result is that the problem is no longer a major driver. It may still be a recognised risk, but the reduced frequency or reduced impact makes it no longer a critical concern.
Having a regular focus on technical debt helps keep availability in line. But be careful you aren’t looking for perfection. Your goal should never be to remove all technical debt, and hence remove all risk. Unless you are building the control software for an airplane, rocket, or similar system, you need to balance effort with the impact of the problem. Focusing on reducing technical debt too far may indicate that you are spending too much time focusing on “perfecting” software at the cost of some other business opportunity.
Step 4. Automate recovery as much as possible
When an incident does occur, how long it takes to recover can have a huge impact on your overall application availability. It’s important to recover fast. It’s also important to correctly diagnose the problem and take steps to ensure it doesn’t occur again.
When an availability incident happens, the response generally involves the following steps:
- You notice that a problem is occurring (either you detect the problem, or a customer reports the problem).
- You analyse what’s causing the problem.
- You roll out a remediation to reduce or eliminate the problem.
- You implement a permanent fix, if necessary.
- You hold a post mortem on the episode.
The same sequence occurs every time there is an incident. The problem is that this process takes time. Averaged across incidents, the time between when a problem occurs, or when it is first noticed, and when a remediation is put in place to remove the problem is called the mean time to repair (MTTR). The longer your MTTR, the lower your availability. Because humans are involved in diagnosing and fixing the problem, your MTTR can be quite long, impacting customer satisfaction.
However, sometimes you are aware of certain types of problems that can occur, and the process to fix the problem can be quick and automated. By automating the repair of these types of problems, you can dramatically improve your MTTR.
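To show how MTTR is calculated, here is a minimal sketch that averages the repair time over a set of incidents. The incident timestamps are invented for illustration, and the function name is hypothetical; each pair holds the time the problem started and the time the remediation was in place.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average of (remediated - started) over a list of (started, remediated) pairs."""
    durations = [remediated - started for started, remediated in incidents]
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incidents, for illustration only.
incidents = [
    (datetime(2024, 3, 1, 2, 0), datetime(2024, 3, 1, 3, 30)),    # 90 minutes
    (datetime(2024, 3, 9, 14, 0), datetime(2024, 3, 9, 14, 20)),  # 20 minutes
    (datetime(2024, 3, 20, 9, 0), datetime(2024, 3, 20, 9, 40)),  # 40 minutes
]

mttr = mean_time_to_repair(incidents)
print(mttr)  # 0:50:00
```

A single slow incident dominates the average here, which is why shortening the longest repairs tends to pay off first.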
A classic example of an automatable repair is when a computer instance goes offline. This can happen due to a software problem, a network problem, or another cause. But monitoring software can detect when the instance stops responding, and the instance can be immediately rebooted. Or, in the cloud, the instance can be terminated and replaced with a new instance. This can occur automatically. Because a human doesn’t have to be involved, your MTTR for this class of problem can be reduced, which can improve your availability markedly.
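The detect-and-replace loop described above might look something like the following sketch. It is deliberately generic: `is_healthy` and `replace` are hypothetical callables that you would wire up to your own monitoring and to your cloud provider's tooling, and the rule of waiting for several consecutive failed checks before acting is an assumption added here to avoid replacing an instance after a single blip.

```python
class AutoRecovery:
    """Replace an instance after it fails several health checks in a row."""

    def __init__(self, is_healthy, replace, max_failures=3):
        self.is_healthy = is_healthy      # hypothetical: instance_id -> bool
        self.replace = replace            # hypothetical: instance_id -> None
        self.max_failures = max_failures  # assumed policy, tune to taste
        self.failures = {}                # consecutive failures per instance

    def check(self, instance_ids):
        """Run one monitoring pass and return the IDs that were replaced."""
        replaced = []
        for instance_id in instance_ids:
            if self.is_healthy(instance_id):
                self.failures[instance_id] = 0
                continue
            self.failures[instance_id] = self.failures.get(instance_id, 0) + 1
            if self.failures[instance_id] >= self.max_failures:
                self.replace(instance_id)
                self.failures[instance_id] = 0
                replaced.append(instance_id)
        return replaced


# Simulated environment, for illustration only: "web-2" is down until replaced.
down = {"web-2"}
replaced_log = []


def fake_replace(instance_id):
    replaced_log.append(instance_id)
    down.discard(instance_id)


recovery = AutoRecovery(lambda i: i not in down, fake_replace, max_failures=2)
for _ in range(3):  # three monitoring passes
    recovery.check(["web-1", "web-2"])

print(replaced_log)  # ['web-2']
```

In a real system, `check` would run on a schedule, and no human would need to be paged for this class of failure.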
Step 5. Try and break things regularly
The best way to keep your application operating is to try and break it regularly.
Yes, that’s right. You heard me correctly.
The operators of the biggest applications in the world regularly test their resilience by deliberately trying to break their own applications.
The idea is this: Your software will fail. But do you want it to fail in the middle of the night or at a critical time operationally? Or would you rather have it fail at a more opportune time, with your engineers looking on and ready to detect and fix the problem more quickly?
In either case, you gain valuable experience of how your application operates. In the first case, you provide a bad experience and potentially long-lasting damage to your customers while you try to figure out what’s wrong with the application. In the second case, you know what caused the problem (you caused it) and you can quickly fix it. The lessons are the same, but the cost of learning them is far lower.
There are two common ways to accomplish this production operation testing. The first is called game days. Game days are scheduled times when you inject specific failures into your operational infrastructure, in order to see how the problem manifests and how quickly you can detect and fix the problem. A common game day test scenario, for example, is to bring down an entire data center to see if your application can fail over to a backup data center.
The second common method of production operation testing is called chaos testing. Chaos testing involves having a software system operating that, randomly and unpredictably, breaks parts of your system on a regular basis. This might involve crashing a server, breaking a network link, or taking a load balancer offline. Chaos testing is a great way to test automated recovery mechanisms and prove the safety and efficacy of your recovery processes.
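The core of a chaos testing tool can be sketched very simply: on each scheduled tick, decide at random whether to inject a fault, and if so, pick one at random. Everything named below is hypothetical, including the `ChaosRunner` class, the fault names, and the injection probability. The fault functions here only record that they were called; real ones would call your infrastructure tooling.

```python
import random


class ChaosRunner:
    """On each tick, maybe inject one randomly chosen fault."""

    def __init__(self, faults, probability=0.1, rng=None):
        self.faults = faults            # fault name -> callable that injects it
        self.probability = probability  # chance of injecting a fault per tick
        self.rng = rng or random.Random()

    def tick(self):
        """Return the name of the injected fault, or None if nothing was injected."""
        if self.rng.random() >= self.probability:
            return None
        name = self.rng.choice(sorted(self.faults))
        self.faults[name]()
        return name


# Stand-in fault injectors, for illustration only.
injected = []
faults = {
    "crash_server": lambda: injected.append("crash_server"),
    "break_network_link": lambda: injected.append("break_network_link"),
    "disable_load_balancer": lambda: injected.append("disable_load_balancer"),
}

runner = ChaosRunner(faults, probability=1.0, rng=random.Random(42))
name = runner.tick()
print(f"Injected fault: {name}")
```

Passing in a seeded random number generator, as done here, makes a run repeatable, which helps when you need to reproduce the exact sequence of faults that exposed a problem.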
In either case, the goal is to identify problems in a controlled manner, learn from the errors, and improve the quality of your application to be able to self-repair from these failures. The twin goals of both approaches are to improve your operational reliability and improve your application availability.
Improve processes, improve availability
Improving application availability is not about striving for perfection or eliminating every risk. It is much more about improving your operational processes: working to reduce the severity and likelihood of problems, closely monitoring applications and infrastructure, keeping technical debt in check, automating recovery mechanisms, and regularly putting those recovery mechanisms to the test. Follow these steps, and your application availability will be markedly improved, your customers will be happier, and happier customers will mean more business for your company.