5 metrics to assess enterprise back-up and recovery systems
- 31 March, 2020 05:10
Finding out whether back-up and recovery systems work well is more complicated than just knowing how long back-ups and restores take; agreeing on a core set of essential metrics is the key to judging whether a system succeeds or needs a redesign.
Here are five metrics every enterprise should gather in order to ensure that their systems meet the needs of the business:
Storage capacity and usage
Let's start with a very basic metric: Does the back-up system have enough storage capacity to meet current and future back-up and recovery needs? Whether talking about a tape library or a storage array, the storage system has a finite amount of capacity, and CIOs need to monitor what that capacity is and what percentage of it they're using over time.
Failing to monitor it can result in being forced to make decisions that might go against company policies. For example, the only way to create additional capacity without purchasing more is to delete older back-ups. It would be a shame if failure to monitor the capacity of a storage system resulted in the inability to meet the retention requirements a company has set.
Cloud-based object storage can help ease this worry because some services offer an essentially unlimited amount of capacity.
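As a rough illustration of tracking capacity over time, the sketch below computes the percentage of storage in use and estimates how long the remaining space will last. The function names and sample figures are hypothetical, and the forecast assumes growth stays linear between the first and last samples, which a real environment may not honour.

```python
from datetime import date
from typing import Optional, Sequence, Tuple


def percent_used(used_tb: float, total_tb: float) -> float:
    """Share of back-up storage currently in use, as a percentage."""
    return 100.0 * used_tb / total_tb


def days_until_full(
    samples: Sequence[Tuple[date, float]], total_tb: float
) -> Optional[float]:
    """Estimate the days left before the storage system fills up.

    Uses only the first and last (date, used_tb) samples and assumes
    linear growth. Returns None if usage is flat or shrinking.
    """
    (first_day, first_used), (last_day, last_used) = samples[0], samples[-1]
    elapsed_days = (last_day - first_day).days
    if elapsed_days <= 0:
        return None
    growth_per_day = (last_used - first_used) / elapsed_days
    if growth_per_day <= 0:
        return None
    return (total_tb - last_used) / growth_per_day


# Hypothetical figures: a 500 TB system sampled at the start and end of a quarter.
samples = [(date(2020, 1, 1), 300.0), (date(2020, 3, 31), 390.0)]
print(percent_used(390.0, 500.0))       # 78.0
print(days_until_full(samples, 500.0))  # 110.0
```

Comparing the forecast against the retention period a company has set shows whether capacity will run out before older back-ups are allowed to expire.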
Throughput capacity and usage
Every storage system can accept back-up data only at a certain rate, usually measured in megabytes per second or terabytes per hour. CIOs should be aware of this number and ensure they monitor a back-up system’s usage of it. Failure to do so can result in back-ups taking longer and longer and stretching into the workday.
Monitoring the throughput capacity and usage of tape is particularly important, because the throughput of back-ups needs to match the tape drive’s ability to transfer data. Specifically, the throughput that CIOs supply to a tape drive should be more than the drive’s minimum speed.
CIOs should consult documentation for the drive and the vendor’s support system to find out what the minimum acceptable speed is, then make sure back-up throughput reaches it or gets as close to it as possible. It is unlikely that they'll approach the maximum speed of the tape drive, but they should also monitor for that.
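The check described above can be sketched in a few lines. The drive speeds used here are made-up placeholders, not the specification of any real tape drive; the actual minimum and maximum must come from the drive's documentation or the vendor.

```python
def throughput_mb_per_s(bytes_transferred: int, seconds: float) -> float:
    """Average back-up throughput in megabytes per second (1 MB = 10**6 bytes)."""
    return bytes_transferred / 1_000_000 / seconds


def classify_tape_feed(
    observed_mb_s: float, drive_min_mb_s: float, drive_max_mb_s: float
) -> str:
    """Compare observed throughput with a tape drive's speed range."""
    if observed_mb_s < drive_min_mb_s:
        return "below-minimum"  # the drive is being fed too slowly
    if observed_mb_s >= drive_max_mb_s:
        return "at-maximum"     # the drive itself is now the limit
    return "ok"


# Hypothetical job: 1.8 TB written in four hours to a drive whose
# assumed speed range is 100 to 300 MB/s.
observed = throughput_mb_per_s(1_800_000_000_000, 4 * 3600)
print(observed)                                    # 125.0
print(classify_tape_feed(observed, 100.0, 300.0))  # ok
```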
Compute capacity and usage
The capability of a back-up system is also driven by the ability of the compute system behind it. If the processing capability of the back-up servers or the database behind the back-up system is unable to keep up, it can also slow down back-ups and result in them bleeding into the workday. CIOs should also monitor the performance of the back-up system to see the degree to which this is happening.
The previous two metrics are very important because they affect what is known as the back-up window: the time period during which back-ups are allowed to run. If CIOs are using a traditional back-up system where there is a significant impact on the performance of primary systems during back-up, they should agree in advance what the back-up window is.
If CIOs are coming close to filling up the entire window, it’s time to either reevaluate the window or redesign the back-up system.
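One way to see how close back-ups are to filling the window is to express their running time as a fraction of it. The sketch below does that; the 80 per cent warning threshold and the example times are assumptions chosen for illustration, not recommended values.

```python
from datetime import datetime


def window_utilisation(
    backup_start: datetime,
    backup_end: datetime,
    window_start: datetime,
    window_end: datetime,
) -> float:
    """Fraction of the agreed back-up window consumed by the back-up run."""
    return (backup_end - backup_start) / (window_end - window_start)


def window_status(
    backup_start: datetime,
    backup_end: datetime,
    window_start: datetime,
    window_end: datetime,
    warn_at: float = 0.8,  # assumed threshold, tune to taste
) -> str:
    """Classify a back-up run against its window."""
    if backup_end > window_end:
        return "overrun"     # back-ups bled into the workday
    used = window_utilisation(backup_start, backup_end, window_start, window_end)
    return "near-limit" if used >= warn_at else "ok"


# Hypothetical night: an agreed window of 22:00 to 06:00, back-ups done at 05:12.
w_start, w_end = datetime(2020, 3, 30, 22, 0), datetime(2020, 3, 31, 6, 0)
b_start, b_end = datetime(2020, 3, 30, 22, 0), datetime(2020, 3, 31, 5, 12)
print(window_utilisation(b_start, b_end, w_start, w_end))  # 0.9
print(window_status(b_start, b_end, w_start, w_end))       # near-limit
```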
Companies that use back-up techniques that fall into the incremental-forever category (e.g. continuous data protection (CDP), near-CDP, block-level incremental back-ups, or source deduplication back-ups) don’t typically have to worry about a back-up window.
This is because back-ups run for very short periods of time and transfer a small amount of data, a process which typically has very low performance impact on primary systems. This is why customers using such systems typically perform back-ups throughout the day, as often as once an hour or even every five minutes. A true CDP system actually runs continuously, transferring each new byte as it’s written.
Recovery point and recovery time reality
No one really cares how long it takes to back up; they care how long it takes to restore. The recovery time objective (RTO) is the amount of time agreed to by all parties that a restore should take after some kind of incident requiring one.
The length of an acceptable RTO for any given company is typically driven by the amount of money it will lose when systems are down. For example, if a company will lose millions of dollars per hour during downtime, it typically wants a very tight RTO.
Companies such as financial trading firms, for example, seek to have an RTO as close to zero as possible. Other companies that can tolerate longer periods of computer downtime might have an RTO measured in weeks. The important thing is that the RTO matches the business needs of the company.
There is no need to have a single RTO across the entire company. It is perfectly normal and reasonable to have a tighter RTO for more critical applications, and a more relaxed RTO for the rest of the data centre.
Recovery point objective (RPO) is the amount of acceptable data loss after a large incident, measured in time. For example, if we agree that we can lose one hour’s worth of data, we have agreed to a one-hour RPO. Most companies, however, settle on values that are much higher, such as 24 hours or more.
This is primarily because the smaller your RPO, the more frequently you must run your back-up system. Many companies might want a tighter RPO, but they realise that it's not possible with their current back-up system. Like the RTO, it is perfectly normal to have multiple RPOs throughout the company depending on the criticality of different data sets.
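The link between back-up frequency and RPO can be made concrete: the most data that could be lost is roughly the longest stretch of time not covered by a successful back-up. The sketch below is a simplified illustration with hypothetical names and timestamps; it ignores how long each back-up takes to run.

```python
from datetime import datetime, timedelta
from typing import Sequence


def worst_case_data_loss(
    backup_times: Sequence[datetime], now: datetime
) -> timedelta:
    """Longest gap between consecutive successful back-ups, including the
    gap between the most recent back-up and the current time."""
    if not backup_times:
        raise ValueError("no successful back-ups recorded")
    points = sorted(backup_times) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


# Hypothetical history: nightly back-ups at 01:00, with the 30 March run missed.
history = [
    datetime(2020, 3, 28, 1, 0),
    datetime(2020, 3, 29, 1, 0),
    datetime(2020, 3, 31, 1, 0),
]
rpo = timedelta(hours=24)
gap = worst_case_data_loss(history, now=datetime(2020, 3, 31, 5, 10))
print(gap)         # 2 days, 0:00:00
print(gap <= rpo)  # False
```

A single missed nightly run is enough to double the exposure, which is why a 24-hour RPO cannot be met reliably by a back-up that runs only once every 24 hours.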
The recovery point reality (RPR) and recovery time reality (RTR) metrics are measured only if a recovery occurs – whether real or via a test. The RTO and RPO are objectives; the RPR and RTR measure the degree to which you met those objectives after a restore. It is important to measure this and compare it against the RTO and RPO to evaluate whether you need to consider a redesign of your back-up-and-recovery system.
The reality is that most companies’ RTR and RPR are nowhere near the agreed-upon RTO and RPO for their company. What's important is to bring this reality to light and acknowledge it. Either we adjust the RTO and RPO, or we redesign the back-up system. There is no point in having a tight RTO or RPO if the RTR and RPR are completely different.
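Recording the result of each restore, real or test, next to its objectives makes the gap hard to ignore. The following is a minimal sketch of such a record; the class name, field names and figures are invented for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RestoreResult:
    """Objectives and measured reality for one restore of one application."""
    application: str
    rto: timedelta  # agreed recovery time objective
    rpo: timedelta  # agreed recovery point objective
    rtr: timedelta  # measured time the restore actually took
    rpr: timedelta  # measured amount of data actually lost, in time

    def met_rto(self) -> bool:
        return self.rtr <= self.rto

    def met_rpo(self) -> bool:
        return self.rpr <= self.rpo


# Hypothetical restore test for a critical application.
result = RestoreResult(
    application="order-database",
    rto=timedelta(hours=4),
    rpo=timedelta(hours=1),
    rtr=timedelta(hours=9),
    rpr=timedelta(minutes=45),
)
print(result.met_rto())  # False
print(result.met_rpo())  # True
```

Keeping one record per application fits the point that different parts of the business can have different objectives.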
What to do with metrics
One of the ways that CIOs can increase the confidence in back-up systems is to document and publish all the metrics mentioned here.
Let management know the degree to which back-up systems are performing as designed. Let them know – based on the current growth rate – how long it will be before they need to buy additional capacity.
And above all, make sure that they are aware of the back-up and recovery system’s ability to meet the agreed-upon RTO and RPO. Hiding this fact will do no one any good if there is an outage.