The disaster recovery maze
26 August, 2007 22:00
Disaster recovery used to be easy.
You’d back up your mainframe to tape every night or over the weekend. If you were really conscientious, you’d send the tapes off-site and arrange for contingency processing at some other data centre. Testing your recovery plan? You’d retrieve the tapes and see if you could read them.
Of course, things have gotten steadily more complicated over the years, with distributed and networked computers, n-tier computing, heterogeneous hardware and operating systems, virtualisation, automated data feeds from external parties and more.
Adding to the confusion has been a steady change in the meaning of “disaster.” Ten years ago, a four-hour outage might not have even been noticed by users or customers; today, it could cost you your job.
As a result, it has become vastly more difficult to prepare and test disaster recovery plans, and increasingly unlikely that you will go to bed at night feeling 100 percent sure that all your IT assets are protected.
Companies are dealing with these challenges in various ways. Some are reaching out to external parties for help with disaster recovery planning and hot sites, to which computer processing can be moved quickly in an emergency. Others have pulled back from these arrangements, saying they can better handle the complexity of disaster recovery in-house. Still others are essentially redefining disaster recovery by substituting notions of “disaster avoidance.”
Jerry Grochow, CIO at MIT, illustrates the problem this way: “I once counted a dozen different boxes that had to be up for [an application] to work from end to end, and that’s not unusual. So you ask your SAP application programmer, ‘What’s necessary to recover your system?’ and you don’t necessarily get the full picture, because the programmer doesn’t realise that the authentication server needs to be running so someone can even log on, and it’s running in a different data centre.”
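Grochow's point about hidden dependencies can be made concrete with a small sketch. The Python below is illustrative only: the system names and the dependency map are hypothetical, not MIT's actual environment. It walks everything an application needs, directly or indirectly, and lists those systems in an order in which they could be brought back up.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each system lists what must be up before it.
DEPENDS_ON = {
    "sap_app": {"database", "auth_server"},
    "database": {"storage"},
    "auth_server": {"directory", "network_core"},
    "directory": {"network_core"},
    "storage": {"network_core"},
    "network_core": set(),
}


def recovery_order(target, depends_on):
    """Return every system the target needs, dependencies first."""
    # Collect the target plus all of its direct and indirect dependencies.
    needed = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            stack.extend(depends_on.get(node, ()))
    # Sort so that each system appears after everything it relies on.
    subgraph = {name: depends_on.get(name, set()) for name in needed}
    return list(TopologicalSorter(subgraph).static_order())


order = recovery_order("sap_app", DEPENDS_ON)
print(order)
```

In this made-up example, the programmer who names only the database would miss four of the six boxes, including the authentication server.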
Not only are an organisation’s IT assets no longer all located in a cosy glass room with a raised floor, they may not even be under the control of the IT department. Grochow recalls an earlier job at a brokerage firm that got automated data feeds from 40 external suppliers, noting that some financial institutions have 100 such connections. “How to recover a major data processing application when you have that many feeds is extremely complicated,” he says.
The challenges are legion.
Schneider National in Green Bay, Wisconsin, at one time contracted with a service provider for a disaster recovery hot site but recently decided to set up its own second data centre to serve as a recovery facility. “Ours is a very complex and highly integrated technology environment,” says Paul Mueller, vice president of technology services at the trucking company, which has 36 locations in North America. “As complexity has increased, so has the difficulty associated with hot-site recovery.”
It proved difficult to accurately replicate Schneider’s operating environment at the external facility, Mueller says, and so his semiannual disaster recovery tests were never completely satisfactory. “Invariably, we encountered issues when we executed those tests, such as tapes not being correct,” he says. “Our ability to restore was problematic based on the hardware configurations, operating system configurations and so on.”
Mueller says he is much more comfortable with his new arrangement, but it came at a stiff price. Schneider’s two data centres are connected with redundant fibre-optic cables, redundant telephone systems and dual mainframes backing each other up. “We have invested heavily based on the risk to the enterprise and to the supply chains that we help our customers manage,” he says. “But we felt this investment was absolutely the right way to go.”
And the investment was not just in facilities. With the help of a consultant, Mueller’s staff interviewed 70 business managers and a few key customers. The interviews gleaned estimates of the losses that would result from various types and durations of outages, as well as managers’ recovery-time goals.
“When you have that information consolidated into an assessment document and you get to see the aggregate impact to the business of losing your data centre, it becomes a very compelling story,” Mueller says.
Bob Dowd, CIO at Sonora Quest Laboratories in Tempe, Arizona, says his company can’t afford a fully redundant hot site for disaster recovery, but he has taken other steps aimed at avoiding a disaster. Sonora Quest runs medical tests for 20,000 patients every night and gets the results to doctors by early the next morning, so it’s not hard to imagine the effect that a prolonged outage in its highly automated processes would have on the business.
“We have hardened the computer room and built in all kinds of redundancy, so if one node fails, we have immediate fail-over to another node,” Dowd says. The Tempe data centre has redundant disks, two network cores and no single points of failure. Plus, it does two backups a day, one to a server and another to tapes that are taken off-site.
Still, Dowd worries about the data centre, which sits near the end of a runway at the Phoenix airport. He’d like the safety of a remote backup facility, and he has an idea for getting one on the cheap.
Part of the Tempe data centre is devoted to serving as a test environment for the labs’ systems — effectively a scaled-down duplicate of the production environment. If that were moved to Sonora Quest’s lab in Tucson, Arizona, it could be used as a backup for Tempe, Dowd reasons. “We’d be using it to save the business, not necessarily doing upgrades,” he explains.
Rod Flory, CIO at Lennox International in Richardson, Texas, says the heating and cooling system company has been rolling out server virtualisation software to increase the efficiency and flexibility of its servers. But that has complicated disaster recovery planning, he says.
“With VMware, we are changing our server platforms more frequently — not adding servers, but changing memory, the number of CPUs in them and so on,” Flory says. “So quarter to quarter, our environment looks different, and keeping up with that on the hot site is a challenge.”
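One way to picture the problem Flory describes is as a routine comparison between the production inventory and what the hot site holds. The sketch below assumes both inventories are available as simple records; the server names, fields and values are hypothetical and are not drawn from Lennox's environment or from any VMware tool.

```python
# Hypothetical inventories: server name -> hardware configuration.
production = {
    "erp-01": {"cpus": 8, "memory_gb": 64},
    "mail-01": {"cpus": 4, "memory_gb": 16},
}
hot_site = {
    "erp-01": {"cpus": 4, "memory_gb": 32},
    "mail-01": {"cpus": 4, "memory_gb": 16},
}


def find_drift(prod, recovery):
    """List (server, setting, production value, recovery value) mismatches."""
    drift = []
    for server, config in sorted(prod.items()):
        other = recovery.get(server)
        if other is None:
            # The recovery site has no record of this server at all.
            drift.append((server, "missing", None, None))
            continue
        for key, value in sorted(config.items()):
            if other.get(key) != value:
                drift.append((server, key, value, other.get(key)))
    return drift


for item in find_drift(production, hot_site):
    print(item)
```

Run after each quarter's changes, a check along these lines would show where the hot site has fallen behind before a test uncovers it.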
Flory says he tests his disaster recovery plan “religiously” once a year, and it’s not a trivial effort. “It’s a project,” he says. “I take five people and set them aside for a few weeks.”
The tests run smoothly enough, Flory says, but he’s considering involving a disaster recovery firm in a future test. “You look at situations like the bird flu. You are counting on five or six people who know how to execute the plan, but what if they are not available?” he says. “Can your plan be scripted well enough that you could hire a consulting group, give them the book and say, ‘Here, execute the plan’?”
And there’s another improvement Flory wants to make. Traditionally, Lennox’s systems have been centralised at company headquarters, but more recently, functions such as e-mail and computer-aided design have been pushed out to servers at manufacturing sites where there are no disaster recovery capabilities.
But including those remote sites in the centralised plan is not simple because the sites don’t have standard systems. “We are dealing with a legacy of autonomous decision-making,” Flory says. “We may have Dell servers at one facility and IBM at the next. So you look at 15 to 20 major facilities, and you realise you don’t have a common architecture.”
He says Lennox will try to move the remote sites to a more common architecture — so the central data centre can serve as a hot site for them — but that could take years.
Meanwhile, MIT is supplementing its two on-campus data centres with two additional leased facilities, one a few miles away and the other “many, many” miles away, says Grochow. But these will not be traditional disaster recovery sites. All four will be in use all the time, with each critical application running at two or more of them. The four centres in total will not have a great deal of excess capacity or redundant equipment, so they will not be prohibitively expensive, Grochow says.
With this set-up, the difficulty of testing a disaster recovery plan almost disappears. Because every site is running all the time, and because each critical application is running in more than one place, the plan is essentially tested every day, Grochow says. “The idea is to always be in a ‘fail-soft’ mode. If you have an architecture that allows certain things to be down, you are never completely out of business,” he explains. “But if your architecture has lots of single points of failure, you have to have a very detailed recovery plan.
“The concept of disaster recovery as we knew it is changing,” Grochow says. “I think we have gotten past the point where you can rely on a third party to provide hot-site recovery, because it has gotten too complicated.”
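The rule Grochow describes, with every critical application running at two or more sites, lends itself to a simple check. The following sketch is an assumption-laden illustration: the application names, site names and placement are invented, and it tests only for the loss of a single site.

```python
# Hypothetical placement: critical application -> sites where it runs.
PLACEMENT = {
    "email": {"campus_a", "leased_near"},
    "student_records": {"campus_b", "leased_far"},
    "payroll": {"campus_a"},
}


def apps_lost_if_site_fails(placement, failed_site):
    """Return applications left with no running site if failed_site goes down."""
    return sorted(
        app for app, sites in placement.items() if not (sites - {failed_site})
    )


def single_points_of_failure(placement):
    """Map each site to the applications that would be lost along with it."""
    all_sites = set()
    for sites in placement.values():
        all_sites |= sites
    result = {}
    for site in sorted(all_sites):
        lost = apps_lost_if_site_fails(placement, site)
        if lost:
            result[site] = lost
    return result


print(single_points_of_failure(PLACEMENT))
```

Here the hypothetical payroll system runs in only one place, so it is the one application that would still need a detailed recovery plan of the traditional kind.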
The outsourcing option
There is a trend among large companies to bring disaster recovery in-house after having outsourced it, according to a recent report from Gartner. Here are some of the reasons:
- A desire to tighten recovery-time windows to 24 hours — and often to less than four hours, which vendors quote at prices that make customers cringe.
- The distance of recovery sites from the customer’s data centre. They’re often too far, making the transport of tapes and personnel costly and slow.
- Inflexible, long-term contracts of three to five years, which may exceed the customer’s planning horizon.
- Inflexible testing options and environments.
But before giving up on service providers, Gartner advises companies to consider that the vendors may have far greater assets in terms of personnel, equipment, power and processing redundancy, and disaster recovery expertise.
A contrarian view
“I’ve been in IT for 33 years, and I don’t believe disaster recovery is getting harder at all,” says Rod Hamilton, CIO of health insurance provider UnitedHealth Group International Inc. in Minneapolis.
While he acknowledges that applications, environments and business needs vary greatly from company to company, Hamilton says three trends have eased the pain for him.
A huge drop in the cost of communications. “Thirty-three years ago, there was no Internet, and the cost of connecting two sites was wildly exorbitant,” says Hamilton. “Now, real-time backup to a remote site is economically feasible.”
Business process outsourcing and offshoring. “In order to offshore a process, you have to make it portable, and as soon as you make it portable, it’s easier to recover,” Hamilton says.
Web-based applications. The move to the Web also has increased application portability, because users as well as developers can access applications from anywhere via the Internet. An operations centre in Miami was frequently shut down during hurricane threats, Hamilton says, but it can now remain open longer because staffers can work from home. “It’s possible, with redirects, to move the back end somewhere, and users on the Internet are none the wiser,” he says.
And he says moving to Web services eases the burden of supporting and recovering desktop systems, because Web portals can deliver functionality to clients without application software having to be installed on them.
“I’ve lived through the main evolutionary moves in application deployment, with traditional mainframe systems requiring support at the centre but virtually none at the desktop, through client/server, with elaborate needs to manage software at each desktop, back full circle to the Web,” Hamilton says. “I view Web technology as a return to the good old days, from a management perspective.”