Five lessons of a datacentre overhaul
- 17 June, 2008 22:00
What are the three most important ingredients of a successful project? Planning, planning, planning. For our datacentre makeover at the University of Hawaii's School of Ocean and Earth Science and Technology, we planned early and often, and still got bitten by last-minute surprises and devilish details that cost us time and money. We'll do it a little differently next time. You too can learn from our mistakes.
Our little room in the Hawaii Institute of Geophysics, HIG 319, was no stranger to servers, though it only had a casual acquaintance with them. When we started the project, the room had six racks installed, one with an 80kW APC InfraStruXure UPS being used at 40kW capacity, and most of the rest of the racks only partially populated with servers for the various SOEST departments.
SOEST needed the new datacentre to house a number of new server clusters for use by the research labs. An initial estimate would add three clusters composed of a mix of traditional servers and blade servers housed in new racks. Managing this upgrade would require doubling HIG 319's square footage, adding 250 amps of electrical power on a new breaker panel, and completely revamping the cooling system, which at the beginning of the project consisted of three wall-mounted window-style air conditioners that were already giving their all, to little effect.
Although HIG 319 had some drawbacks in terms of location, the tight deadline precluded any more political wrangling for a more favorable position on the building's ground floor, which was occupied by several research labs. Besides, the maintenance corridor directly behind the room was a welcome advantage, and the room directly next to HIG 319 was a little-used storage room exactly the same size. Combining the rooms would give us the square footage we needed. We drew a deep breath and took the plunge.
Lesson 1: Give your physical space a good physical
A basic task list was fleshed out in February of 2007 and work began immediately, temporarily moving HIG 319's existing servers, removing whatever artifacts were being stored in HIG 319a, knocking down the wall separating the two rooms, and gutting everything else. A sexy new tile floor had been installed, the walls painted, and new lighting wired up when the campus facilities management department threw us the first curve ball.
Because the SOEST building is almost 50 years old, it's standard UH practice to have a structural engineer vet the room before anything as heavy as a new datacentre is installed -- we just didn't find out about that little detail until it was too late to go anywhere else. Further, because the building's original structural records had long since disappeared into Hawaii's tropical ether, the engineer had to start from scratch with his calculations.
This effectively paralyzed the project for a solid month, since nothing could happen until the engineer rendered his verdict. Four weeks later the engineer announced the floor stable...barely. While the two rooms could house a datacentre, it would have to be a lightweight datacentre because most of the racks would be limited to an 800-pound maximum load, the few exceptions being certain areas over the support beams. That was a nasty kick in the nethers, given that a fully loaded cluster-running rack can weigh as much as 2000 pounds and we had planned on using six of the 12 racks in the new datacentre for Beowulf clusters. Strike one -- back to the drawing board.
A flurry of tropical meetings later and we had what looked like an effective workaround. The four server clusters would move to another location, while the HIG datacentre would now house departmental servers from the various SOEST departments in 12 APC InfraStruXure racks. This would effectively make HIG 319 the central datacentre for all these departments while freeing up space for the clusters at the other locations. Not an optimal solution, but a necessary move if the college intended to install the new server clusters it wanted.
Lesson 2: Don't skimp on professional services
Work on gutting and remodeling HIG 319 resumed and we made our first official contacts with APC for power and cooling solutions and rack requirements. The information we received back took into account our square footage, the current electrical and cooling specs of the two rooms, and our intended server and rack load. APC ran all these figures through its datacentre planning tool and sent back a series of PDFs that gave us an initial floor plan, the names and model numbers of the power and cooling solutions they recommended, and a basic blueprint of every rack in the new datacentre. Initially this looked great, but later we found we'd made a critical mistake.
APC was kind enough to volunteer not only the equipment, but also manpower for the project. Understandably, the company wanted to save as much money as it could here, so our project was run using the cost-savings model rather than the full-on professional service model of APC datacentre design. The deluxe model would have required more manpower in the form of a project manager on APC's side.
For readers embarking on their own datacentre project, we can't over-recommend spending the money on full professional services consulting with a core vendor such as APC. Had we had the good sense to solicit the service, UH reps say they would have tried to come up with the money somewhere, because trying to save cash by running without such help is very risky -- as we were about to find out.
Even at this early planning stage, an APC project manager would have gone over every detail in a conference call, whereas we simply received PDF-laden e-mails; he also would have given recommendations for installing the wiring, piping, and other prerequisites. Opting for the unroyal treatment, we were simply referred to a reference page on APC's Web site that showed piping specs for a variety of different cooling solutions. Left to our own devices -- and the recommendation of a UH air-conditioning engineer who misunderstood some specifications -- we made the wrong choice.
In short, there's no substitute for expert guidance. An APC project manager would have made this selection for us and simply told us what to install. The right piping would have been a no-brainer from the start, instead of a last-minute correction, and nearly a costly rip-out-and-replace exercise.
Lesson 3: Saddle a project team member with detail duty
We did get some good advice from APC on our cooling solution, though even here a consultant would have helped. APC consultant or no, it would have been a good move to assign one of our project team to detail duty. We had a project lead coordinating activity and making sure the work was getting done. But we had no one tracking those critical little details -- product specifications, order status, supporting documentation -- that set us back time and again.
The order for our cooling solution was a case in point. Originally, we'd hoped to use the building's chill-water cooling, because that's typically the most cost-effective choice for small datacentres like this one. However, the chill water capacity was already taken up cooling existing labs. We'd have to use something else. APC's product engineers put their heads together and recommended the InRow RP, a solution that uses two roof-mounted condensers matched to two APC SX rack-mounted compressors and evaporator assemblies. The InRow RP was the next-best thing to chill-water from a cost standpoint, and installation promised to be straightforward. Install the appropriate mounting brackets on the roof and run the right piping to HIG 319-319a through the pipe chase behind the room, and we'd be good to go. The best part is that the InRow solution is significantly more efficient than traditional datacentre cooling units, so we're banking on real energy savings as well.
After the adventure with the just-in-time structural inspection, SOEST's lead facilities manager, Phil Rapoza, insisted on proceeding with extreme care. Phil flat refused to begin construction on the condenser roof mounts until the condensers actually arrived. A good thing, too, because the two multi-ton condenser units we received were somewhat different from the unit described in APC's submittal drawings -- different enough that the mounting brackets originally spec'd would have been useless.
One last condenser problem came to light only shortly before the units were ready to ship. Our project team assumed that APC's sales team would know to coat the condensers with outdoor sealant for Hawaii's highly salty, rust-inducing atmosphere. But without an APC project manager on the job, or any of us minding the order, the APC sales people weren't even consulted. As a result, immediately upon arrival the condensers had to be moved from the shipping company's truck onto a university truck and taken to a weather coating professional elsewhere on the island. This at considerable additional expense to SOEST, and the additional cost of a five-day delay during the construction phase of the project.
Putting a project team member in charge of tracking order details and other minutiae likely would have avoided these difficulties. Weather-proofing would have been included in the original condenser order. Changes to the condenser models would have been noted long before they arrived, eliminating confusion over the mounting brackets. If you're diving into a datacentre project, you'll want to make sure that a dedicated, detail-oriented project manager is at the top of the budget list. Trust us when we say that the position will more than pay for itself as the project moves along.
Lesson 4: Hold your team close, and your vendors closer
One area for a project manager's special attention is the vendors. There are plenty of places a vendor can trip you up. Watch them like a hawk.
One example was our shipping experience with APC. The sheer volume of gear that APC shipped us for this project was staggering. We wound up using an entire 40-foot container truck to deliver our goods -- and it was stuffed. You don't overnight something like that. That gets shipped via ground and sea by a contractor other than APC. And that's where the trouble starts.
Naturally, we simply took APC's word that the shipment was en route as ordered -- and they, in turn, were taking the shipping company's word for it. It turned out, however, that our stuff was ordered and consequently shipped later than we'd thought. That became an issue right around the time we realized the cooling condensers had to be weather coated. Because the project deadline loomed just two weeks away, adding a week or so of weather coating into the schedule was a big problem.
But when APC tried to find our goods in the shipping company's records to see if we could either halt the condenser delivery so APC could coat them, or speed it up so we wouldn't be so crunched for time, the shipping company couldn't give us an exact location. By the time it could, the condensers were bobbing across the Pacific. We couldn't even get the shipping company to prioritize our container so it would get dropped on the dock early Monday morning. We ended up having to shift project deadlines and travel schedules. Staying on top of your vendor's shipping process may be a pain, but it will serve up golden dividends of efficiency on project day.
Another important part of vendor watching is staying on top of equipment orders. We weren't nearly careful enough here. Don't just place the order, glance at the P.O., and assume they're shipping what you want. We did and it hurt. Even the best vendors with the best intentions can make critical mistakes when filling orders. Only the caution of Phil Rapoza, our facilities manager, saved us from APC's condenser spec-and-switch. We also had a full cable management system spec'd out and ordered, but suddenly the vendor (who shall remain nameless) backed out, claiming resource problems. Here again, Phil Rapoza and his band of merry men saved the day, fabricating cable ladders customized for the room when an alternative supplier couldn't be found in time.
Your project's problems might have different root causes, but in an industry that moves as fast as ours, companies can go out of business, shift direction, or be acquired over the course of a weekend, leaving customers holding the bag when orders disappear into the ether. Count on orders and shipments to go wrong. Plan for the unexpected by getting an early jump when you can and building time for unexpected delays into your project schedule.
Lesson 5: Make a migration checklist and check it twice
Finally, moving day arrived. To make our migration easier, we'd contacted the boisterous folks at Silverback Migration Solutions, a Walnut Creek, CA-based outfit that specializes in helping companies perform datacentre migrations and build-outs. Where a general IT staff might take days to put racks together, add shelving and other accessories, slide the servers in on new rails, and test functionality end-to-end, Silverback cranks through these tasks in record time, sometimes installing 30 or 40 full racks of servers in a single day. (In our case, it was 10 racks in a few hours; see "Pimp my datacentre: SilverBack Migration Solutions".)
But while Silverback's on-site reps were willing, our planning was weak. Though we'd had months to do the prep work, we'd slipped into complacency and simply assumed certain things would work out as desired. Murphy gleefully proved us wrong.
Using APC's datacentre planning software, we'd created the necessary blueprint of our new physical layout, but instead of fleshing that out we let it go and assumed that auto-generated floor maps were enough. A conversation with Silverback and Rackwise reps cured us of that notion, sending us back to the drawing board to fill in some important gaps.
APC's rack maps were a good start, but they're not designed to take into account customized weights of individual pieces of infrastructure -- they use reference weights from a vendor database in order to provide ballpark figures. So the standard weight APC provides for a Dell PowerEdge 1650, for example, might reflect a configuration with two hard drives, whereas our servers might have four. Not a big difference for one server, but when you multiply by dozens per rack, and you're facing an 800-pound weight limit, the true weight becomes important. We were forced to make several rack reconfiguration decisions on the fly.
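The arithmetic behind that check is simple enough to sketch ahead of time. The Python below is a minimal illustration, not the tool we used: the function names, the enclosure weight, and the per-server weights are hypothetical placeholders, and only the 800-pound limit comes from the engineer's verdict. Weigh your own gear, or look up the weight for the exact configuration, before trusting any numbers.

```python
# Hypothetical sketch: check a planned rack load against a floor limit.
# All weights below are made-up placeholders, not vendor or measured figures.

FLOOR_LIMIT_LB = 800  # per-rack limit from the structural engineer's verdict


def rack_weight(empty_rack_lb, item_weights_lb):
    """Total weight of a rack: the enclosure plus everything mounted in it."""
    return empty_rack_lb + sum(item_weights_lb)


def headroom(empty_rack_lb, item_weights_lb, limit_lb=FLOOR_LIMIT_LB):
    """Pounds remaining under the limit; negative means the rack is overweight."""
    return limit_lb - rack_weight(empty_rack_lb, item_weights_lb)


if __name__ == "__main__":
    empty_rack = 275        # placeholder enclosure weight
    reference_server = 30   # placeholder database weight (two drives)
    actual_server = 33      # placeholder as-configured weight (four drives)
    count = 16              # placeholder number of servers in the rack

    planned = headroom(empty_rack, [reference_server] * count)
    actual = headroom(empty_rack, [actual_server] * count)
    print(f"Headroom by reference weights: {planned} lb")
    print(f"Headroom by actual weights:    {actual} lb")
```

With these placeholder figures, the rack clears the limit by 45 pounds on reference weights but lands 3 pounds over as configured, which is exactly the kind of gap that forced our on-the-fly reshuffling.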
A second important omission was failing to gather full technical documentation for the equipment being migrated. Because the HIG 319 datacentre was to serve as a co-location facility for a number of SOEST departments, we would be re-racking, rewiring, and reassembling systems from all around campus. Detailed notes -- and aha, admin passwords! -- would be needed to put them all back together again. Yes, we not only lacked detailed information on how some of that equipment was configured, we even failed to collect the admin passwords for six servers we moved from another building. That meant we couldn't bring them up for testing until their research administrators could be found. Like most server migrations, ours was performed during off-hours, so not having the passwords wound up pushing the final equipment testing part of our plan well into production hours on the following day.
A migration day typically leaves little room for error or indecision. Before that day comes, you should have a punch list -- a list of detailed, step-by-step instructions -- that will guide everyone's actions from start to completion. We suggest that all team members keep notes on loose ends and to-dos, and provide them to the team leader in time to create the punch list before any vendors go home. Don't let your solutions providers leave unfinished business behind. It will only be harder to finish without their help.
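One way to catch gaps like our missing passwords before moving day is to keep the equipment inventory in a form that a script can check. The sketch below is purely illustrative: the field names, the sample server records, and the functions are assumptions invented for the example, not anything we actually ran. It simply flags any server whose record lacks an item you have decided is required.

```python
# Hypothetical sketch: flag servers that are not ready to migrate.
# Field names and sample records are illustrative assumptions.

REQUIRED_FIELDS = (
    "owner_contact",
    "admin_credentials_on_file",
    "config_notes",
    "target_rack",
)


def missing_items(server):
    """Return the required fields that are absent or empty for one server record."""
    return [field for field in REQUIRED_FIELDS if not server.get(field)]


def readiness_report(inventory):
    """Map each not-ready server name to the list of items still missing."""
    report = {}
    for server in inventory:
        gaps = missing_items(server)
        if gaps:
            report[server["name"]] = gaps
    return report


if __name__ == "__main__":
    inventory = [
        {"name": "lab-a-01", "owner_contact": "lab A admin",
         "admin_credentials_on_file": True,
         "config_notes": "dual NIC, RAID 5", "target_rack": "R3"},
        {"name": "lab-b-02", "owner_contact": "lab B admin",
         "admin_credentials_on_file": False,
         "config_notes": "", "target_rack": "R7"},
    ]
    for name, gaps in readiness_report(inventory).items():
        print(f"{name}: missing {', '.join(gaps)}")
```

Run a check like this a week or more before the move, while the research administrators who hold the passwords can still be found during working hours.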
Finally, as project leader you can't expect to be everyone's friend if you want your project to succeed. Guide your team and your vendors with a firm hand, and throw a completion party to smooth over any bruised egos. And be sure to schedule a wrap-up meeting to discuss what went right and what went wrong. Someday you might have to do this again.