Google, Amazon reveal their secrets of scalability

As large IT systems scale to unforeseen levels of complexity, new laws of effective management come into play

Internet giants such as Google and Amazon run IT operations that are far larger than most enterprises even dream of, but lessons they learn from managing those humongous systems can benefit others in the industry.

At a few conferences in recent weeks, engineers from Google and Amazon revealed some of the secrets they use to scale their systems with a minimum of administrative headache.

At the Usenix LISA (Large Installation System Administration) conference in Washington, Google site reliability engineer Todd Underwood highlighted one of the company's imperatives that may be surprising: frugality.

"A lot of what Google does is about being super-cheap," he told an audience of systems administrators.

Google is forced to maniacally control costs because it has learned that "anything that scales with demand is a disaster if you are not cheap about it."

As a service grows more popular, its costs must grow in a "sub-linear" fashion, he said.
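A rough way to see what sub-linear growth means is to compare the cost of adding a million users under two made-up cost models. The sketch below is illustrative only: the function names, the unit cost and the square-root exponent are assumptions chosen for the example, not figures Underwood gave.

```python
# Hypothetical cost models; the constants are illustrative assumptions,
# not figures from Google.

def linear_cost(users, unit_cost=0.01):
    """Cost that grows in direct proportion to the number of users."""
    return unit_cost * users


def sublinear_cost(users, scale=1.0, exponent=0.5):
    """Cost that grows more slowly than the user count (exponent < 1)."""
    return scale * users ** exponent


def marginal_cost(cost_fn, users, added=1_000_000):
    """Extra cost of adding `added` users on top of an existing base."""
    return cost_fn(users + added) - cost_fn(users)


if __name__ == "__main__":
    # Compare the price of the next million users at two service sizes.
    for base in (10_000_000, 100_000_000):
        print(base,
              round(marginal_cost(linear_cost, base)),
              round(marginal_cost(sublinear_cost, base), 1))
```

Under the linear model, each additional million users costs the same no matter how large the service already is. Under the sub-linear model, the increment shrinks as the base grows.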

"Add a million users, you really have to add less than a 1,000 quanta of whatever expense you are incurring," Underwood said. A "quanta" of expense could be people's time, compute resources, or power.

That thinking is behind Google's efforts not to purchase off-the-shelf routing equipment from companies such as Cisco or Juniper. Google would need so many ports that it's more cost-effective to build its own, Underwood said.

He rejected the idea that the challenges Google faces are unique to a company of its size. For one, Google is composed of many smaller services, such as Gmail and Google+.

"The scale of all of Google is not what most application developers inside of Google deal with. They run these things that are comprehensible to each and every one of you," he told the audience.

Another technique Google employs is to automate everything possible. "We're doing too much of the machines' work for them," he said.

Ideally, an organization should get rid of its system administration altogether, and just build and innovate on existing services offered by others, Underwood said, though he admitted that's not feasible yet.

Underwood, who has a flair for the dramatic, stated: "I think system administration is over, and I think we should stop doing it. It's mostly a bad idea that was necessary for a long time but I think it has become a crutch."

Google's biggest competitor is not Bing or Apple or Facebook. Rather, it is itself, he said. The company's engineers aim to make its products as reliable as possible, but that's not their sole task. If a product is too reliable -- which is to say, beyond five nines of reliability (99.999 percent) -- then that service is "wasting money" in the company's eyes.
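To put the five-nines figure in concrete terms, an availability target can be converted into the amount of downtime it permits over a year. The following is a minimal sketch of that arithmetic; the function names and the notion of a "downtime budget" are illustrative assumptions, not a description of Google's internal tooling.

```python
# Illustrative arithmetic only; helper names are hypothetical.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(target):
    """Minutes of downtime per year allowed by an availability target."""
    return (1.0 - target) * MINUTES_PER_YEAR


def budget_remaining(target, downtime_so_far):
    """Unspent downtime allowance, in minutes, for the year."""
    return downtime_budget_minutes(target) - downtime_so_far


if __name__ == "__main__":
    for target in (0.999, 0.9999, 0.99999):
        print(target, round(downtime_budget_minutes(target), 2))
```

At 99.999 percent the allowance works out to a little over five minutes a year. A service that leaves most of that allowance unspent is, by the reasoning above, more reliable than it needs to be.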

"The point is not to achieve 100 percent availability. The point is to achieve the target availability -- 99.999 percent -- while moving as fast as you can. If you massively exceed that threshold you are wasting money," Underwood said.

"Opportunity costs is our biggest competitor," he said.

The following week at the Amazon Web Services (AWS) re:Invent conference in Las Vegas, James Hamilton, AWS' vice president and distinguished engineer, discussed the tricks Amazon uses to scale.

Though Amazon is selective about what numbers it shares, AWS is growing at a prodigious rate. Each day, it adds as much compute capacity (servers, routers, data center gear) as it had in total in the year 2000, Hamilton said. "This is a different type of scale," he said.

Key for AWS, which launched in 2006, was good architectural design. Hamilton admitted that Amazon was lucky to have got the architecture for AWS largely correct from the beginning.

"When you see fast growth, you learn about architecture. If there are architectural errors or mistakes made in the application, and the customers decide to use them in a big way, there are lots of outages and lots of pain," Hamilton said.

The cost of deploying a service on AWS comes down to setting up and deploying the infrastructure, Hamilton explained. For most organizations, IT infrastructure is an expense, not the core of their business. But at AWS, engineers focus solely on driving down costs for the infrastructure.

Like Google, Amazon often builds its own equipment, such as servers. That's not practical for enterprises, he acknowledged, but it works for an operation as large as AWS.

"If you have tens of thousands of servers doing exactly the same thing, you'd be stealing from your customers not to optimize the hardware," Hamilton said. He also noted that servers sold through the regular IT hardware channel often cost about 30 percent more than buying individual components from manufacturers.

Not only does this allow AWS to cut costs for customers, but it also allows the company to talk with the component manufacturers directly about improvements that would benefit AWS.

"It makes sense economically to operate this way, and it makes sense from a pace-of-innovation perspective as well," Hamilton said.

Beyond cloud computing, another field of IT that deals with scalability is supercomputing, in which a single machine may have thousands of nodes, each with dozens of processors. On the last day of the SC13 supercomputer conference, a panel of operators and vendors assembled to discuss scalability issues.

William Kramer, who oversees the National Center for Supercomputing Applications' Blue Waters machine at the University of Illinois at Urbana-Champaign, noted that supercomputing is experiencing tremendous growth, driving the need for new workload scheduling tools to ensure organizations get the most from their investment.

"What is now in a chip -- a single piece of silicon -- is the size of the systems we were trying to schedule 15 years ago," Kramer said. "We've assumed the operating system or the programmer will handle all that scheduling we were doing."

The old supercomputing metrics of throughput seem to be fraying. This year, Jack Dongarra, one of the creators of the Linpack benchmark used to compare computers on the Top500 list, called for additional metrics to better gauge a supercomputer's effectiveness.

Judging a system's true efficiency can be tricky, though.

"You want to measure the amount of work going through the system over a period of time," and not just a simplistic measure of how much each node is being utilized, Kramer said.

He noted that an organization can measure the utilization of a system by measuring the percentage of time each node is utilized. But this approach can be misleading in that a workload can be slowed to increase the utilization rate, but as a result, less work is going through the system overall.
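The gap between the two measures can be shown with a small, made-up example. In the sketch below, the `Run` class and every number in it are hypothetical; they simply describe one day on a 100-node system under two scheduling approaches.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One observation window on a cluster (all values hypothetical)."""
    nodes: int
    window_hours: float
    busy_node_hours: float
    jobs_completed: int

    @property
    def utilization(self):
        # Fraction of available node-hours during which nodes were busy.
        return self.busy_node_hours / (self.nodes * self.window_hours)

    @property
    def throughput(self):
        # Jobs finished per hour of wall-clock time.
        return self.jobs_completed / self.window_hours


# Same 100-node machine over 24 hours, scheduled two different ways.
fast = Run(nodes=100, window_hours=24, busy_node_hours=1680, jobs_completed=120)
slowed = Run(nodes=100, window_hours=24, busy_node_hours=2280, jobs_completed=96)

if __name__ == "__main__":
    for name, run in (("fast", fast), ("slowed", slowed)):
        print(name, run.utilization, run.throughput)
```

In this example the slowed run keeps the nodes busier yet finishes fewer jobs, which is the situation Kramer warned about when utilization is the only number being tracked.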

John Hengeveld, Intel's director of HPC marketing, suggested the supercomputing community take a tip from manufacturers of airplane jet engines.

"At Rolls-Royce, you don't buy a jet engine any longer, you buy hours of propulsion in the air. They ensure you get that number of hours of propulsion for the amount of money you pay. Maybe that is the way we should be doing things now," Hengeveld said. "We shouldn't be buying chips, we should buy results."

Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson.
