USENIX researchers get a grip on Hadoop performance


Modeling Hadoop jobs can be tricky because of all the moving parts, researchers say

Now that big data technologies like Apache Hadoop are moving into the enterprise, system engineers must start building models that can estimate how much work these distributed data processing systems can do and how quickly they can get their work done.

Having accurate models of big data workloads means organizations can better plan and allocate resources to these jobs, and can confidently assert when the results of this work can be delivered to customers.

Estimating big data jobs, however, is tricky business, and the process cannot rely entirely on traditional modeling tools, according to researchers speaking at the USENIX annual conference on autonomic computing, being held this week in Philadelphia.

"It's almost impossible to be accurate, because you are dealing with a non-deterministic system," said Lucy Cherkasova, a researcher at Hewlett-Packard Labs.

She explained that Hadoop systems are non-deterministic because they have a wide range of variable factors that can contribute to how long it takes for a job to finish.

The average Hadoop system might have up to 190 parameters to set in order to start running, and each Hadoop job may have different requirements for how much computation, bandwidth, memory or other resources it needs.

Cherkasova has been working on models, and associated tools, to estimate how long a large data processing job will take to run on Hadoop or other large data processing systems, in a project called ARIA (Automatic Resource Inference and Allocation for MapReduce Environments).

ARIA aims to answer the question, "How many resources should I allocate to this job, if I want to process this data by this deadline," Cherkasova said.

One might assume that if you double the number of resources of a Hadoop job, the time required to complete the job would be cut in half. "This is not the case" with Hadoop, Cherkasova said.

Job profiles can change in non-linear ways depending on the number of servers being used. The performance bottlenecks in a Hadoop cluster of 66 nodes are different from the bottlenecks found in a Hadoop cluster of 1,000 nodes, she said.
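To illustrate why doubling resources does not necessarily halve completion time, here is a minimal sketch of a toy "wave" model. It is not ARIA's model, which the article does not describe. The assumptions are ours: every task takes the same time, tasks run in waves of at most one task per slot, and each job carries a fixed startup overhead. The function names and the numbers in the example are hypothetical.

```python
import math


def estimated_runtime(num_tasks, num_slots, task_seconds, overhead_seconds=0.0):
    """Toy estimate: tasks run in waves of at most num_slots at a time.

    Assumes identical task durations and a fixed per-job overhead.
    This is an illustrative simplification, not ARIA's actual model.
    """
    if num_slots < 1:
        raise ValueError("num_slots must be at least 1")
    waves = math.ceil(num_tasks / num_slots)
    return overhead_seconds + waves * task_seconds


def min_slots_for_deadline(num_tasks, task_seconds, deadline_seconds,
                           overhead_seconds=0.0):
    """Smallest slot count whose toy estimate meets the deadline, or None."""
    budget = deadline_seconds - overhead_seconds
    max_waves = math.floor(budget / task_seconds)
    if max_waves < 1:
        return None  # even a single wave would miss the deadline
    return math.ceil(num_tasks / max_waves)


if __name__ == "__main__":
    # Hypothetical job: 150 tasks of 60 seconds each, 30 seconds of overhead.
    for slots in (50, 100, 200):
        print(slots, "slots:", estimated_runtime(150, slots, 60, 30), "seconds")
    print("slots needed for a 150-second deadline:",
          min_slots_for_deadline(150, 60, 150, 30))
```

In this toy setting, going from 50 to 100 slots cuts the estimate from 210 to 150 seconds rather than to 105, because the number of waves drops only from three to two and the overhead does not shrink at all. Real clusters add further effects, such as the shifting bottlenecks Cherkasova describes, that a model this simple does not capture.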

Performance can vary according to the type of job as well. Some of the research Cherkasova carried out involved studying what size of virtual machine would be best suited for Hadoop jobs.

For instance, Amazon Web Services (AWS) offers a range of virtual servers, from small instances with a single processor to larger ones with eight or more processors. Because Hadoop is a distributed system, it was made to run on multiple servers. But would it be more cost-effective to run Hadoop across many smaller instances, or on fewer, larger instances?

Cherkasova found that the answer depends on the workload.

One type of job, Terasort (http://www.highlyscalablesystems.com/3235/hadoop-terasort-benchmark/), in which a large amount of data is sorted, can be completed five times more quickly by using a collection of small AWS instances compared to using the large instances.

The performance of another type of job, the Kmeans clustering algorithm, does not vary with the kind of instance used, however. It runs equally well on small, medium, or large instances, meaning the user can run a Kmeans job on the more cost-effective large instances without sacrificing any speed.
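The cost side of this comparison comes down to simple arithmetic: instances multiplied by hourly price multiplied by running time. The sketch below shows that calculation only. The instance counts, prices, and runtimes are hypothetical placeholders chosen for illustration; they are not AWS prices or figures from Cherkasova's study.

```python
def total_cost(num_instances, hourly_price, runtime_hours):
    """Total bill for a job: instances x price per instance-hour x hours."""
    return num_instances * hourly_price * runtime_hours


if __name__ == "__main__":
    # Hypothetical configurations with the same total hourly spend (2.00/hour).
    many_small = {"num_instances": 40, "hourly_price": 0.05}
    few_large = {"num_instances": 5, "hourly_price": 0.40}

    # Hypothetical sort-like job: five times slower on the large instances.
    print("sort, small:", total_cost(runtime_hours=1.0, **many_small))
    print("sort, large:", total_cost(runtime_hours=5.0, **few_large))

    # Hypothetical job whose runtime does not depend on instance size.
    print("clustering, small:", total_cost(runtime_hours=2.0, **many_small))
    print("clustering, large:", total_cost(runtime_hours=2.0, **few_large))
```

With equal hourly spend, the job that runs five times slower on large instances costs five times as much there, while the job whose runtime is insensitive to instance size costs the same either way. For the second kind of job, the decision rests entirely on which configuration has the lower hourly price.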

Cherkasova's work in this field has been important because to date there have been very few widely cited studies on modeling Hadoop performance, said Anshul Gandhi, an IBM researcher who was on the USENIX organizing committee for the conference.

Studying Hadoop can be a challenge because few researchers have access to large Hadoop systems, which are too costly to build and test, Gandhi said.

Cristina Abad, a computer science Ph.D. candidate at the University of Illinois at Urbana-Champaign, has also been working in this area.

Abad has developed a benchmark, called MimesisBench, designed to model the performance of next-generation storage systems, and has modeled a workload on a 4,100-node Yahoo cluster running the Hadoop Distributed File System (HDFS).

The benchmark can help determine if a storage system can accommodate an increased workload, which can be valuable information for determining whether to make major architectural changes when increasing the throughput of a data processing system.

The benchmark showed, for instance, that the Yahoo cluster would start experiencing increased latency when handling more than approximately 16,800 operations per second, which was higher than expected.

The benchmark could also help in other architectural decisions. For its storage system, Yahoo used a hierarchical namespace, in which files are organized into groups or subdirectories. If Yahoo were to use a flat namespace, where all the files are located in a single list, latency would have started spiking at about 10,284 operations per second, the model showed.

Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is Joab_Jackson@idg.com

