IBM speeds deep learning by using multiple servers

IBM speeds deep learning by using multiple servers

IBM's Distributed Deep Learning spreads model training across any number of hardware nodes—as long as they’re IBM nodes

For everyone frustrated by how long it takes to train deep learning models, IBM has some good news: It has unveiled a way to automatically split deep-learning training jobs across multiple physical servers -- not just individual GPUs, but whole systems with their own separate sets of GPUs.

Now the bad news: It's available only in IBM's PowerAI 4.0 software package, which runs exclusively on IBM's own OpenPower hardware systems.

Distributed Deep Learning (DDL) doesn't require developers to learn an entirely new deep learning framework.

It repackages several common frameworks for machine learning: TensorFlow, Torch, Caffe, Chainer, and Theano. Deep learning projecs that use those frameworks can then run in parallel across multiple hardware nodes.

IBM claims the speed-up gained by scaling across nodes is nearly linear. One benchmark, using the ResNet-101 and ImageNet-22K data sets, needed 16 days to complete on one IBM S822LC server. Spread across 64 such systems, the same benchmark concluded in seven hours, or 58 times faster.

IBM offers two ways to use DDL. One, you can shell out the cash for the servers it's designed for, which sport two Nvidia Tesla P100 units each, at about $50,000 a head.

Two, you can run the PowerAI software in a cloud instance provided by IBM partner Nimbix, for around $0.43 an hour.

One thing you can't do, though, is run PowerAI on commodity Intel x86 systems. IBM has no plans to offer PowerAI on that platform, citing tight integration between PowerAI's proprietary components and the OpenPower systems designed to support them.

Most of the magic, IBM says, comes from a machine-to-machine software interconnection system that rides on top of whatever hardware fabric is available.

Typically, that's an InfiniBand link, although IBM claims it can also work on conventional gigabit Ethernet (still, IBM admits it won't run anywhere nearly as fast).

It's been possible to do deep-learning training on multiple systems in a cluster for some time now, although each framework tends to have its own set of solutions.

With Caffe, for example, there's the Parallel ML System or CaffeOnSpark. TensorFlow can also be distributed across multiple servers, but again any integration with other frameworks is something you'll have to add by hand.

IBM's claimed advantage is that it works with multiple frameworks and without as much heavy lifting needed to set things up. But those come at the cost of running on IBM's own iron.

This article originally appeared on InfoWorld.

Follow Us

Join the newsletter!


Sign up to gain exclusive access to email subscriptions, event invitations, competitions, giveaways, and much more.

Membership is free, and your security and privacy remain protected. View our privacy policy before signing up.

Error: Please check your email address.

Tags CloudIBMstorageserver



Tech industry comes out in force as Lancom turns 30

Tech industry comes out in force as Lancom turns 30

A host of leading vendors and customers came together to celebrate the birthday of Lancom Technology in New Zealand, as the technology provider turned 30.

Tech industry comes out in force as Lancom turns 30
The making of an MSSP: a blueprint for growth in NZ

The making of an MSSP: a blueprint for growth in NZ

Partners are actively building out security practices and services to match, yet remain challenged by a lack of guidance in the market. This exclusive Reseller News Roundtable - in association with Sophos - assessed the making of an MSSP, outlining the blueprint for growth and how partners can differentiate in New Zealand.

The making of an MSSP: a blueprint for growth in NZ
Show Comments