How Rakuten freed itself of its Hadoop investment in two years
- 26 June, 2020 10:22
Based in San Mateo, California, Rakuten Rewards is a shopping rewards company that makes money through affiliate marketing links across the web. Members earn reward points and cash back every time they make a purchase through a partner retailer.
Naturally this generates a lot of user insight data – hundreds of terabytes in active storage, with more in cold storage.
In 2018 the business got serious about giving more users access to these insights – even those without Python or Scala coding chops – while also reducing its capital expenditure on hardware. That meant looking to the cloud.
‘SQL server machines don’t scale elegantly’
Formerly known as Ebates, the business was acquired in 2014 by the Japanese e-commerce giant Rakuten and has been growing fast ever since, forcing a drive to modernise its technology stack and become more data-driven in the way it attracts and retains customers.
This starts with the architecture. In the past three years Rakuten Rewards has moved its big data estate from largely on-prem SQL to on-prem Hadoop to, today, a cloud data warehouse courtesy of Snowflake.
“SQL server machines don’t scale elegantly, so we went on-premises Hadoop with Cloudera, using Spark and Python to run ETL, and got some performance out of that,” vice president for analytics at Rakuten Rewards, Mark Stange-Tregear, told InfoWorld.
“Managing that [Hadoop] structure is not trivial and somewhat complicated, so when we saw the cloud warehouses coming along we decided to move and have this centralised enterprise-level data warehouse and lake,” he said.
As former Bloomberg developer and big data consultant Mark Litwintschik argues in his blog post 'Is Hadoop Dead?', the world has moved on from Hadoop after the halcyon days of the early 2010s.
Now, cloud frameworks which take much of the heavy lifting away from data engineering teams are proving more popular with enterprises looking to reduce the cost of having on-prem machines sit idle – and to streamline their analytics operations overall.
Moving on from Hadoop
So Stange-Tregear and lead data engineer Joji John decided in mid-2018 to start a major data migration from the company's core systems to the Snowflake cloud data warehouse on top of Amazon Web Services (AWS) public cloud infrastructure.
That migration started with the reporting layer and some of the most-used data sets across the business, before moving ETL and actual data generation workloads, all of which was completed towards the end of 2019, barring some more sensitive HR and credit card information.
By leveraging cloud computing, Rakuten is better able to scale up and down for peak shopping times. Snowflake also allows the company to split its data lake into a series of different warehouses of different shapes and sizes to meet the requirements of different teams, even spinning up new ones for one-off projects as required, without teams competing for memory or CPU capacity on a single cluster.
Previously, “a big SQL query from one user could effectively block or bring down other queries from other users, or would interrupt parts of our ETL processing,” Stange-Tregear explained. “Queries were taking longer and longer to run as the company grew and our data volumes exploded.
“We ended up having to try and replicate data onto different machines just to avoid these issues, and then introduced a series of other issues as we had to handle the scope for large-scale data replication and syncing.”
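In Snowflake, that per-team isolation is achieved with separate virtual warehouses. A minimal sketch of the kind of setup described above – warehouse names and sizes here are illustrative, not Rakuten's actual configuration – might look like this:

```sql
-- Hypothetical per-team virtual warehouses (names and sizes are illustrative).
-- Each warehouse is an isolated pool of compute, so a heavy query on one
-- cannot starve queries or ETL jobs running on another.
CREATE WAREHOUSE IF NOT EXISTS reporting_wh
  WITH WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;

CREATE WAREHOUSE IF NOT EXISTS etl_wh
  WITH WAREHOUSE_SIZE = 'X-LARGE' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;

-- A one-off project can spin up its own warehouse and drop it when done:
CREATE WAREHOUSE IF NOT EXISTS adhoc_project_wh
  WITH WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE
       INITIALLY_SUSPENDED = TRUE;
```

Because warehouses auto-suspend when idle and auto-resume on demand, teams pay only for the compute they actually use – the opposite of on-prem machines sitting idle.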
How Rakuten rewards its analysts
Now Rakuten can more easily reprocess customer segments – down to a single user's entire shopping history – every day, then remodel their interest areas for more effective marketing targeting or recommendation modelling. This helps hit a customer with a targeted offer at the moment they are really considering buying that new pair of shoes, rather than giving them time to think about it.
“For tens of millions of accounts, we can crank that through several times a day,” Stange-Tregear explained. The results are packaged into a JSON model for each member profile, recalculated for all users multiple times a day, and queried with just a few lines of SQL.
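Snowflake stores such JSON models natively in VARIANT columns and exposes them through colon path notation, so an analyst can interrogate a per-member profile in a few lines of SQL. A hypothetical sketch – the table and field names are illustrative, not Rakuten's schema:

```sql
-- Hypothetical member_profiles table with a VARIANT column holding each
-- member's JSON interest model (all names here are illustrative).
SELECT member_id,
       profile:top_category::string         AS top_category,
       profile:interest_scores:shoes::float AS shoe_affinity
FROM   member_profiles
WHERE  profile:interest_scores:shoes::float > 0.8;  -- members likely shopping for shoes
```

No Python or Spark is required: the JSON fields are addressed directly in SQL and cast to typed columns with `::`.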
This greatly democratises analytics: granular insights are no longer the preserve of data scientists with Python or Spark skills, but available to any analyst familiar with SQL.
“It’s easier to find people who code in SQL than Scala, Python, and Spark,” Stange-Tregear admits. “Now my analytics team – some with Python skills and less with Scala – can create data pipelines for reporting, analytics, and even feature engineering more easily as it comes in a nice SQL package.”
Other big data jobs, like processing payment runs, now also take significantly less time thanks to the performance boost of the cloud.
“Processing hundreds of millions of dollars in payments takes a lot of work,” Stange-Tregear said. “Those runs used to be a material quarterly effort which took weeks, now we can rescore and process that and recalibrate in a couple of days.”
Life after Hadoop
All of this effort comes with some cost efficiencies, too. Stange-Tregear, Joji John, and the CFO now all receive daily Tableau reports detailing data processing spend, split by business function.
“We can see the effective cost for each [function] and make that consistent over time,” Stange-Tregear explained. “We can easily go in and see where we are spending and where to spend time optimising, and new workloads show us the cost immediately. That was difficult with Hadoop.”
Like many companies before it, Rakuten Rewards milked as much value out of its Hadoop investment as possible, but when an easier-to-maintain platform emerged – one that also let a much wider range of users benefit – the rewards of migrating far outweighed the costs.