Don’t rush to machine learning
28 September, 2021 16:05
It turns out the best way to do machine learning (ML) is sometimes to not do any machine learning at all. In fact, according to Amazon Applied Scientist Eugene Yan, “The first rule of machine learning [is to] start without machine learning.”
Yes, it’s cool to trot out ML models painstakingly crafted over months of arduous effort. It’s also not necessarily the most effective approach. Not when there are simpler, more accessible methods.
It may be an oversimplification to say, as data scientist Noah Lorang did years ago, that “data scientists mostly just do arithmetic.” But he’s not far off, and certainly he and Yan are correct that however much we may want to complicate the process of putting data to work, much of the time it’s better to start small.
Data scientists get paid a lot. So perhaps it’s tempting to try to justify that paycheck by wrapping things like predictive analytics in complicated jargon and ponderous models. Don’t. Lorang’s insight into data science is as true today as when he uttered it a few years back: “There is a very small subset of business problems that are best solved by machine learning; most of them just need good data and an understanding of what it means.” Lorang recommends simpler methods, such as “SQL queries to get data, ... basic arithmetic on that data (computing differences, percentiles, etc.), graphing the results, and [writing] paragraphs of explanation or recommendation.”
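To make Lorang’s point concrete, here is a minimal sketch of the kind of “basic arithmetic” he describes: differences and percentiles, computed with nothing beyond Python’s standard library. The weekly signup figures and the helper names `week_over_week` and `quartiles` are hypothetical, invented purely for illustration; in practice the numbers would come from a SQL query.

```python
import statistics

def week_over_week(values):
    """Differences between consecutive observations."""
    return [later - earlier for earlier, later in zip(values, values[1:])]

def quartiles(values):
    """Three cut points splitting the data into four equal-sized groups."""
    return statistics.quantiles(values, n=4)

# Hypothetical weekly signup counts, standing in for the result of a SQL query.
weekly_signups = [120, 135, 128, 150, 162, 158, 171, 190]

changes = week_over_week(weekly_signups)
q1, q2, q3 = quartiles(weekly_signups)

print("week-over-week changes:", changes)
print("median:", statistics.median(weekly_signups))
print("quartiles:", q1, q2, q3)
```

A dozen lines like these, a chart, and a paragraph of explanation are often enough to answer the business question being asked.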
I’m not suggesting it’s easy. I’m saying that machine learning isn’t where you start when trying to glean insights from data. Nor is it the case that copious quantities of data are necessarily needed. In fact, as Eligible CEO Katelyn Gleason argues, it’s important to “start with the small data [because] it’s eyeballing anomalies that have led me to some of my best findings.” Sometimes it may be enough to plot distributions to check for obvious patterns.
Yes, that’s right: data can be “small enough” that a human can detect patterns and uncover insights.
Small wonder then that iRobot data scientist Brandon Rohrer suggests cheekily: “When you have a problem, build two solutions—a deep Bayesian transformer running on multicloud Kubernetes and a SQL query built on a stack of egregiously oversimplifying assumptions. Put one on your resume, the other in production. Everyone goes home happy.”
Again, this isn’t to say that you should never use ML, and it’s definitely not an argument that ML doesn’t offer real value. Far from it. It’s just an argument against starting with ML. To dig deeper into why, it’s worth reviewing Yan’s article on the topic.
Humans getting to know data
First, Yan notes, it’s important to recognise just how hard it is to pull meaning from data, given the critical ingredients: “You need data. You need a robust pipeline to support your data flows. And most of all, you need high-quality labels.”
In other words, the inputs are tricky enough that it may not be particularly helpful to start by throwing ML models at the problem. At that point, you’re just getting to know your data. Try solving the problem manually or with heuristics (practical methods or shortcuts). Yan highlights this reasoning from Hamel Husain, a machine learning engineer at GitHub: “It will force you to become intimately familiar with the problem and the data, which is the most important first step.”
Assuming you’re dealing with tabular data, Yan says it pays to take a sample of the data, run basic statistics on it (starting with simple correlations), and visualise it, perhaps using scatter plots. For example, instead of building a complicated machine learning model for recommendations, you could simply “recommend top-performing items from the previous period,” Yan argues, then look for patterns in the results. This helps the ML practitioner become more familiar with her data, which in turn will help her build better models, if they prove necessary.
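The “top-performing items from the previous period” heuristic fits in a few lines. The sketch below is one possible reading of it, not Yan’s own implementation: the purchase log, the period labels and the `top_items` function are all hypothetical, and “top-performing” is assumed here to mean “most purchased.”

```python
from collections import Counter

def top_items(events, period, k=3):
    """Return the k most-purchased item IDs within one period."""
    counts = Counter(item for event_period, item in events if event_period == period)
    return [item for item, _ in counts.most_common(k)]

# Hypothetical (period, item_id) purchase log.
events = [
    ("2021-08", "kettle"), ("2021-08", "mug"), ("2021-08", "kettle"),
    ("2021-08", "teapot"), ("2021-08", "mug"), ("2021-08", "kettle"),
    ("2021-09", "mug"),
]

# Recommend in September whatever sold best in August.
recommendations = top_items(events, "2021-08", k=2)
print(recommendations)
```

A baseline like this also gives any later ML model something concrete to beat.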
When does machine learning become necessary or at least advisable?
According to Yan, machine learning starts to make sense when maintaining your non-ML system of heuristics becomes overly cumbersome. In other words, “after you have a non-ML baseline that performs reasonably well, and the effort of maintaining and improving that baseline outweighs the effort of building and deploying an ML-based system.”
There is no hard rule for when this happens, of course, but if your heuristics are no longer practical shortcuts and instead keep breaking, it’s time to consider machine learning, particularly if you already have solid data pipelines and high-quality labels.
Yes, it’s tempting to start with complex ML models, but arguably one of the most important skills a data scientist can have is common sense: knowing when to rely on regression analysis or a few if/then statements rather than ML.