In recent months there appears to be a growing consensus in the startup world that data, not algorithms, is the lifeblood of nascent technology companies. In many ways this makes sense, particularly in light of the proliferation of highly commoditized tools for implementing many common machine learning workflows.
Amazon Machine Learning, for instance, boasts a service that “makes it easy for developers of all skill levels to use machine learning technology.” By abstracting away infrastructure, programming, and in many cases the need for even a cursory understanding of the underlying algorithms, tools such as these have leveled the playing field, allowing many startups to develop products that leverage machine learning techniques.
I receive emails almost daily from vendors selling various “easy button” technologies for data science automation, and I promptly delete them.
Matt Turck, the Managing Director of FirstMark Capital, wrote a well-articulated piece last week in which he described the notion of data network effects: a positive data feedback loop in which users contribute data, either overtly (e.g., Google/LinkedIn login) or implicitly through mining of user behavior (x.ai, SiftScience), resulting in better data that feed algorithms to provide self-reinforcing value to users.
All things being equal, the best way to achieve this enviable state is to be differentiated from competitors at the outset with some proprietary source of information. I completely agree, and would add that conversations about data technology should also reinforce the reality that applying machine learning techniques to data without a rigorous understanding of both the data and the algorithm is reckless.
While many machine learning techniques, such as Random Forests, have wonderfully documented APIs that are arbitrarily scalable, there is no generalized technique that I’m aware of that will make up for poorly engineered input features.
Feature extraction is foundational to data science, and without intimate knowledge of the interior mechanics of algorithms and an intuition for their inadequacies, it is difficult to extract robust, consistent value from a data source with even the most sophisticated of techniques.
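To make the point about input features concrete, here is a minimal sketch rather than anything drawn from a real product. It assumes a synthetic dataset (random points in a square, labeled by whether they fall inside a circle) and a deliberately simple learner, a single-threshold rule; the names `best_threshold_accuracy`, `radius_sq`, and `dist_sq` are hypothetical and chosen only for illustration. The same learner is fit to the raw coordinates and to one engineered feature.

```python
import math
import random


def best_threshold_accuracy(values, labels):
    """Best training accuracy of any single-threshold rule on one feature."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_pos = sum(labels)
    # Baseline: always predict the majority class.
    best = max(total_pos, n - total_pos)
    pos_left = 0
    for i, (_, label) in enumerate(pairs, start=1):
        pos_left += label
        neg_left = i - pos_left
        pos_right = total_pos - pos_left
        neg_right = (n - i) - pos_right
        # Either label the left side 1 and the right side 0, or the reverse.
        best = max(best, pos_left + neg_right, neg_left + pos_right)
    return best / n


random.seed(0)  # fixed seed so the run is repeatable
n_points = 2000
xs = [random.uniform(-1.0, 1.0) for _ in range(n_points)]
ys = [random.uniform(-1.0, 1.0) for _ in range(n_points)]

# Assumed ground truth: class 1 inside a circle covering about half the square.
radius_sq = 2.0 / math.pi
labels = [1 if x * x + y * y < radius_sq else 0 for x, y in zip(xs, ys)]

# Engineered feature: squared distance from the origin.
dist_sq = [x * x + y * y for x, y in zip(xs, ys)]

raw_x_acc = best_threshold_accuracy(xs, labels)
raw_y_acc = best_threshold_accuracy(ys, labels)
engineered_acc = best_threshold_accuracy(dist_sq, labels)

print(f"raw x: {raw_x_acc:.2f}, raw y: {raw_y_acc:.2f}, engineered: {engineered_acc:.2f}")
```

By construction, the distance feature separates the two classes exactly, while no single cut on either raw coordinate can. A more flexible model would likely narrow that gap, but only by spending data and capacity to rediscover what one well-chosen feature encodes directly.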
Machine learning is the “high interest credit card of technical debt”, especially when wielded by those without a rigorous understanding of the underlying mathematics. The emergence of black-box data science toolkits, coupled with the delusional notion that anyone who has ever seen a spreadsheet or written a line of code can be a data scientist, trivializes the nuanced art of data analysis to the detriment of scientific rigor in product development.
Training and debugging machine learning models is often arduous for even the most experienced data scientists, and it becomes insurmountable without the proper foundations.
The business world needs to remember that this is hard work and that, to quote the Kaggle tagline, there is “no free hunch.” Treating machine learning algorithms as a black box and abstracting away their mechanisms will come at the cost of scientific rigor and robustness, which will inevitably lead to failure. Innovation comes from deep understanding, and there is no “easy button” for that.