DSC Weekly Digest 06 July 2021
Before launching into my editorial this week, I wanted to make the announcement that starting with this issue, the DSC Newsletter will be sent out on Tuesdays rather than Mondays. To subscribe to the DSC Newsletter, go to Data Science Central and become a member today. It’s free!
There has long been a pattern with computer technology. At a certain point in the evolution of a technology, there is a realization that things that had been done repeatedly as one-offs occur often enough to justify building libraries, or even extensions to languages. For a while, mastery of these libraries defines a certain subset of programmers or analysts, and most of the innovation tends to take place as improvements to these libraries, articles on technical sites or in journals, and so forth.
Eventually, however, the capabilities are abstracted into more sophisticated stand-alone applications, frequently with user interfaces that provide ways to handle the most frequent use cases while relegating the edge cases to specialized screens. The results of these, in turn, are wrapped within some kind of container (typically managed by an orchestrator such as Kubernetes) that can then be incorporated into a pipeline with other similar containers.
This is the direction that machine learning is going. MLOps now complements DevOps, managing changes to machine learning models across the whole lifecycle: from the data engineering necessary to ensure that the source data is ready for production, to feature engineering that can be altered on the fly to try out different scenarios, through to presentation and productization that not only makes the results understandable to a business audience but also lets them feed into other operational channels. It is here now, and will likely become commonplace within the industry in the next couple of years.
This transformation is critical for several reasons. First, it makes it far easier to create ensemble models: models that are developed and run in parallel, and that can handle different starting scenarios. This is key because the more generalized a model has to be, the more expensive, time-consuming, and complex it turns out to be, and the less likely it is to handle edge cases accurately. This is especially important when dealing with sparse datasets, where the danger is that a single, comprehensive model can badly overfit the input, making such a model very sensitive to initial conditions.
In addition, by reducing the time and cost of implementing models from months to weeks, or even days, organizations are able to better productize their data analytics in ways that would have been unheard of even a couple of years ago. As not all problems can (or should) be solved with machine learning in the first place, the ability to take advantage of more generalized DevOps pipelines within your organization puts machine learning right where it belongs: as a powerful tool among many, rather than a single, potentially shaky foundation on its own.
For machine learning and data science specialists, this has other implications as well. Domain proficiency in a given sector will mean more, and the ability to write Python or R will mean less, save for those who focus more specifically on tool-building within integrated frameworks. However, demand for a good understanding of data operations in general and machine learning operations in particular, all of them engineering tasks, will likely increase dramatically over the next few years. Additionally, those who are better at productizing data, integrating ML streams with other streams towards the creation of digital assets that can then be published as physical assets, will do quite well.
Machine learning is maturing. There’s nothing wrong with that.
In medias res,
Kurt Cagle
Community Editor,
Data Science Central