DSC Weekly Digest 13 July 2021
I talk to a lot of people involved in the data science and machine learning space every week – some vendors, some company CDOs, many just people in the trenches, trying to build good data models and make them monetizable.
When I ask which part of the data science pipeline they have the hardest time with, the answer is almost invariably “We can’t get enough good data.”
This is not just a problem with machine learning, however. Knowledge Graph projects have run aground because their teams discover that too much of the data they have lacks sufficient complexity (read: connectivity) to make modeling worthwhile. The data is often poorly curated, poorly organized, and lacking in semantic metadata. Some data, especially personal data, is heavily duplicated, has keys that have been lost in context, and in many cases cannot be collected without a court order. Large relational databases have been moved into data lakes or enterprise data warehouses, but the data within them often heavily reflects operational rather than contextual information, a problem made worse by the fact that many programmers have at best only limited training in true data modeling practices.
What this means is that the content that drives the initial training of the data model is noisy, with a signal so weak that any optimizations made in the model itself may simply let the data scientist reach the wrong conclusions faster.
Effective data strategy involves assessing the acquisition of the data from the beginning, and recognizing that this acquisition will require the expenditure of money, time, and personnel. There are reasons why data aggregators tend to benefit heavily from being early adopters – they discovered this truth the hard way, and made the investment to turn their businesses into data scoops, with effective data acquisition and ingestion strategies, rather than just assuming that the relational databases in the back office actually held worthwhile grist for the mill.
As data science and machine learning pipelines become more pervasive in organizations and more automated, through MLOps and similar processes, this need for good source data is likely to be one that every organization’s CDO needs to attend to as soon as possible. After all, garbage in can only mean garbage out.
In medias res,
To subscribe to the DSC Newsletter, go to Data Science Central and become a member today. It’s free!