DSC Weekly Digest 29 March 2021
One of the more significant “quiet” trends that I’ve observed in the last few years has been the migration of data to the cloud and with it the rise of Data as a Service (DaaS). This trend has had an interesting impact, in that it has rendered moot the question of whether it is better to centralize or decentralize data.
There have always been pros and cons on both sides of this debate, and they are generally legitimate concerns. Centralization usually means greater control by an authority, but it can also force a bottleneck as everyone attempts to use the same resources. Decentralization, on the other hand, puts the data at the edges where it is most useful, but at the cost of potential pollution of namespaces, duplication and contamination. Spinning up another MySQL instance might seem like a good idea at the time, but inevitably the moment that you bring a database into existence, it takes on a life of its own.
What seems to be emerging in the last few years is the belief that an enterprise data architecture should consist of multiple, concentric tiers of content, from highly curated and highly indexed data that represents the objects that are most significant to the organization, then increasingly looser, less curated content that represents the operational lifeblood of an organization, and outward from there to data that is generally not controlled by the organization and exists primarily in a transient state.
Efficient data management means recognizing that there is both a cost and a benefit to data authority. A manufacturer’s data about its products is unique to that company, and as such, it should be seen as being authoritative. This data and metadata about what it produces has significant value both to itself and to the users of those products, and this tier usually requires significant curational management but also represents the greatest value to that company’s customers.
Customer databases, on the other hand, may seem like they should be essential to an organization, but in practice, they usually aren’t. This is because customers, while important to a company from a revenue standpoint, are also fickle, difficult to categorize, and frequently subject to change their minds based upon differing needs, market forces, and so forth beyond the control of any single company. This data is usually better suited for the mills of machine learning, where precision takes a back seat to gist.
Finally, on the outer edges of this galactic data, you get into the manifestation of data as social media. There is no benefit to trying to consume all of Google or even Twitter without taking on all of the headaches of being Google or Twitter without any of the benefits. This is data that is sampled, like taking soundings or wind measurements in the middle of a boat race. The individual measurements are relatively unimportant, only the broader term implications.
From an organizational standpoint, it is crucial to understand the fact that the value of data differs based upon its context, authority, and connectedness. Analytics, ultimately, exists to enrich the value of the authoritative content that an organization has while determining what information has only transient relevance. A data lake or operational warehouse that contains the tailings from social media is likely a waste of time and effort unless the purpose of that data lake is to hold that data in order to glean transient trends, something that machine learning is eminently well suited for.
This is why we run Data Science Central, and why we are expanding its focus to consider the width and breadth of digital transformation in our society. Data Science Central is your community. It is a chance to learn from other practitioners, and a chance to communicate what you know to the data science community overall. I encourage you to submit original articles and to make your name known to the people that are going to be hiring in the coming year. As always let us know what you think.