DSC Weekly Digest 17 August 2021
As data systems become more complex (and far-reaching), so too does the way that we build applications. Enterprise data no longer just means the databases that a company owns, but increasingly refers to broad models where data is shared among multiple departments, is defined by subject matter experts, and is referenced not only by software programs but also by complex machine learning models.
The days when a software developer could arbitrarily create their own model to do one task very specifically seem to be slipping away in favor of standardized models that then need to be transformed into a final form before use. Extract, transform, load (ETL) has now given way to extract, load, transform (ELT). There’s even been a shift in best practices in the last couple of decades, with the idea that you want to move core data around as little as possible and rely instead upon increasingly sophisticated queries and transformation pipelines.
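The ELT idea can be made concrete with a minimal sketch. It assumes an in-memory SQLite store standing in for the warehouse, and the table name `raw_orders`, its columns, and the sample rows are all hypothetical. Raw records are loaded untouched, and the transformation (casting and aggregating) happens later, inside the query.

```python
import sqlite3

# Hypothetical raw records, kept exactly as extracted (amounts are still text).
RAW_ORDERS = [
    ("2021-08-01", "widgets", "20.50"),
    ("2021-08-01", "gadgets", "5.00"),
    ("2021-08-02", "widgets", "10.25"),
]


def load_raw(conn, rows):
    """Extract + load: store the records as-is, with no cleanup beforehand."""
    conn.execute(
        "CREATE TABLE raw_orders (order_date TEXT, product TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)


def revenue_by_product(conn):
    """Transform: cast and aggregate at query time, inside the store."""
    cur = conn.execute(
        "SELECT product, SUM(CAST(amount AS REAL)) "
        "FROM raw_orders GROUP BY product ORDER BY product"
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
load_raw(conn, RAW_ORDERS)
result = revenue_by_product(conn)
print(result)
```

In an ETL version, the cast and the aggregation would run in application code before the insert, and the raw rows would never reach the store. Here the core data is moved once and each new question becomes a new query.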
At the same time, the notion is growing that the database, in whatever incarnation it takes, is always somewhat local to the application domain. The edge is gaining in intelligence and memory. Indeed, most databases are moving towards in-memory stores, and caching is evolving right along with them.
The future increasingly is about the query. For areas like machine learning, the query ultimately comes down to making models so that they are not only explainable, but tunable as well. The query response is becoming less and less about a single answer, and more about creating whole simulations.
At the same time, the hottest databases are increasingly graph databases that allow for inferencing, the surfacing of knowledge through the subtle interplay of known facts. Bayesian analysis (in various forms and flavors) has become a powerful tool for predicting the most likely scenarios, with queries here having to straddle the line between utility and meaningfulness. What happens when you combine the two? I expect this will be one of the hottest areas of development in the coming years.
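As a rough illustration of inferencing, the sketch below surfaces facts that were never stated by chaining together ones that were. It is a toy forward-chaining loop over triples, not how any particular graph database implements inference, and the `located_in` relation and place names are hypothetical sample data.

```python
def infer_transitive(facts, relation):
    """If (a, relation, b) and (b, relation, c) are known, add (a, relation, c).

    Repeats until a full pass adds nothing new.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        # Snapshot the current edges so the set can grow during the pass.
        edges = [(s, o) for (s, r, o) in known if r == relation]
        for a, b in edges:
            for c, d in edges:
                if b == c and (a, relation, d) not in known:
                    known.add((a, relation, d))
                    changed = True
    return known


# Hypothetical stated facts as (subject, relation, object) triples.
FACTS = {
    ("Seattle", "located_in", "Washington"),
    ("Washington", "located_in", "USA"),
    ("USA", "located_in", "North America"),
}

inferred = infer_transitive(FACTS, "located_in")
new_facts = inferred - FACTS
print(sorted(new_facts))
```

Nothing in the input says that Seattle is in North America, yet the query over the inferred set can answer it. A Bayesian layer would attach probabilities to such derived facts rather than treating them as certain, which this sketch does not attempt.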
SQL won’t be going away – the tabular data paradigm is still one of the easiest ways to aggregate data – but the world is more than just tables. A machine learning model, at the end of the day, is simply an index, albeit one where the keys are often complex objects, and the results are as well. A knowledge graph takes advantage of robust interconnections between the various things in the world and is able to harness that complexity, rather than get bogged down by it.
It is this that makes data science so interesting. For so long, we’ve been focused primarily on getting the right answers. Yet in the future, it’s likely that the real value of the evolution of data science is learning how to ask the right questions.
In medias res,
To subscribe to the DSC Newsletter, go to Data Science Central and become a member today. It’s free!