DSC Weekly Digest 21 June 2021
One of the more underappreciated problems of working with big data systems, machine learning systems, or knowledge graphs is that the number of classes (types of things) can quickly climb into the hundreds or even thousands. When the data in these systems comes primarily from external data stores, this is problematic even with service interfaces, but the issue becomes truly unmanageable in the realm of user interfaces and user experience.
To give an example, a typical CRM such as Salesforce may contain data for people, locations, transactions, accounts, products, and so on, often amounting to dozens of different kinds of entities being tracked. All too often, the attitude is that this information simply comes from databases, but the reality is that somewhere along the line, someone – a data entry person, an account manager, a customer, a shipper, somebody – had to enter that information into a computer in the first place. This is the source of one of the biggest problems in enterprise data management today.
Why? All too often, the data that goes into one data store represents a very narrow (and almost always programmer-designed) view of the domain. In practice, people do not have access to the underlying data but instead rely on data services – web services, mainly – that transform an existing record into a frequently lossy JSON or XML file, shedding metadata along the way. Keys become poorly transcribed or lose their context, and adding new properties becomes a major, expensive headache. Worse, this process, repeated across dozens of systems, creates an impenetrable thorny wall that reduces or even eliminates any kind of flexibility.
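To make that lossiness concrete, here is a minimal, hypothetical sketch (the record and field names are invented for illustration): a database row whose keys carry type and foreign-key context gets flattened into JSON at a service boundary, and everything that made those keys meaningful disappears.

```python
import json

# Hypothetical source record: in the database, "account_id" is a foreign key
# with a declared type, a referenced table, and column-level metadata.
source_record = {
    "person_id": 10442,       # primary key, integer, indexed
    "name": "Jane Doe",
    "account_id": 88731,      # foreign key -> accounts.account_id
    "updated": "2021-06-14",  # DATE column in the source schema
}

# A typical service boundary flattens this to JSON. The keys survive as bare
# strings, but the types, key constraints, and referenced tables are gone.
payload = json.dumps(source_record)
print(payload)

# Round-tripping shows what was lost: "account_id" is now just a number with
# no machine-readable link back to the accounts table, and "updated" is a
# plain string rather than a date.
restored = json.loads(payload)
print(type(restored["account_id"]), type(restored["updated"]))
```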
One of the benefits of semantic-based systems (knowledge graphs) is that they solve several data engineering problems at once. Such knowledge graphs are highly connected, yet those connections can be traversed with query languages (SPARQL, GQL, GraphQL, Gremlin, and so forth) and extended with very little effort. Moreover, it becomes possible to infer structure from data, to hold data connected in multiple ways, and to readily handle true temporality, not just the transaction-log focus typical of SQL databases.
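As a rough illustration of how little ceremony such traversal and extension take, here is a minimal sketch using Python's rdflib (the tiny graph and the ex: namespace are invented for the example): two hops across the graph in one SPARQL query, and a new property added simply by asserting a new triple.

```python
from rdflib import Graph

# A toy knowledge graph: a person, the account she holds, and the product
# that account subscribes to. The ex: namespace is invented.
turtle = """
@prefix ex: <http://example.org/> .
ex:jane      ex:holdsAccount ex:acct1 .
ex:acct1     ex:subscribesTo ex:widgetPro .
ex:widgetPro ex:label "Widget Pro" .
"""

g = Graph()
g.parse(data=turtle, format="turtle")

# Traverse two relationships in a single SPARQL query -- no joins to design,
# no schema migration to run first.
query = """
PREFIX ex: <http://example.org/>
SELECT ?person ?product WHERE {
    ?person     ex:holdsAccount ?acct .
    ?acct       ex:subscribesTo ?productIri .
    ?productIri ex:label        ?product .
}
"""
for person, product in g.query(query):
    print(person, "->", product)

# Extending the model is just asserting new triples; nothing else changes.
g.parse(
    data="@prefix ex: <http://example.org/> . ex:jane ex:locatedIn ex:seattle .",
    format="turtle",
)
```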
Additionally, because of this, systems can use inferential reasoning to “ask” the user for just the information that is needed, through generated UI elements. A person in your system changes addresses? An intelligent interface can surface only the fields needed to update (or add) that address, and can even tell from UI cues whether to create a new address record or update an existing one – without a programmer ever building that screen. Similarly, machine learning applications can take a hand-drawn sketch (perhaps with dragged elements) and turn it into a working user interface almost trivially.
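A hedged sketch of the form-generation idea, with the shape definition and field names invented for illustration: if the model declares which properties an Address requires (in a real system this would come from an ontology or a SHACL shape rather than a hard-coded dict), a generic routine can emit the corresponding form fields on demand, with no hand-built screen per class.

```python
# Hypothetical property shape for an Address; invented for this example.
ADDRESS_SHAPE = {
    "class": "Address",
    "properties": [
        {"name": "street",     "required": True},
        {"name": "city",       "required": True},
        {"name": "postalCode", "required": True},
        {"name": "country",    "required": False},
    ],
}

def render_form(shape: dict) -> str:
    """Generate a plain-HTML form from a property shape.

    The point is that the UI is derived from the model: add a property to
    the shape and the form grows a field, with no programmer involvement.
    """
    rows = [f"<form data-class='{shape['class']}'>"]
    for prop in shape["properties"]:
        required = " required" if prop["required"] else ""
        rows.append(
            f"  <label>{prop['name']}"
            f" <input name='{prop['name']}' type='text'{required}></label>"
        )
    rows.append("</form>")
    return "\n".join(rows)

print(render_form(ADDRESS_SHAPE))
```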
This approach has several key benefits: consolidation and significant reduction of duplicate (or near-duplicate) data, major reductions in the time and cost of building applications, and the reduction or even elimination of complex forms (and perhaps of the need for everyone to constantly re-enter resume information), among many others. Additionally, such efforts can create large-scale datastores that respond dynamically to changes in models without overt and expensive software production efforts.
It is the ability to better control the entry of data into the data ecosystem in the first place, not just clever chatbots or expensive data-mining efforts, that will enable data-driven companies to succeed. Control the input – the ingestion – of data so that it is consistent with the underlying models early in the process, and you are well on your way to reducing the cost of everything from data cleansing and master data management to feature engineering and data analysis. To do that, however, it is time to move beyond SQL and start embracing the graph.
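One concrete way to enforce that model-consistency at the point of ingestion is shape validation. This sketch uses the pyshacl library against an invented Address shape (the ex: names are illustrative): a non-conforming record is caught before it ever reaches the datastore, rather than being cleaned downstream.

```python
from rdflib import Graph
from pyshacl import validate  # pip install pyshacl

# An invented SHACL shape: every Address must carry a postal code.
shapes = Graph().parse(data="""
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .
ex:AddressShape a sh:NodeShape ;
    sh:targetClass ex:Address ;
    sh:property [ sh:path ex:postalCode ; sh:minCount 1 ] .
""", format="turtle")

# Incoming data that is missing the postal code.
data = Graph().parse(data="""
@prefix ex: <http://example.org/> .
ex:addr42 a ex:Address ;
    ex:street "10 Main St" .
""", format="turtle")

# Reject (or route for repair) non-conforming records at ingestion time.
conforms, _, report = validate(data, shacl_graph=shapes)
print(conforms)  # False: the record fails the model's constraints
print(report)    # Human-readable explanation of what is missing
```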
In media res,
Kurt Cagle
Community Editor,
Data Science Central