Why GraphQL Will Rewrite the Semantic Web
I’m relatively old school, semantically speaking: my first encounter with RDF was in the early 2000s, not long after Tim Berners-Lee’s now-famous article in Scientific American introducing the Semantic Web to the world. I remember working through the complexities of RDFS and OWL, spending a long afternoon with one of the editors of the SPARQL specification in 2007, promoting SPARQL 1.1 and SHACL in the mid-2010s, and watching as the technology went from being an outlier to having its moment in the sun just before COVID-19 hit.
I like SPARQL, but increasingly I have to admit a hard reality: there’s a new kid on the block that I think may very well dethrone the language, and perhaps even RDF. I’m not talking about Neo4j’s Cypher (which in its open incarnation is intriguing), or GSQL, TigerGraph’s language intended to bring SQL syntax to graph querying. Instead, as the headline suggests, I think that the king of the hill will likely end up being GraphQL.
The Semantic Web Is In Trouble
Before getting a lot of brickbats from colleagues in the community about this particular assertion, I want to lay out some of my own observations about where and why I believe the Semantic Web is currently in trouble:
Too Complex. It took me a few years to really grok how RDF worked, in part because it assumed that people would be able to understand the graph paradigm and logical inferencing models. If you have a Ph.D. in computational linguistics, RDF is not hard to understand, but if you have a two-year certificate in programming JavaScript or Python, chances are pretty good that RDF’s graph model is incomprehensible. Add to that the fact that configuring triple stores can be a logistical nightmare, and the likelihood that most programmers – let alone data analysts – will ever have encountered RDF drops dramatically.
Inference an Edge Case. One of the most powerful aspects of RDF, at least as far as proponents of the technology would have it, is its ability to be used for logical inferencing. Inferencing, which involves the ability to use aspects of the model itself to surface new information, can make for some very potent applications, but only if the model itself is navigable in the same way as other information, and only if the model is designed to make such inferences easily. However, in practice, many complex models have foundered because inheritance was made too complicated or the models failed to take into account temporal complexities. Moreover, with SPARQL, the need for intrinsic inferencing dropped fairly dramatically. Without that use case, though, many of the benefits of RDF fall by the wayside.
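That shift is easy to see in practice: a single SPARQL CONSTRUCT can materialize the same subclass entailments that once required a dedicated reasoner. Here is a minimal sketch; the endpoint URL is hypothetical, and a real store would need its own authentication and graph targeting.

```typescript
// Materialize subclass inferences with a plain SPARQL CONSTRUCT: if
// :fido is a :Dog and :Dog is a subclass of :Animal, this query
// surfaces ":fido is an :Animal" without an inference engine.
const constructQuery = `
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT { ?instance a ?superClass }
WHERE {
  ?instance a ?class .
  ?class rdfs:subClassOf+ ?superClass .
}`;

const response = await fetch("https://example.org/sparql", {  // hypothetical endpoint
  method: "POST",
  headers: { "Content-Type": "application/sparql-query", Accept: "text/turtle" },
  body: constructQuery,
});
console.log(await response.text()); // the inferred triples, serialized as Turtle
```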
Lack of Intrinsic Sequencing. RDF works from the (admittedly valid) assumption that in a graph there is no intrinsic ordering. It is possible to create extrinsic ordering by building a linked list, but because SPARQL does not guarantee that path traversal order is respected, retrieving that ordering is not guaranteed either. Since there are a great number of operations where sequencing is in fact very important, this limitation is a significant one, and in a world where object databases (which support arrays or sequences) are increasingly the norm, there are many analytics-related activities that simply cannot be done on the current crop of knowledge bases.
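A small sketch of the contrast, with illustrative identifiers: in RDF, order has to be spelled out as a chain of cons cells, while a JSON document carries it for free.

```typescript
// In RDF, ordering is extrinsic, typically a linked list of
// rdf:first / rdf:rest cons cells:
//
//   :playlist :tracks ( :trackA :trackB :trackC ) .
//
// That expands into blank-node triples whose order SPARQL result sets
// will not reliably preserve. A JSON object, by contrast, carries
// ordering intrinsically:
const playlist = {
  id: "playlist-1",
  tracks: ["trackA", "trackB", "trackC"], // array order is part of the data
};
console.log(playlist.tracks[1]); // "trackB" – position is meaningful
```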
Use Case Failures. I’ve been involved in a number of semantics projects over the years. Most of them had, at best, mixed success, and several have been abject failures that have since been superseded by other technologies. Natural language processing seemed, a decade ago, to be a bright spot in the semantic web firmament, but if you look at the field ten years later, most of the real innovations have had to do with machine learning, from BERT to the latest GPT-3. There are places where graph technology has made huge inroads, but increasingly those areas are built around labeled property graphs, not RDF graphs. There are areas where knowledge graphs could make a huge difference (compliance modeling, for instance), but when no one can agree on what exactly those models look like, it’s not surprising that areas such as smart contracts are simply not getting off the ground.
Poor Format Interoperability. RDF is an abstraction language, but it has been dependent upon various and sundry representations, many of which have … issues. RDF-XML made even hardened XML users squeamish. Turtle is an elegant little language until one has to manage namespaces, but comparatively few people have adopted it, and it doesn’t do terribly well in an environment where JSON is the dominant mode of communication. JSON-LD was a nice try. In most cases, the issues involved come down to the fact that JSON is an object description language that assumes hierarchical folding, while RDF is fundamentally normalized, and, especially with complex directed graphs, the boundaries between objects are often far from clear-cut.
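To see the friction, compare the same single assertion in Turtle and in JSON-LD; the vocabulary and IRIs here are illustrative.

```typescript
// The same assertion in two RDF serializations. Turtle is compact
// but namespace-heavy:
const turtle = `
@prefix schema: <http://schema.org/> .
@prefix ex:     <http://example.org/> .
ex:kurt schema:name "Kurt" ;
        schema:knows ex:jane .
`;

// JSON-LD folds those triples into an object, at the cost of a
// @context block and a decision about where one "object" ends and
// the next begins:
const jsonLd = {
  "@context": { "@vocab": "http://schema.org/", ex: "http://example.org/" },
  "@id": "ex:kurt",
  name: "Kurt",
  knows: { "@id": "ex:jane" },
};
```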
Lack of Consistent Ingestion. This is a two-fold problem. It is hard to ingest non-RDF content into an RDF form, in part because the process of ingestion was never really defined from the outset: the assumption at the time was that you loaded in Turtle (or even more primitive representations) and then made use of inference upon an existing body of assertions. And once you move beyond the idea of static content, all of the complexities of transactional databases have to be solved for triple stores as well. There have been many good solutions, mind you, but no real uniformity ever emerged.
Graph databases are powerful tools, especially in a world of high data connectivity, but it is increasingly becoming evident that even for knowledge bases, it’s time to refactor.
The Promise of GraphQL
I started working with RDF about the time that I came to a realization about another query language (XQuery) and the nature of documents in an XML world. An XML database is typically a collection of documents, each with its own URL. That URL (uniform resource locator) is also an IRI (international resource identifier). For narrative documents, the assumption that every subdocument (such as a chapter) was self-contained in the base document was generally a valid one, though there were exceptions (especially in textbook publishing). Once you start dealing with documents that describe other kinds of entities, however, this assumption breaks down, especially when multiple containers reference the same subdocument.
While the concept of the IRI is a fundamental one in XML, it took a while to build a semantic linking language, and there were several different attempts that went off in different directions (XLink, RDF, RDFa, XPointer, and so forth). Part of the reason for this confusion comes from the fact that most people don’t differentiate even now between a link pointer to a node in a communications network (the Internet, or some subsection thereof) and a link pointer to a node in a knowledge (or conceptual) network. Nor should this be that surprising – it’s not a distinction that usually comes up in database theory, because most databases are internally consistent with respect to references (aka keys or pointers), and the idea of conceptual links makes no sense in a SQL database as a consequence.
Additionally, if you are used to working with document object serializations, such as JSON, then the idea of having to create complex queries just to get specific objects of a given type seems like a lot of work, especially when the result comes back normalized (i.e., in discrete, identified blocks) rather than in hierarchical documents – and especially when you could already get back the same thing from a JSON database such as Couchbase or ArangoDB.
For the most part, in fact, what developers want is a way to get at a particular part of a document, applying transformations to that document as need be, without having to worry about stitching a set of components together. Similarly, they want to be able to post content in such a way that it can be checked for validity. They could do this with XQuery, but XML is notationally seen as too heavyweight (an argument with comparatively little merit), whereas JSON fits into the paradigm that they are most familiar with syntactically.
This is what GraphQL promises, and for a fairly wide array of use cases, this is what both programmers and data scientists want.
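As a concrete sketch of that workflow (the endpoint, types, and field names below are hypothetical), a client names exactly the fragment it wants and receives a matching JSON hierarchy back, with no stitching required:

```typescript
// Ask for exactly the shape needed; the server assembles the pieces.
const query = `
query GetChapter($id: ID!) {
  chapter(id: $id) {
    title
    sections { heading wordCount }
  }
}`;

const res = await fetch("https://example.org/graphql", {  // hypothetical endpoint
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ query, variables: { id: "ch-12" } }),
});
const { data } = await res.json();
console.log(data.chapter.sections.map((s: { heading: string }) => s.heading));
```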
What is significant about this is that GraphQL manages to accomplish much of what SPARQL and the RDF stack promised but never fully delivered. Specifically:
- Ease of Use. GraphQL requires both a client and a server to build the query, but that client makes schema discovery relatively simple (see the introspection sketch after this list).
- Data Store Agnostic. GraphQL can work on a relational database, a triple store, an Excel document, or a data service, for both ingestion and query.
- Transformable. There is a limited capability to perform transformations on the data set through both the query and mutation language.
- Mutable. Mutations for updating content on the server can be accomplished through a mutational query that is again system agnostic.
- Schematic. While not quite as robust as RDF, GraphQL specifies schematic relationships in its own schema definition language (with a type syntax reminiscent of TypeScript), optionally supplemented by JSON Schema, and content can be validated against that schema prior to entry.
- JSON-Centric. A decade ago, there was still some question about whether XML or JSON would predominate. Today, there is no real question – for non-narrative content, JSON has pretty much won, while XML is (not surprisingly) still favored for narrative content, if not as heavily.
- Federated. It is possible (with extensions) to make GraphQL queries federated. SPARQL still has the edge here, but federation is also still not widely utilized even in RDF-land.
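The ease-of-use point deserves a concrete illustration. Every compliant GraphQL server answers the built-in introspection query, which is what makes client-side tooling (autocomplete, query builders, documentation browsers) relatively simple to build. The endpoint below is hypothetical.

```typescript
// Discover a server's schema via GraphQL's standard introspection.
const introspection = `
{
  __schema {
    queryType { name }
    types { name kind }
  }
}`;

const resp = await fetch("https://example.org/graphql", {  // hypothetical endpoint
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ query: introspection }),
});
const { data } = await resp.json();
console.log(data.__schema.types.map((t: { name: string }) => t.name));
```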
Put another way, GraphQL provides an abstraction layer that is good enough in most cases, and preferable in others (such as sequencing) to what SPARQL provides.
GraphQL and Knowledge Graphs
This does not necessarily mean that GraphQL will eliminate the need for knowledge graphs or RDF, but it does change the role that languages such as SPARQL, Cypher, or similar dialects play. One area where this plays a significant role is in creating GraphQL schemas. A relational database schema is generally fixed by convention, while XML and JSON schemas, when they exist at all, serve primarily as a means of initiating actions based upon compliance with rules. GraphQL’s schema language is simply not robust enough for that role when it involves constraint modeling, and given the variability involved in different data systems, it is likely that this particular function will remain the purview of the data store.
Similarly, inferencing involves the construction of triples through either an inference engine or a SPARQL script (or both). One of the major issues in working with triple stores is the security aspect of exposing queries to outside users. If, when the underlying data model changes, a SPARQL script is used to regenerate the GraphQL schema from the RDF schema, then this actually provides a layer of protection: the RDF models an internal state, while the GraphQL schema models an external representation of that state.
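A minimal sketch of that generation step follows. The endpoint, the reliance on rdfs:domain alone, and the naive everything-is-a-String mapping are all simplifying assumptions; a real generator would inspect ranges and cardinalities.

```typescript
// Derive GraphQL type definitions from an RDF schema via SPARQL.
const localName = (iri: string) => iri.split(/[#/]/).pop() ?? iri;

const schemaQuery = `
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?class ?property WHERE { ?property rdfs:domain ?class . }`;

async function generateSdl(endpoint: string): Promise<string> {
  const res = await fetch(endpoint, {
    method: "POST",
    headers: {
      "Content-Type": "application/sparql-query",
      Accept: "application/sparql-results+json",
    },
    body: schemaQuery,
  });
  const json = await res.json();

  // Group property names under their owning class.
  const types = new Map<string, string[]>();
  for (const b of json.results.bindings) {
    const cls = localName(b.class.value);
    types.set(cls, [...(types.get(cls) ?? []), localName(b.property.value)]);
  }

  // Emit one GraphQL type per RDF class; every field is a String here
  // purely for brevity.
  return [...types.entries()]
    .map(([cls, props]) =>
      `type ${cls} {\n${props.map((p) => `  ${p}: String`).join("\n")}\n}`)
    .join("\n\n");
}
```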
This can also help with mutations, providing a proxy layer that maps between the internal and external presentations of the state of the knowledge graph. SPARQL Update is a powerful tool, but precisely because it is so powerful, many database administrators are reluctant to grant it to external users. Such a proxy also allows for the insertion of additional metadata and/or the creation of maps between JSON content and RDF in a consistent and controlled manner, including the generation of consistent timestamps and IRIs.
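Here is a sketch of what such a proxy resolver might look like. The mutation name, vocabulary, and endpoint are assumptions, and a production version would add validation and error handling; the point is that the server, not the caller, mints the IRI and timestamp, and the caller never touches SPARQL Update directly.

```typescript
import { randomUUID } from "node:crypto";

// A GraphQL mutation resolver acting as a controlled write proxy
// in front of a triple store.
async function addPersonResolver(args: { name: string }): Promise<string> {
  const iri = `http://example.org/person/${randomUUID()}`; // server-minted IRI
  const now = new Date().toISOString();                    // consistent timestamp
  const update = `
PREFIX schema: <http://schema.org/>
PREFIX dct:    <http://purl.org/dc/terms/>
INSERT DATA {
  <${iri}> a schema:Person ;
           schema:name "${args.name.replace(/"/g, '\\"')}" ;
           dct:created "${now}"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
}`;
  await fetch("https://example.org/sparql-update", {  // hypothetical endpoint
    method: "POST",
    headers: { "Content-Type": "application/sparql-update" },
    body: update,
  });
  return iri; // handed back to the GraphQL client as the new object's id
}
```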
Additionally, an increasing number of GraphQL implementations on knowledge graphs make use of the JSON-LD context object. This provides a way of maintaining IRIs without significantly impacting the utility of the JSON produced by GraphQL. That approach also solves one of the peskier aspects of JSON-LD, in that a GraphQL-generated JSON structure comes back already denormalized, reducing the need to do that work out of band.
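A sketch of the pattern, with illustrative field names and mappings: attach a context object to the GraphQL result, and the denormalized tree doubles as a JSON-LD document whose fields still resolve to IRIs.

```typescript
// A JSON-LD context that maps plain GraphQL field names onto IRIs.
const context = {
  "@context": {
    "@vocab": "http://schema.org/",
    id: "@id",
  },
};

// A typical denormalized GraphQL result (illustrative data).
const graphqlResult = {
  id: "http://example.org/person/123",
  name: "Ada",
  knows: [{ id: "http://example.org/person/456", name: "Grace" }],
};

// Merging the two yields a valid JSON-LD document.
const jsonLdDocument = { ...context, ...graphqlResult };
console.log(JSON.stringify(jsonLdDocument, null, 2));
```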
The Future of Semantic GraphQL
Nonetheless, GraphQL will likely push the engines involved with RDF towards JSON/hybrid stores over the next few years. Triple stores in general are built around n-tuple indexes, though increasingly with additional indexes encoding intermediate structures optimized for JSON retrieval. These so-called hybrid databases can also present service layers that make them look like relational data stores, albeit with some potential lossiness.
This also points to a future where federated queries become less likely, rather than more, which shouldn’t be all that surprising. Federation has a lot of issues associated with it, from semantic ones (the challenge of standardizing on a particular ontology) to performance issues (latency in connections) to security and accessibility. However, with GraphQL, the cost of developing a translation layer becomes fairly low, and the onus is put not on the data provider but on the data consumer. Indeed, I can see an aftermarket for common ontology-to-ontology queries, which may very well mitigate one of the bigger headaches involved in linked data.
GraphQL may also end up being a bridge between semantic and labeled property graph (LPG) operations. While it is possible to do shortest-path calculations in an RDF graph, it’s not the most efficient way of utilizing such a graph; indeed, shortest-path calculations, used in everything from traffic applications to genetic sequencing, are precisely where LPGs excel. On the other hand, LPGs are at best indifferent for inferencing. Yet GraphQL could readily load LPGs with data pulled from knowledge graphs, perform multiple optimizations, then return the results through a known interface, as sketched below.
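One way that bridge might look, purely as an illustration: a GraphQL field backed by an LPG engine’s path algorithm. The shortestPath field and its schema are hypothetical, not an existing API.

```typescript
// A hypothetical GraphQL field fronting an LPG shortest-path routine.
const query = `
query Route($from: ID!, $to: ID!) {
  shortestPath(from: $from, to: $to) {
    length
    nodes { id label }
  }
}`;
// A resolver behind this field could delegate to a property-graph
// store via its native query language, then hand the path back as
// ordinary JSON, so the client interface stays identical to the rest
// of the graph API.
```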
Finally, it is possible that we’re on the right track with regard to true reasoning as a system of computation on logical formalisms, but my suspicion is that reasoning requires context, the ability to work with fuzzy logic, and the kind of Bayesian analysis that seems to be the hallmark of machine learning. In other words, the semantic systems of today are likely at best very primitive approximations of where they will be in a decade or two. Indeed, the limits of purely formal systems are something that mathematician Kurt Gödel proved nearly a century ago. We learn from that, build on it, and move on.
Regardless of where we end up, I think it is safe to say that GraphQL will have an important part to play in getting there.
Kurt Cagle is the managing editor of Data Science Central.