Robin Berjon

The Missing Links

Don't Make Me Think (About Linked Data)

I've long been very much in love with the semantic web vision. Or at least, that's what I've been telling myself. I really like the idea of having lots of data out there, with links between items, in a way that I can process. Easily. And I can't say I care whether it's branded with capital "S" & "W" or not, or whether it becomes Linked Data or Data on the Web. I'm definitely not amongst the most knowledgeable on the topic, and it so happens (totally by chance) that I've only been involved with the existing fruits of this technology in relatively minor ways. But the point is: I've always been deeply fond of the idea, and I've been waiting. A lot.

During a thread on Twitter earlier today, the discussion veered towards this topic and it surfaced many small notes that had been floating around in my mind these past few years as I've been looking more closely at a variety of things from the Open Data world, microdata & RDFa, JSON-LD, and a few other such areas. This discussion eventually led Manu Sporny to post a thoughtful piece on G+. I do recommend you read it since it provides the full background, but for the lazy in your midst I will write the rest of this post as if you had not read Manu's.

Along with a few kindred souls, I run a small service called SpecRef. It's a very simple HTTP+JSON API that allows you to pass in a list of specification identifiers (often seen in specification cross-references as [FOO]) and returns a simple structure that you can use in order to insert bibliographical entries at the bottom of your document. It is primarily used by the ReSpec publishing system, but it is increasingly being queried for other uses as well. In any case, it's small, sweet, simple, and all about the JSON.
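
To make that concrete, here is roughly what a request and its response look like (the entry's values are invented for illustration, and the real data carries a few more fields):

    GET http://specref.jit.su/bibrefs?refs=DAHUT

    {
        "DAHUT": {
            "title":   "The Dahut Specification",
            "authors": ["Robin Berjon"],
            "date":    "19 April 2013",
            "href":    "http://example.org/TR/dahut/"
        }
    }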

In order to make the discussion we were having about Linked Data more concrete, I asked the proponents who were present: If we made SpecRef use JSON-LD instead of JSON, what would we gain? Manu found it a bit harsh but it wasn't meant to be. It's a fairly basic question: I have this system which isn't using your technology, if it did, how would it be improved?
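
To give a flavour of what that would entail, a JSON-LD rendition of a single entry might look something like the following (the choice of Dublin Core terms and the values themselves are purely illustrative):

    {
        "@context": {
            "title":   "http://purl.org/dc/terms/title",
            "authors": "http://purl.org/dc/terms/creator",
            "date":    "http://purl.org/dc/terms/date"
        },
        "title":   "The Dahut Specification",
        "authors": ["Robin Berjon"],
        "date":    "19 April 2013"
    }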

Manu's response is essentially (and he will correct me if I'm wrong) that switching to JSON-LD would mean that:

- we would stop reinventing the wheel and reuse existing vocabularies instead of ad hoc field names;
- the data could be processed by existing Linked Data and RDF toolchains;
- consumers who have their own systems would not need to convert the data; and
- the data could be annotated and extended in an open-world, decentralised manner.

Now, as web hackers go, I think I have an abnormally high tolerance for working with technology that imposes an early tax in exchange for the promise of gains at some point in an unspecified future. I've worked with AppCache and XML Schema. I've even implemented parts of XML Schema. This blog is still running on ancient code that uses some Perl to glue together a bunch of XSLT 1.0. I picked up Erlang just to patch CouchDB. And some of you may think you dislike XML; I've worked on binary XML. In fact, on several formats for it. I've also designed fluid UIs in SVG. In other words, if you ever bump into my judgement I'd be grateful if you were to hold it for questioning.

Yet despite that, the above list really crystallised my feeling that there is something wrong in the way Linked Data is being approached. That was further compounded by the fact that Manu indicated that the JSON from SpecRef, because of its specific form, could not be automatically mapped into JSON-LD. Here is what gives me pause.

Reinventing the wheel

Using a few fields — title, authors, date — to describe something is not reinventing the wheel. It's just using a few fields. Reusing an existing vocabulary supposes that one knows how one's data will evolve. But a lot of the time you simply don't.

In this case, the initial version of the service four years ago (which was itself derived from another, older one) did nothing more than map identifiers to HTML snippets that could be inserted directly into documents. That did the job, and it's all that was needed for a long while. As usage became more refined, more granular fields were added. But the cost of reuse still makes no sense for a title and a list of names. And by the time a vocabulary has grown enough that it would warrant reusing an existing option, it is already widely deployed and likely cannot be changed.
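
Reconstructed from memory (and simplified), that first incarnation was little more than:

    {
        "DAHUT": "<cite><a href=\"http://example.org/TR/dahut/\">The Dahut Specification</a></cite>, Robin Berjon. 19 April 2013."
    }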

Had I known from the very first day four years ago that this service would evolve into something close enough to an existing vocabulary then I likely would have considered reusing something. Likewise, if this had been an HTTP API to describe the life history of every single moving part in a fleet of airliners, I would certainly have sought an existing option. But that was not the case, and there is no way in which I could have anticipated the path through which this grew. It may have turned into something different. It may not have grown. In fact, this was initially intended just for a handful of people in a single group.

It should therefore be a core tenet of linked data that publishers should not have to think about interoperability through existing vocabularies (unless they are specifically taking part in an existing, relatively predictable data community). If the system is predicated on people thinking about reuse before they can even start publishing, then it will largely fail — especially in reaching the vast amounts of small data that exist in the wild.

The Texas Toolchain Massacre

I am not conversant with the RDF toolchains available to my preferred platform (Node) these days, but I would be very surprised if any of them were as simple as JSON.parse() followed by just accessing the data.
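
By way of comparison, here is the entirety of the "toolchain" needed to consume SpecRef from Node (a rough sketch, assuming the field names shown earlier):

    var http = require("http");
    http.get("http://specref.jit.su/bibrefs?refs=DAHUT", function (res) {
        res.setEncoding("utf8");
        var body = "";
        res.on("data", function (chunk) { body += chunk; });
        res.on("end", function () {
            var refs = JSON.parse(body);   // that's it, that's the toolchain
            console.log(refs.DAHUT.title + " by " + refs.DAHUT.authors.join(", "));
        });
    });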

JSON-LD is certainly a step in the right direction here, but it is worth considering that, as per Larry Wall's timeless saying, not only should simple things be simple, but if you're introducing new technology that you want to see succeed then simple things should be at least as simple as they already are, and ideally simpler.

I should just be able to push my data out in a simple, common format (JSON, CSV) and allow people to just consume it with the tools they already know. If it later surfaces — as may well happen — that there would be value in adding something here or there to make my data operate well with a given toolchain, that should 1) be simple to implement as an afterthought, and 2) not break existing usage. A system that keeps you free to open up your choices at any time later is far more resilient (and conducive to overall interoperability) than one that requires you to think about what you're doing first. Linked data should be a layer on top of simple, pretty-much-raw data.
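
As one sketch of what "afterthought" could mean here (the context URL is made up, and as noted above SpecRef's identifier-keyed shape may well need more than just a context): a top-level "@context" key could later be added to the very same output, and consumers who simply JSON.parse() it today would carry on exactly as before, ignoring the key they don't know about.

    {
        "@context": "http://example.org/specref-context.jsonld",
        "DAHUT": {
            "title":   "The Dahut Specification",
            "authors": ["Robin Berjon"],
            "date":    "19 April 2013"
        }
    }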

Munging Happens

If users of the data have an existing system, or if they have specific constraints on what they use, then indeed they will likely require conversion from my data to theirs. And I certainly won't deny that if I happen to know who will consume the data, then it would be better to publish according to those needs.

However, the fact is that in this case, as in many others, at the time the service was built there was no way of knowing who might consume it (other than my own tools). If it had been certain that it would be librarians, then I might have considered tailoring the output to them. But it could equally well have been people manipulating software tests, who might need to make assertions about what is being tested, and these might have required a different vocabulary. (The service is in fact used by both, which is why the awesome Tobie Langel has been evolving it a lot.)

The problem is that if I am required to pick a vocabulary before deploying my service, I need to live not so much with open world assumptions as with predictable world assumptions. Predictably, that won't work.

One of the greatest values in publishing reusable data is that you know neither who will want to use it nor how. Because of that, unless it is obvious that you're targeting a given community, the chances are that it is not worth thinking about how to fit your model into a shared one. The first order of business is to do a good job getting the data out there, and the best way of doing that is likely to simply expose something close to your own internal model (which isn't to say that you shouldn't learn from how things like yours are commonly modelled). The odds are very high that conversion will be needed no matter what, for at least some of your users (and not unlikely for most). A resilient linked data ecosystem needs to treat data conversion as a natural, core, and common part of everyday life. Munging happens. It just always does.
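
A sketch of what such everyday munging looks like in practice, mapping a SpecRef entry to whatever shape a given consumer happens to need (the target shape here is entirely made up):

    // convert one SpecRef entry into a hypothetical consumer's record format
    function toConsumerRecord (id, entry) {
        return {
            key:    id,
            label:  entry.title,
            people: (entry.authors || []).join(", "),
            year:   entry.date ? entry.date.split(" ").pop() : null
        };
    }

    // e.g. toConsumerRecord("DAHUT", refs.DAHUT)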

In A Big, Big World

The SpecRef data can be annotated in an open-world, decentralised, extensible manner. All it requires is the ability to point to stuff. URLs rock, as does JSON Pointer. Want to know the title of the "DAHUT" specification? http://specref.jit.su/bibrefs?refs=DAHUT#/DAHUT/title. Or maybe who the first author is? http://specref.jit.su/bibrefs?refs=DAHUT#/DAHUT/authors/0.
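
For the curious, the fragment is just a JSON Pointer, and evaluating it against the parsed response takes a few lines of code (a deliberately naive sketch: a real implementation would also handle the ~0/~1 escapes and URI decoding):

    // walk a JSON Pointer such as "/DAHUT/authors/0" through a parsed document
    function resolvePointer (doc, pointer) {
        return pointer.split("/").slice(1).reduce(function (node, token) {
            return node[token];
        }, doc);
    }

    // given refs = JSON.parse() of the SpecRef response:
    //   resolvePointer(refs, "/DAHUT/title")      // the title
    //   resolvePointer(refs, "/DAHUT/authors/0")  // the first author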

Yes, there's a bunch of things that RDF can give you here that this doesn't. But the important fact is: you can point at it. All the rest can be layered on top of that.

Where Do We Go From Here?

So that's all fair and well as far as criticism goes, but how can we actually make this a happier place?

I don't claim to have the answer, or even a draft of one, but my first instinct is to provide something that targets the impedance mismatches. Handling data is full of them. You need a given data model for your DB, then another to push it out on the wire. You need to generate forms and related UI items for editing. And you need to validate the data at various stages, probably in somewhat different ways.

The RDF stack somehow tries to do that, but it does so in a way that's convenient to use in almost no existing system except RDF ones.

So start with JSON. It works well in every sane language (and several insane ones). Throw in at least some of what's in JSON Schema for validation purposes. You're working on the Web so the natural way of interacting with data is HTML forms. So tweak what you have to work well with the sort of constraints and types that can be expressed in forms, for common Web editing tasks, notably taking client-side validation into account. At that point you likely have something that can usefully be mapped to typical storage (e.g. Mongo, Couch, or your garden variety ORM/RDB) but might require a few extra bits of information (e.g. that a given field ought to be unique — which in turn can be used in the form to check availability immediately, etc.).
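
To make that slightly more concrete, the kind of schema I have in mind looks roughly like this (the field names, the "unique" hint, and the overall shape are illustrative, not a defined format):

    {
        "type": "object",
        "properties": {
            "title":     { "type": "string", "required": true },
            "authors":   { "type": "array",  "items": { "type": "string" } },
            "date":      { "type": "string", "format": "date" },
            "shortName": { "type": "string", "pattern": "^[a-z0-9-]+$", "unique": true }
        }
    }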

You then need a way to link between entities. JSON Ref can be a good choice here (http://tools.ietf.org/html/draft-pbryan-zyp-json-ref-03). You likely don't want links everywhere, just at predictable places in your data model. Note that this has the interesting property that you do in fact end up with a graph, but it's a graph built of familiar, largely self-contained tree objects that are easy to manipulate by people who, like yours truly, have been using basic programming data structures forever.
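
In practice such a link is nothing more than an object with a $ref at an agreed-upon spot in the model (the entity URLs below are invented for the sake of the example):

    {
        "title":  "The Dahut Specification",
        "editor": { "$ref": "http://example.org/people/robin-berjon" },
        "obsoletes": [
            { "$ref": "http://specref.jit.su/bibrefs?refs=DAHUT1#/DAHUT1" }
        ]
    }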

Once you have a schema that gives you all that, you can start thinking about conversions. Ideally it's not something the publisher needs to do; anyone should be able to publish conversions. We can get to that later. Maybe that's a job for BEM's XJST.

I have taken a first stab at this: Web Schema. It's just a personal hack and in no way complete (it only supports features I've needed or that have been trivial to add). I've definitely used it for validation and for automatic (rich) form generation, and I'm looking at producing Mongoose schemata from it (which is pretty trivial). I'm simply sticking it out there as a potential brick in building a linked data system that I believe stands a greater chance of ubiquitous deployment.

Yes, it's best effort, it's informal, and it operates more through social constructs than technical enforcement. In other words, it's built for the Web.