05 April 2011

Pretty Good Uniqueness

Software developers are neurotic about uniqueness: no two files may share the same path, no two users the same ID. That’s probably good: we like money and email to go to the right person.

Over in the Real World™, people are more relaxed. We tolerate quite a lot of ambiguity, relying partly on context to remove it, and partly on clarification when necessary–“Paris, France, not Paris, Texas”. We even tolerate a certain level of confusion and error as a reasonable price to pay for not always having to refer to each other by National Insurance number.

Terry Jones (not the Python, nor the Qur’an-burning pastor, but @terrycojones, the unorthodox visionary behind Fluidinfo) frequently says that he wants to make working with information in computers more like working with information in the Real World™. It’s a useful goal.

Almost from the first moment I heard about Fluidinfo, with its model of information sharing based on tagging common objects, I’ve been interested in (some might say obsessed with) the question of how to map Real-World™ objects and concepts (like Paris, Animal Farm, The Eiffel Tower, Existential Philosophy and the ring on my finger) to Fluidinfo objects, romantically identified, as they are, by 128-bit integers (hubristically so-called ‘universally unique identifiers’ [UUIDs]) such as 6387ab3f-e3d5-4ca9-bd13-ae3f-fd9c1830.

Fluidinfo’s about tag (fluiddb/about, to give it its full name) was created specifically to make it easier to decide where to put information in Fluidinfo. Every object in Fluidinfo, when it’s created, can optionally have this about tag set to a Unicode string, and Fluidinfo guarantees that about tags are unique, i.e. that no two different objects will ever share an about tag. As a result, you can directly address objects in Fluidinfo by specifying an about tag. For example, http://fluiddb.fluidinfo.com/about/Paris is the URL for the Fluidinfo object with the about tag “Paris” (UUID 17ecdfbc-c148-41d3-b898-0b5396ebe6cc, since you ask).
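As a rough illustration of that addressing scheme, here is a minimal Python sketch that builds the URL for a given about tag. The helper name about_url is hypothetical, and the assumption that the about tag is simply percent-encoded (as UTF-8) and appended to the base URL is mine; only the “Paris” URL above comes from Fluidinfo itself.

```python
from urllib.parse import quote

# Base URL taken from the "Paris" example above.
BASE_URL = "http://fluiddb.fluidinfo.com/about/"


def about_url(about):
    """Return a URL addressing the object whose about tag is `about`.

    Assumption: the tag is percent-encoded as UTF-8 and appended to the
    base URL. safe="" means characters such as "/" and ":" are encoded too.
    """
    return BASE_URL + quote(about, safe="")


print(about_url("Paris"))
print(about_url("Paris, France"))
```

For a plain tag like “Paris” this reproduces the URL given above; for tags containing spaces or punctuation, the encoding the service actually expects may differ from this sketch.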

Fluidinfo, by Terry’s very specific design, does not force anyone to use about tags in any specific way. Any Fluidinfo user can attach any information to any Fluidinfo object she likes. If user jacqui decides to attach information about Paris, Texas to the Paris object above, and gemma chooses to use it to store information about Paris, France, that is entirely fine. It’s even fine if Fluidinfo user anarchist decides to store information about Birmingham (Alabama), or existential philosophy, or her entire record collection on the same object. There will be no one from Fluidinfo complaining or banning or undoing (though it’s possible that those with acute hearing may perceive a quiet “tsk, tsk” sound emanating from the author of this blog).

I believe, however, that most Fluidinfo users will want there to be conventions for about tags that will encourage information about the same thing to be stored on a well-defined common object, and for information about different things to be stored on different objects. Of course, we won’t always get those conventions right first time, and they will evolve over time, but my feeling is that a few hours of thinking can avoid many, many hours of trial and error. The question is: what should those conventions be?

My feeling is that what we need to aim for is “pretty good uniqueness”, a concept that might be compared loosely to “pretty good privacy” or “probably approximately correct” learning. I don’t have a formal definition, nor even a very good rule of thumb, but I think we need to aim for a set of about tag conventions that are easy to use and which mean that collisions are very rare. We should not, however, aim for absolute uniqueness, as to do so would lead inexorably to conventions that are much less appealing to humans. In other words, we should aim to make about tag conventions lie in a sweet spot somewhere between the computer programmer’s “absolute, guaranteed, uniqueness in all circumstances” and the Real-World™, human-style “let’s not worry about it too much and just deal with collisions when they occur”.

The nearest I have to a rule of thumb is that when you’re uploading a reasonably large quantity of data to Fluidinfo (say, some tens of thousands of objects), most of the time, you should not encounter a conflict. I’m not sure how to quantify this. If 1% of items have conflicting about tags, I’m pretty clear that this is much too high a collision rate. And I’m pretty clear that 1-in-a-billion is OK. My guess is that it is probably good enough to aim for collision rates below about 1-in-a-million. But that’s just a feeling.
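To put rough numbers on that feeling, here is a back-of-envelope sketch in Python. It assumes (unrealistically, but usefully) that each uploaded object conflicts independently with the same probability; the upload size of 50,000 objects and both function names are illustrative choices of mine, not anything measured.

```python
import math


def expected_conflicts(n_objects, rate):
    """Expected number of conflicting about tags in an upload."""
    return n_objects * rate


def p_any_conflict(n_objects, rate):
    """Probability of at least one conflict, assuming independence.

    Computes 1 - (1 - rate) ** n_objects, using log1p/expm1 so that
    very small rates do not lose floating-point precision.
    """
    return -math.expm1(n_objects * math.log1p(-rate))


n = 50_000  # "some tens of thousands of objects"
for label, rate in [("1%", 1e-2), ("1-in-a-million", 1e-6), ("1-in-a-billion", 1e-9)]:
    print(label, expected_conflicts(n, rate), p_any_conflict(n, rate))
```

Under these assumptions, a 1% rate means around 500 conflicts in a single upload, whereas at 1-in-a-million there is only about a 5% chance of seeing any conflict at all, which is consistent with “most of the time, you should not encounter a conflict”.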

This can be made more concrete with some examples. One convention I suggested that seems to be being used quite widely and successfully is for books (as works, rather than individual editions, printings etc.). The basic form of this is to combine a ‘book:’ prefix with a normalized title and author. The normalization aims to remove ambiguity with case, punctuation etc., to make it more likely that different people will arrive at the same about tag, without significantly affecting uniqueness or legibility. So an example about tag for a book is:

book:nineteen eighty four (george orwell)

Notice that the (troublesome) hyphen that we would normally include when writing “nineteen eighty-four” has been removed, as have capitals (there’s a library available to do the standardization, which can be used in Python or online).

[The original version of the convention (book-1) also removed all accents from letters in an effort to reduce further the likelihood of minor variations; however, when Nicholas Tollervey (@ntoll) and Terry started publishing large volumes of book data that included some non-European names it became clear that this convention sometimes went a normalization too far, so the (so-far undocumented) book-u variant convention was born, in which letters are mapped to lower case, but accents are preserved. (This is supported in the Python library, but not yet in the web app.)]
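The normalization described above can be sketched in a few lines of Python. This is my own simplified illustration, not the code of the library mentioned earlier: the function names and the exact rules (lower-casing, replacing every run of punctuation or whitespace with a single space, and optionally stripping accents in the style of book-1) are assumptions, and the real conventions may treat cases such as apostrophes differently.

```python
import re
import unicodedata


def normalize(text, strip_accents=False):
    """Lower-case `text` and collapse punctuation/whitespace to single spaces.

    strip_accents=True mimics the book-1 behaviour of removing accents;
    the default keeps them, as in the book-u variant.
    """
    text = text.lower()
    if strip_accents:
        # Decompose accented letters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Replace each run of non-alphanumeric characters with one space.
    text = re.sub(r"[\W_]+", " ", text)
    return text.strip()


def book_about(title, author, strip_accents=False):
    """Build an about tag of the form 'book:title (author)'."""
    return "book:%s (%s)" % (
        normalize(title, strip_accents),
        normalize(author, strip_accents),
    )


print(book_about("Nineteen Eighty-Four", "George Orwell"))
print(book_about("Les Misérables", "Victor Hugo"))
print(book_about("Les Misérables", "Victor Hugo", strip_accents=True))
```

The first call yields the example about tag given above; the second and third show the difference between preserving and stripping accents.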

These conventions for about tags for books seem to me to hit the sweet spot I was talking about. Book titles, alone, are definitely not sufficiently unique in two different respects: first, it is not uncommon for different authors to write books with the same title; secondly, book titles (alone) are frequently shared with other (non-book) items, like films, people, places etc. However, by combining a prefix (book:) that specifies the class of object, together with the title and the author (all normalized), we get something that feels, for practical purposes, like pretty good uniqueness. I would be surprised if there are not examples of pairs of books that share both author and title, but I suspect those are so rare that they will cause us little trouble, and (personally) I feel quite content to do some ad hoc disambiguation to handle those cases.

Indeed, the pattern of a class prefix, a main identifier, and a disambiguator feels like a useful pattern for many kinds of Real-World™ entities to me. I’ve been discussing films, for example, with Michael Hawkes, in the comments on another blog post, and there it seems that using either film:title (year) or film:title (director) will probably work well. Again, there might be cases in which two directors sharing a name produce films of the same name, or in which two films of the same name are produced in the same year, but these seem likely to be so rare that ad hoc disambiguation of those cases might be acceptable. It is also, of course, not a coincidence that in the real world films are often identified by title and year or title and director. Michael and I both lean toward year as probably the better disambiguator, so I suspect I will soon be proposing film:title (year) as a convention; though American readers might prefer a “movie:” prefix.
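The general pattern of class prefix, main identifier and disambiguator is easy to express in code. The following Python sketch is purely illustrative: about_tag and normalize are hypothetical helpers of mine, the normalization is deliberately simplistic, and the film convention itself has not yet been settled.

```python
import re


def normalize(text):
    """Lower-case and replace runs of punctuation/whitespace with one space."""
    return re.sub(r"[\W_]+", " ", text.lower()).strip()


def about_tag(prefix, identifier, disambiguator):
    """Build 'prefix:identifier (disambiguator)' from normalized parts."""
    return "%s:%s (%s)" % (
        prefix,
        normalize(identifier),
        normalize(str(disambiguator)),  # str() so a year can be passed as an int
    )


# The two candidate film conventions discussed above.
print(about_tag("film", "Casablanca", 1942))
print(about_tag("film", "Casablanca", "Michael Curtiz"))
```

Switching to a “movie:” prefix, or to a different disambiguator, is then a one-argument change, which is part of the appeal of the pattern.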

For me, the other great virtue of this style of about tag is that it is very easy to construct the canonical about tag using only information that the user might reasonably expect to have at hand, rather than depending on some kind of external lookup. To labour the point, if I want to tag a book, I probably know the title and author, and can certainly find that information in the book. With a film, I concede, it would be less unusual to know the title but not the year or director, but even there, this data is easily available from multiple sources, crucially including from the film itself.

Perhaps unsurprisingly, there are those who feel that the whole notion of trying to organize, specify, or guide conventions is objectionably authoritarian and/or pointless, and that it would be much better simply to see what emerges organically. (Terry has been known to accuse me of “fascist librarian” tendencies, though I’m sure he means it in the nicest possible way.) Terry and I both studied so-called genetic algorithms, in which evolutionary processes are simulated on computers to tackle search and optimization tasks, and we are both impressed with the power of evolutionary mechanisms. I, however, fear that Fluidinfo doesn’t have the luxury of evolutionary timescales to succeed, and therefore tend to favour trying to help evolution along a little. If you don’t, just ignore all this, do your own thing, and pay no attention to the annoying tsking from Scotland.


  1. "Pretty good uniqueness". I like it. Also, while you're probably right about film: vs. movie: and Americans generally, I personally have no problem with film: as the class prefix there.