29 April 2011

Wikifying Fluidinfo

One of the descriptions people have been known to use for Fluidinfo is

The database with the heart of a wiki.

I have hated that description from the first time I heard it. For me, the defining, central idea of a wiki is that it offers a single version of the truth that anyone can change. Fluidinfo isn’t like that: anyone can write to it, but each user writes in her own, private space; Fluidinfo offers as many versions of the truth as there are users. There are no edit wars in Fluidinfo.

Sometimes, this is brilliant. Rhiannon sticks her opinion in rhiannon/opinion and I stick mine in njr/opinion. For information that is personal—whether because it represents an opinion, or something to do with the user, or something of interest only to one person and a few friends—storing information in namespaces is perfect. On top of that, the permissions system adds a powerful layer of flexibility.

Where it feels unnatural is when we are recording facts. For example, I published the periodic table to Fluidinfo under the miro namespace (Miró being the data analysis software produced by my company, Stochastic Solutions; probably the only analytics software in the world with native integration to Fluidinfo at this time). So now, if you go to the object with about tag element:Hydrogen (id 270a8269-f02d-4925-b152-da3934edaa43) you will find lots of useful data about Hydrogen. (It was all culled from about seven different pages of not-very-structured data in Wikipedia, which would be a nightmare for a machine to use.)

It looks like this:

daisyElementHydrogen.png

and like this:

fdb tags -F -a 'element:Hydrogen'
Object with about="element:Hydrogen":
/objects/270a8269-f02d-4925-b152-da3934edaa43
  fluiddb/about = "element:Hydrogen"
  njr/index/class = "element"
  miro/elements/Etymology = "Greek hydrogenes"
  miro/elements/Period = 1
  miro/elements/Group = 1
  miro/elements/AtomicWeight = 1.007947
  miro/elements/RelativeAtomicMass = 1.007947
  miro/class = "record"
  miro/elements/MeltingPointC = -258.975
  miro/elements/Description = "gas"
  miro/elements/Symbol = "H"
  miro/elements/BoilingPointF = -423.17
  miro/elements/BoilingPointC = -252.87
  miro/elements/db-record-number = 1
  miro/elements/ChemicalSeries = "Nonmetal"
  miro/elements/Name = "Hydrogen"
  miro/elements/Z = 1
  miro/elements/Colour = "colorless"
  miro/elements/db-next-record-about = "element:Helium"
  njr/rating = 10
  njr/index/about
  miro/elements/MeltingPointKelvin = 14.2
  miro/elements/Density = 8.988e-05

But that’s crazy. Hydrogen has atomic number 1; its symbol is H; its relative atomic mass really is somewhere near 1.007947. These simple facts have nothing to do with Miró, or njr. In fact, if they turn out to be wrong, matters are even worse, because no one except Miró can change them.

For all Fluidinfo’s elegance and power, it encourages information to be ghettoized and personalized even when people are really wanting just to add uncontentious, factual data. In doing so, it makes it hard for others to correct, extend and improve it, and harder for it to be found and used. It’s natural that if you want to know how I rate something, you would look at a tag called njr/rating; but how could you know that you need to look under a user called miro to find information about Hydrogen? It’s a problem.

What if Fluidinfo Were Actually More Like a Wiki?

Fluidinfo comes with a fairly powerful permissions system that allows detailed control over who can read and write each tag and namespace. A tag can be set so that only its owner can write to it, or so that everyone can, or so that only a named set of users can write to it.

This means that Fluidinfo already has a core piece of infrastructure for enabling some wiki-like functionality. At the simplest level, we could have a user whose top-level namespace gave write permission to everyone, and we could augment this with a policy that made that apply recursively to all sub-namespaces and tags under it. Perhaps we would call that user wiki. (I could actually create such a user; or you could; Terrell probably won’t.)

The suggestion would then be that anyone who wanted to publish factual information to Fluidinfo consider defaulting to writing it under the wiki namespace. Instead of

Is that enough?

I confess that if anyone had described Wikipedia to me before it existed, I would have been a naysayer; I would simply never have believed that its anarchical processes could possibly have produced anything of value. Clearly I was wrong; in practice, there is a vast amount of useful information in there, and the level of accuracy of information in Wikipedia is remarkably high.

Wikipedia deals with vandalism largely through the work of humans undoing vandalising changes, often with breathtaking speed. Fluidinfo does not, today, have a mechanism for tracking and reverting changes, though it would clearly be possible to create such a mechanism. (I’m not even sure what transaction records Fluidinfo keeps today, but the great thing about software is that if it doesn’t keep enough today, it could be changed so that it did tomorrow.)

Even today, Fluidinfo’s permissions system does offer some powerful controls. We could allow everyone write access to the whole of the wiki namespace by default, but remove abusers. Or we could be more restrictive. We could have groups controlling different tags or namespaces under wiki. We could allow people in cautiously, perhaps requiring endorsement first. There are many possibilities. In the absence of an easy way to revert vandalising changes, we might need to lean towards being more restrictive early on; if reversion capabilities were added to Fluidinfo, together with detailed tracking of which user makes which changes, we might be able to become more liberal.
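
To make these options a little more concrete, here is a toy Python model of an open-by-default wiki namespace with abusers removed and a more restrictive sub-namespace. It is only a sketch: the Permission class, the can_write function, the rule that a path inherits from its nearest configured ancestor, and the user names vandal and passerby are all assumptions made for illustration, not Fluidinfo's actual permissions API.

```python
from dataclasses import dataclass, field


@dataclass
class Permission:
    """Write permission for one namespace or tag (toy model)."""
    policy: str = "closed"                      # "open" or "closed"
    exceptions: set = field(default_factory=set)

    def allows(self, user):
        # Open: everyone except the exceptions; closed: only the exceptions.
        if self.policy == "open":
            return user not in self.exceptions
        return user in self.exceptions


def can_write(permissions, user, path):
    """Apply the nearest explicitly configured ancestor of path.

    With nothing configured, only the owner of the top-level
    namespace may write (the situation described for miro).
    """
    parts = path.split("/")
    for depth in range(len(parts), 0, -1):
        perm = permissions.get("/".join(parts[:depth]))
        if perm is not None:
            return perm.allows(user)
    return user == parts[0]


# Hypothetical set-up: wiki is open to all but one abuser, while
# wiki/elements is restricted to a named group.
PERMISSIONS = {
    "wiki": Permission("open", {"vandal"}),
    "wiki/elements": Permission("closed", {"njr", "rhiannon"}),
}

print(can_write(PERMISSIONS, "passerby", "wiki/books/title"))
print(can_write(PERMISSIONS, "vandal", "wiki/books/title"))
print(can_write(PERMISSIONS, "passerby", "wiki/elements/Symbol"))
print(can_write(PERMISSIONS, "njr", "wiki/elements/Symbol"))
```

In this model, loosening or tightening a domain later is just a matter of changing one entry in PERMISSIONS; what it cannot do is undo a bad write, which is why reversion would still matter.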

We could also have loose restrictions when we are trying to build up information in some domain, and potentially tighten them later. (After all, the periodic table is fairly stable, as are the planets, even if Pluto’s not a planet any more.)

Is this a Good Idea?

I don’t know.

I think there is a clear need for Fluidinfo to have some mechanism for detaching non-personal information from a (personal) user; for making reference information available in a more predictable, uniform, cooperatively gathered way. This can happen to some extent through individual initiatives, but something like the idea of a wiki user seems like a possible improvement.

On the other hand, this may just be a recipe for edit wars.

Overall, it seems to me that Wikipedia, and other similar collaborative projects, stand as a kind of existence proof that wiki-like mechanisms can work. So on balance, at the moment, I think there might be some benefit in trying something like the idea of a wiki user.

27 April 2011

Terry's Query

Regular readers of this blog will know that I have a special interest in conventions for Fluidinfo tags and values (especially the about tag) that some might consider borderline obsessive. Ironically, much of this focus comes from thinking about the canonical query that Terry (@terrycojones) always used to introduce Fluidinfo with. It usually went something like this:

Find me all the books that Russell (@rustlem) has rated more than 8 that I haven’t read.

This typically gets translated into a query like

rustlem/rating > 8 except terrycojones/has-read

I would argue

  1. that this isn’t really the right query
  2. that it doesn’t necessarily bring back the information Terry would really want
  3. that for it to work requires some more conventions

Let’s take these in turn.

1. The Wrong Query: Specifying the Kind of Object

At the simplest level, Terry’s English-language query specifies books, but his Fluidinfo query doesn’t. So as a bare minimum, Terry needs to add something to restrict the query to books.

My own favoured approach (as exemplified by both the book-1 and book-u conventions) is to prefix the about tag with book:. In principle, this would allow a query to pull out books only. An approximation would be to use the query:

(fluiddb/about matches "book:" and rustlem/rating > 8)
except terrycojones/has-read

though the way that the match operator works today actually throws away punctuation, so this really just restricts it to about tags containing book, rather than those that contain “book:”. I hope that in time we’ll get some alternative string matching operators so that we could, for example, use a regular expression such as

(fluiddb/about =~ /^book:.*$/ and rustlem/rating > 8)
except terrycojones/has-read

or perhaps an operator such as starts-with:

(fluiddb/about starts-with "book:" and rustlem/rating > 8)
except terrycojones/has-read
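
Until such operators exist, the difference between these variants can be sketched client-side. The following Python is a toy illustration only: OBJECTS is made-up data standing in for Fluidinfo (the film entry and all the ratings are invented), select is a hypothetical helper, and contains_word_book is just a rough approximation of a punctuation-dropping match, not a description of how Fluidinfo actually tokenizes.

```python
import re

# Made-up data: about tag -> {tag path: value}.
OBJECTS = {
    "book:white teeth (zadie smith)": {"rustlem/rating": 9},
    "book:animal farm (george orwell)": {"rustlem/rating": 10,
                                         "terrycojones/has-read": None},
    "film:the book thief (2013)": {"rustlem/rating": 9},
    "Eiffel Tower": {"rustlem/rating": 9},
    "book:dull tome (a n other)": {"rustlem/rating": 3},
}


def select(about_ok):
    """rustlem/rating > 8 except terrycojones/has-read, filtered by about tag."""
    return sorted(
        about for about, tags in OBJECTS.items()
        if about_ok(about)
        and tags.get("rustlem/rating", 0) > 8
        and "terrycojones/has-read" not in tags
    )


def any_about(about):
    return True                       # Terry's original query: no restriction


def contains_word_book(about):
    # Punctuation is thrown away, so "book:" is the same as the word "book".
    return "book" in re.findall(r"[a-z0-9]+", about.lower())


def starts_with_book(about):
    return about.startswith("book:")  # the hoped-for starts-with operator


print(select(any_about))
print(select(contains_word_book))
print(select(starts_with_book))
```

On this data the unrestricted query also returns the Eiffel Tower and a film, the word match still lets the film through, and only the prefix test returns books alone.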

Of course, there are plenty of other ways we might identify books. These include:

  • Looking for a tag that somehow indicates the kind of object, if we know such a tag and trust it to be reasonably thorough; for example, I have started an embryonic classification project using the tag njr/index/class. It hasn’t got very far yet, but a lot of Twitter users have their njr/index/class set to “twitter-user” and a few books (ironically, not the ones starting book: yet, but some others) have a class of book. I hope, over time, to expand this to cover most objects with about tags in Fluidinfo.

    (I keep wondering whether “kind” would have been a better word than “class”. But if I had chosen “kind”, I suspect I’d be worrying that “class” would have been better. At least I avoided “type”!)

  • An interesting variation of the above would be to have some kind of group-writable tag for class. I’ve been talking to Terry and some of the other Fluidinfo people about the idea of a wiki user for Fluidinfo. Lots of people would have write access to some or all tags/namespaces under the wiki user. If this happened, the idea would be that when you want to put personal data into Fluidinfo, you would obviously use your own namespace, but if you were wanting to put factual/reference data in, you might do that using the wiki user. This would potentially solve the problem of knowing which tag to look for to find core information like a book’s author and title, a film’s director and year etc. As in Wikipedia, there would be potential for spam and other abuse, but at least in Fluidinfo we have permissions and policies, so we could be permissive in allowing people to write to the wiki namespaces but harsh in removing access for abusers. I’m not totally convinced by this idea, but I can see many virtues in it.

    If there were such a wiki user, and it had a class tag, the query might become:

    (wiki/class = "book" and rustlem/rating > 8)
    except terrycojones/has-read

  • If we happen to know that Russell chooses to tag his books (only) with a particular tag such as has-read, we can add a clause like that to the query:

    (has rustlem/has-read and rustlem/rating > 8)
    except terrycojones/has-read

    though in practice, it might be more likely that has-read would also be used for things like online articles.

  • If there were a recognized authoritative book user in Fluidinfo (perhaps isbn.org or amazon.com), we might be able to use the presence of one of its tags as an indicator. For example, if isbn.org were known to tag all (or enough) books with tags including isbn.org/book/title and isbn.org/book/author then we would be able to use something like

    (has isbn.org/book/title and rustlem/rating > 8)
    except terrycojones/has-read

    But this obviously depends on the existence of such an authoritative user.

  • If we had wildcarding on namespaces, we might also accept something like

    (has */book/title and rustlem/rating > 8)
    except terrycojones/has-read

    which would certainly add to the breadth of results, but might introduce quite a lot of extra noise, especially once the system has a lot of users.

2. Pulling Back the Right Information

The second interesting issue with Terry’s query is how you get the right information back. In the original Fluidinfo API, all the query could return for you is the object ID; you’d then have to browse the object and look at its tags to understand what it is.

The current API offers richer alternatives, allowing objects to be specified by about tag and allowing particular tags to be requested for an object matching a query, including the about tag.

If the about tag itself contains the key information (in the case of a book, the title and author, normalized) then the query can pretty much work just by specifying the objects of interest and requesting the about tag. You can actually do this today. For example, the following fdb command (in which I’ve used njr instead of rustlem, since Russell hasn’t quite got around to rating anything in Fluidinfo yet) actually works:

fdb show -q '(fluiddb/about matches "book" and njr/rating > 8) except has terrycojones/has-read' /about
4 objects matched
Object 8cd6fd0a-40e1-4889-ac3f-2b3dbb6f861d:
  /fluiddb/about = "book:through the looking glass and what alice found there (lewis carroll)"
Object 1c3b1874-0413-4607-97db-74cb9c92dcbf:
  /fluiddb/about = "book:fugitive pieces (anne michels)"
Object c64aeced-1505-4bb3-ab8a-0ce4c6a70ba3:
  /fluiddb/about = "book:white teeth (zadie smith)"
Object a78d77ce-a055-40e3-97a9-de4223858bd8:
  /fluiddb/about = "book:nineteen eighty four (george orwell)"

It should be noted, however, that it works as well as it does partly because all the books I have rated have about tags in my preferred (book-1) convention, so that the title and author are obvious and the class (book) is in the about tag.

I would argue that this is pretty good and meaningful, though of course we would also want information to be stored in separate tags for the title, author and other useful things like year etc. This is true for most, if not all, of the books matched here, but you have to know that the title and author information is under the username miro. Again, trusted authoritative users with known conventions will help here, as might wildcarding on namespaces.

The wiki user would also help. It turns out that three of the four books this query matches have title and author information under the miro/books namespace, so if you add these tags to the fdb request, you get most of what you want.

fdb show -q '(fluiddb/about matches "book" and njr/rating > 8) except has terrycojones/has-read' /about /miro/books/title /miro/books/author
4 objects matched

Object 8cd6fd0a-40e1-4889-ac3f-2b3dbb6f861d:
  /fluiddb/about = "book:through the looking glass and what alice found there (lewis carroll)"
  /miro/books/title = "Through the Looking-Glass, and What Alice Found There"
  /miro/books/author = "Lewis Carroll"

Object 1c3b1874-0413-4607-97db-74cb9c92dcbf:
  /fluiddb/about = "book:fugitive pieces (anne michels)"
  (tag /miro/books/title not present)
  (tag /miro/books/author not present)

Object c64aeced-1505-4bb3-ab8a-0ce4c6a70ba3:
  /fluiddb/about = "book:white teeth (zadie smith)"
  /miro/books/title = "White Teeth"
  /miro/books/author = "Zadie Smith"

Object a78d77ce-a055-40e3-97a9-de4223858bd8:
  /fluiddb/about = "book:nineteen eighty four (george orwell)"
  /miro/books/title = "Nineteen Eighty-four"
  /miro/books/author = "George Orwell"

If the wiki user existed, this might be even more natural:

fdb show -q '(wiki/class matches "book" and njr/rating > 8) except has terrycojones/has-read' /about /wiki/title /wiki/author
4 objects matched

Object 8cd6fd0a-40e1-4889-ac3f-2b3dbb6f861d:
  /fluiddb/about = "book:through the looking glass and what alice found there (lewis carroll)"
  /wiki/title = "Through the Looking-Glass, and What Alice Found There"
  /wiki/author = "Lewis Carroll"

Object 1c3b1874-0413-4607-97db-74cb9c92dcbf:
  /fluiddb/about = "book:fugitive pieces (anne michels)"
  /wiki/title = "Fugitive Pieces"
  /wiki/author = "Anne Michels"

Object c64aeced-1505-4bb3-ab8a-0ce4c6a70ba3:
  /fluiddb/about = "book:white teeth (zadie smith)"
  /wiki/title = "White Teeth"
  /wiki/author = "Zadie Smith"

Object a78d77ce-a055-40e3-97a9-de4223858bd8:
  /fluiddb/about = "book:nineteen eighty four (george orwell)"
  /wiki/title = "Nineteen Eighty-four"
  /wiki/author = "George Orwell"

(Note: the above query does not work; there is no wiki user at present.)

3. Tag Conventions

As well as the considerations around object class and pulling back the desired information, there is a final requirement of knowing what tags actually to include in the query, and what they mean. I’ve written about this before, but will recap briefly.

Essentially, we need to know

  1. what the tag is called
  2. what range of values it uses and which end of the scale is better.

For a particular friend, this is not too hard to find out (though it is still much easier if everyone uses the same conventions), but shared conventions become far more useful once essentially everyone follows them. If everyone uses the same conventions, we can imagine that a future API might support queries like

(fluiddb/about starts-with "book:" and */rating > 8)
except terrycojones/has-read

or perhaps

(fluiddb/about starts-with "book:" and [njr, rustlem, ntoll]/rating > 8)
except terrycojones/has-read

or even

(fluiddb/about starts-with "book:" and mean(*/rating) > 8)
except terrycojones/has-read

if we simply trust that everyone is using a 0–10 scale for ratings. If we are worried about individuals skewing the system by using ratings on a 0–1,000,000 scale, then by adding some query complexity we can even imagine filtering those out, though I concede this will always be somewhat painful.
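
As a rough sketch of what such an aggregate query with filtering might compute, here is a client-side version in Python. Everything in it is an assumption for illustration: the ratings are made up, offscale is a hypothetical user, and dropping individual out-of-range values is only one simple way of guarding against a user on a different scale.

```python
from statistics import mean

# Made-up ratings: about tag -> {user: rating}.
RATINGS = {
    "book:white teeth (zadie smith)": {"njr": 9, "rustlem": 10,
                                       "offscale": 1000000},
    "book:fugitive pieces (anne michels)": {"njr": 9, "ntoll": 6},
    "Eiffel Tower": {"njr": 10},
}


def mean_rating(ratings_by_user, low=0, high=10):
    """Mean of the ratings that lie on the agreed scale, or None if none do."""
    in_scale = [r for r in ratings_by_user.values() if low <= r <= high]
    return mean(in_scale) if in_scale else None


def well_rated_books(objects, threshold=8):
    """Books whose mean in-scale rating exceeds the threshold."""
    result = []
    for about, ratings in objects.items():
        average = mean_rating(ratings)
        if (about.startswith("book:") and average is not None
                and average > threshold):
            result.append(about)
    return sorted(result)


print(well_rated_books(RATINGS))
```

Without the range check, the single off-scale rating would drag the mean for White Teeth into the hundreds of thousands, which is exactly the skew the convention is meant to prevent.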

The Challenge

It’s the same old themes, but here I’m trying to illustrate just how directly all the Fluidinfo themes I keep harping on about relate to some of the most fundamental motivations for building this system in the first place. For those who believe all the stuff about conventions is a distraction that doesn’t matter, my challenge is: how else are we going to allow Terry’s query actually to work in Fluidinfo?

One possible answer lies within applications. Clearly, if a FluidBook application appears, that has its own conventions (whatever they may be) for storing information in Fluidinfo, it may be easy to get consistency within the data from that particular application. But Terry’s real dream involves allowing users to choose which applications to use and for data to be shared seamlessly across those applications. Again, it’s obvious that a few applications can agree conventions among themselves, but the most universal way of supporting interoperability is if conventions just exist at the level of Fluidinfo itself.

A wiki user might also provide another way forward.

14 April 2011

Choice and Conformity in fdb

I’ve just pushed a new version of fdb.py to the GitHub repository.

The main change in this version is that I’ve added support for allowing the user to choose whether to use what we might call Unix-style paths or Fluidinfo-style paths.

Until now, fdb.py has, as part of its shell-like functionality, deliberately provided an alternative view of Fluidinfo from the underlying structure. The main features of this “Unix-style view” are as follows:

  • Full (absolute) fdb.py tag paths start with a leading slash. So my rating tag would be /njr/rating rather than njr/rating.
  • A tag path without a leading slash is taken to be a relative path, currently always relative to the user’s namespace (though there are alternate versions with a notion of a current working namespace, or CWD, which can be changed with a cd command). Thus, when using my credentials, the Fluidinfo tag njr/rating can be referred to as rating, while ntoll’s rating is /ntoll/rating.
  • /about is provided as a synonym for the special tag fluiddb/about (the about tag).
  • /id is provided as a pseudo-tag that will report the value of the object’s Fluidinfo ID.

This was not carried all the way: I didn’t re-write queries, but, for me at least, it saved much typing and pain when using fdb from the command line.

This release (1.33) maintains this behaviour by default, but allows the user to configure or tell the system that she would prefer to use genuine, regular all-American Fluidinfo-style paths. There are two ways to invoke this alternative behaviour:

  • If you would always prefer to use regular Fluidinfo-style paths, the best thing to do is to add a third line to the credentials file that fdb uses saying

    unix-style-paths false

    (I need hardly add that using true instead of false sets the opposite preference.)

  • Alternatively, if you just want to override the configured or default behaviour for a one-off command, use the command-line flag -F. Similarly, to override the behaviour to force Unix-style paths, use -U.

When you choose Fluidinfo-style-paths, this is what happens:

  • Only command-line commands are affected: if you use fdb.py through the API, nothing changes unless you work quite hard.
  • Any time you specify a path, it needs to be an absolute path. In Terry’s world, absolutely all paths are absolute.
  • /about is not accepted as a synonym for fluiddb/about
  • The only special case is /id. Since this is a useful pseudo tag (in my view), and since it has no namespace, the same trick works as when using unix-style paths. Thus you can request the tag /id and it will return the object ID.
  • Output as well as input is affected, i.e. tag paths will be reported without a leading slash.
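
The mapping between the two styles can be sketched in a few lines of Python. This is an illustration of the rules as described above, not the code in fdb.py; the function names to_fluidinfo_path and to_unix_path are hypothetical.

```python
def to_fluidinfo_path(path, user):
    """Convert a Unix-style tag path to a Fluidinfo-style one."""
    if path == "/id":
        return path                  # pseudo-tag, same in both styles
    if path == "/about":
        return "fluiddb/about"       # synonym for the about tag
    if path.startswith("/"):
        return path[1:]              # absolute: drop the leading slash
    return user + "/" + path         # relative to the user's namespace


def to_unix_path(path):
    """Convert a Fluidinfo-style tag path to an absolute Unix-style one."""
    return path if path == "/id" else "/" + path


for unix_path in ["rating", "/njr/rating", "/ntoll/rating", "/about", "/id"]:
    print(unix_path, "->", to_fluidinfo_path(unix_path, user="njr"))
```

Going the other way is lossy by design: a Fluidinfo-style path always comes back as an absolute Unix-style path, never as a relative one or as /about.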

Examples

Old (default) behaviour:

$ fdb tags -a "Eiffel Tower"
Object with about="Eiffel Tower":
/objects/93bd1999-0998-49cc-8004-af457ce34ce4
  /njr/location = "Paris"
  /fluiddb/about = "Eiffel Tower"
  /njr/index/about

Behaviour with -F or with unix-style-paths false:

$ fdb tags -F -a "Eiffel Tower"
Object with about="Eiffel Tower":
/objects/93bd1999-0998-49cc-8004-af457ce34ce4
  njr/location = "Paris"
  fluiddb/about = "Eiffel Tower"
  njr/index/about

Setting and showing tags (old/default behaviour):

$ fdb tag -a "Eiffel Tower" rating=7
$ fdb show -a "Eiffel Tower" rating
Object with about="Eiffel Tower":
  /njr/rating = 7

Behaviour with -F or with unix-style-paths false:

$ fdb tag -F -a "Eiffel Tower" njr/rating=8
$ fdb show -F -a "Eiffel Tower" njr/rating
Object with about="Eiffel Tower":
  njr/rating = 8

The same behaviour works with untag:

$ fdb untag -F -a "Eiffel Tower" njr/rating
$ fdb show -F -a "Eiffel Tower" njr/rating
Object with about="Eiffel Tower":
  (tag njr/rating not present)

I may have missed something, but as far as I can see, this works reliably. If I have missed something, let me know.

So: if you’ve always liked the look of fdb but disliked its unix-style paths, now might be a good time to get it. And if you already use it, but would prefer to use Fluidinfo-style paths, just add

unix-style-paths false

as the third line of your credentials file.

fdb.py 1.30 — Embryonic /values API support

I’ve just pushed a new version of fdb.py to the GitHub repository.

This release doesn’t change the command line but does add support for the (not-so) new /values API.

The /values API is a huge step forward for Fluidinfo, and I should have started adding support ages ago. It allows bulk reading and writing of tags on groups of objects that can be specified with a Fluidinfo query vastly more efficiently than was possible before.

There are a few things to note about the implementation:

  • It is inconsistent with the rest of fdb.py at present; I aim to remedy this (see below).
  • In particular, all strings must be unicode for the new calls (input and output), tags being written must already exist (ouch!) and any tags must be specified using full (absolute) Fluidinfo paths with no leading slash (unlike everywhere else in fdb.py).
  • The new calls are not exploited by the command line commands show or tag, so those will run no faster. Obviously, I plan to change this over time, too.

Examples

The following code shows simple use of the two main new calls, get_values and tag_by_query.

import fdb

db = fdb.FluidDB()

# Get about tag and njr/rating for objects njr has rated < 2:
values = fdb.get_values(db, u'njr/rating < 2',
                        [u'fluiddb/about', u'njr/rating'])
print u'Low ratings:\n'
for v in values:
    print unicode(v), u'\n'

# Tag those same objects with njr/dislike = True.
# (Currently requires njr/dislike to exist; easy to create using
# the command line to tag a single object.)

fdb.tag_by_query(db, u'njr/rating < 2',
                 {u'njr/dislike': True})

#
# Now get the ones that are disliked:
#
values = fdb.get_values(db, u'has njr/dislike',
                        [u'fluiddb/about', u'njr/rating', u'njr/dislike'])
print u'Disliked:\n'
for v in values:
    print unicode(v), u'\n'

When run, this produces the following (for me, right now; you won’t be able to run it as-is, because you don’t have write access to my dislike tag.)

$ python ex.py
Low ratings:

   fluiddb/about: book:foucaults pendulum (umberto eco)
              id: a98f2c80-ae5f-405a-a319-d47122ae9da3
      njr/rating: 1

   fluiddb/about: The_Beatles
              id: 5157c69e-ceaf-4e7c-9423-d67751d029d3
      njr/rating: 1

   fluiddb/about: book:beloved (toni morrison)
              id: 1ab066e8-c2a1-4769-9121-e3346849e7e4
      njr/rating: 1

   fluiddb/about: book:the lord of the rings (jrr tolkien)
              id: ff873602-e9a8-4f9a-a7d4-c0cfc394a120
      njr/rating: 1

   fluiddb/about: book:oranges are not the only fruit (jeanette winterson)
              id: 7aed1e67-a88e-439d-8a56-b2ab52c838ab
      njr/rating: 0

Disliked:

   fluiddb/about: book:foucaults pendulum (umberto eco)
              id: a98f2c80-ae5f-405a-a319-d47122ae9da3
     njr/dislike: True
      njr/rating: 1

   fluiddb/about: The_Beatles
              id: 5157c69e-ceaf-4e7c-9423-d67751d029d3
     njr/dislike: True
      njr/rating: 1

   fluiddb/about: book:beloved (toni morrison)
              id: 1ab066e8-c2a1-4769-9121-e3346849e7e4
     njr/dislike: True
      njr/rating: 1

   fluiddb/about: book:the lord of the rings (jrr tolkien)
              id: ff873602-e9a8-4f9a-a7d4-c0cfc394a120
     njr/dislike: True
      njr/rating: 1

   fluiddb/about: book:oranges are not the only fruit (jeanette winterson)
              id: 7aed1e67-a88e-439d-8a56-b2ab52c838ab
     njr/dislike: True
      njr/rating: 0

How fdb.py Will Change

Obviously, the fact that these calls are inconsistent with the rest of fdb.py is unfortunate; in fact, it’s plain terrible. I plan to make a number of changes to make this situation better.

First, I plan to change the whole of fdb.py to use unicode internally. This shouldn’t affect the command line much, except for making some unicode cases work better, but will affect users of the fdb.py API.

Secondly, I plan to allow users to use full, fdb.py-style absolute or relative paths for tags, to make it consistent with the rest.

Thirdly, like elsewhere in fdb.py, it will create tags if they don’t exist, as required.

Finally, I plan to add an option to fdb.py to allow the user to choose whether to use fdb.py/Unix-style paths or 100%-genuine, @terrycojones-approved Fluidinfo-style absolute paths only, with no leading slashes, requiring fluiddb/about and disallowing relative paths. I plan to make this choice available both through configuration option and a command-line flag (in the case of the command line), and with an extra initialization parameter for use of the API. The configuration will affect both specification of tags by the user and reporting of tags by the system. So there will be choice. (Maybe Terry will be so pleased he’ll add tag-creation-by-tag-writing for me as a “thank you”.)

Personally, I will continue to use Unix-style paths, because I find it so inconvenient not to have relative paths and to have to add fluiddb/ just to refer to the about tag; but others may prefer conformity.

For those who have been following the mailing list, I will also aim to add a few other aggregation functions (currently there is only count).

05 April 2011

Pretty Good Uniqueness

Software developers are neurotic about uniqueness: no two files may share the same path, no two users the same ID. That’s probably good: we like money and email to go to the right person.

Over in the Real World™, people are more relaxed. We tolerate quite a lot of ambiguity, relying partly on context to remove it, and partly on clarification when necessary–“Paris, France, not Paris, Texas”. We even tolerate a certain level of confusion and error as a reasonable price to pay for not always having to refer to each other by National Insurance number.

Terry Jones (not the Python, nor the Qu’ran burning pastor, but @terrycojones, the unorthodox visionary behind Fluidinfo) frequently says that he wants to make working with information in computers more like working with information in the Real World™. It’s a useful goal.

Almost from the first moment I heard about Fluidinfo, with its model of information sharing based on tagging common objects, I’ve been interested in (some might say obsessed with) the question of how to map Real-World™ objects and concepts (like Paris, Animal Farm, The Eiffel Tower, Existential Philosophy and the ring on my finger) to Fluidinfo objects, romantically identified, as they are, by 128-bit integers (hubristically so-called ‘universally unique identifiers’ [UUIDs]) such as 6387ab3f-e3d5-4ca9-bd13-ae3f-fd9c1830.

Fluidinfo’s about tag (fluiddb/about, to give it its full name) was created specifically to make it easier to decide where to put information in Fluidinfo. Every object in Fluidinfo, when it’s created, can optionally have this about tag set to a unicode string and Fluidinfo guarantees that about tags are unique, i.e. that no two different objects will ever share an about tag. As a result, you can directly address objects in Fluidinfo by specifying an about tag. For example, http://fluiddb.fluidinfo.com/about/Paris is the URL for the Fluidinfo object with the about tag “Paris” (UUID 17ecdfbc-c148-41d3-b898-0b5396ebe6cc, since you ask).
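
For about tags that contain spaces or punctuation, the same kind of URL can be built programmatically. The sketch below uses Python's standard urllib.parse.quote; the about_url helper is hypothetical, and percent-encoding every reserved character is an assumption about what the server expects rather than something taken from the Fluidinfo documentation.

```python
from urllib.parse import quote

BASE = "http://fluiddb.fluidinfo.com/about/"


def about_url(about):
    """URL addressing the object with the given about tag."""
    # safe="" so that characters such as "/" and ":" inside the about
    # tag are escaped rather than treated as URL structure.
    return BASE + quote(about, safe="")


print(about_url("Paris"))
print(about_url("Eiffel Tower"))
```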

Fluidinfo, by Terry’s very specific design, does not force anyone to use about tags in any specific way. Any Fluidinfo user can attach any information to any Fluidinfo object she likes. If user jacqui decides to attach information about Paris, Texas to the Paris object above, and gemma chooses to use it to store information about Paris, France, that is entirely fine. It’s even fine if Fluidinfo user anarchist decides to store information about Birmingham (Alabama), or existential philosophy, or her entire record collection on the same object. There will be no one from Fluidinfo complaining or banning or undoing (though it’s possible that those with acute hearing may perceive a quiet “tsk, tsk” sound emanating from the author of this blog).

I believe, however, that most Fluidinfo users will want there to be conventions for about tags that will encourage information about the same thing to be stored on a well-defined common object, and for information about different things to be stored on different objects. Of course, we won’t always get those conventions right first time, and they will evolve over time, but my feeling is that a few hours of thinking can avoid many, many hours of trial and error. The question is: what should those conventions be?

My feeling is that what we need to aim for is “pretty good uniqueness”, a concept that might be compared loosely to “pretty good privacy” or “probably approximately correct” learning. I don’t have a formal definition, nor even a very good rule of thumb, but I think we need to aim for a set of about tag conventions that are easy to use and which mean that collisions are very rare, but that we should not aim for absolute uniqueness, as to do so would lead inexorably to conventions that are much less appealing to humans. In other words, we should aim to make about tag conventions lie in a sweet spot somewhere between the computer programmer’s “absolute, guaranteed, uniqueness in all circumstances” and the Real-World™, human-style “let’s not worry about it too much and just deal with collisions when they occur”.

The nearest I have to a rule of thumb is that when you’re uploading a reasonably large quantity of data to Fluidinfo (say, some tens of thousands of objects), most of the time, you should not encounter a conflict. I’m not sure how to quantify this. If 1% of items have conflicting about tags, I’m pretty clear that this is much too high a collision rate. And I’m pretty clear that 1-in-a-billion is OK. My guess is that it is probably good enough to aim for collision rates below about 1-in-a-million. But that’s just a feeling.
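
To put rough numbers on that feeling, here is a small Python sketch. It assumes (and this is only an assumption) that each uploaded item collides independently with the same probability, and it takes 20,000 items as a stand-in for “some tens of thousands”. The function name `prob_any_collision` is invented for the illustration.

```python
def prob_any_collision(per_item_rate, n_items):
    """Chance that at least one of n_items collides, assuming independence."""
    return 1.0 - (1.0 - per_item_rate) ** n_items


# "Some tens of thousands" of objects; the exact figure is illustrative.
n = 20000

# The three per-item collision rates discussed in the text.
for rate in (1e-2, 1e-6, 1e-9):
    p = prob_any_collision(rate, n)
    print(f"rate {rate:g}: P(at least one collision) = {p:.6f}")
```

Under these assumptions, a 1% rate makes a conflict all but certain, a 1-in-a-million rate gives roughly a 2% chance of any conflict in the whole upload, and 1-in-a-billion gives roughly 0.002%. That is at least consistent with the idea that, at 1-in-a-million, most uploads of this size would not hit a conflict.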

This can be made more concrete with some examples. One convention I suggested that seems to be being used quite widely and successfully is for books (as works, rather than individual editions, printings etc.). The basic form of this is to combine a ‘book:’ prefix with a normalized title and author. The normalization aims to remove ambiguity with case, punctuation etc., to make it more likely that different people will arrive at the same about tag, without significantly affecting uniqueness or legibility. So an example about tag for a book is:

book:nineteen eighty four (george orwell)

Notice that the (troublesome) hyphen that we would normally include when writing “nineteen eighty-four” has been removed, as have capitals. (There’s a library available to do the standardization, which can be used in Python or online.)
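
As a rough illustration of this kind of normalization, here is a minimal Python sketch. It is not the library mentioned above: the function names `normalize` and `book_about_tag`, and the exact rules (lower-casing, dropping apostrophes, turning other punctuation into spaces, collapsing whitespace), are assumptions chosen simply to reproduce the Orwell example.

```python
import re


def normalize(text):
    """Lower-case, strip punctuation and collapse whitespace (assumed rules)."""
    text = text.lower()
    # Drop apostrophes outright, so "philosopher's" becomes "philosophers".
    text = re.sub(r"['’]", "", text)
    # Replace any other non-alphanumeric character (hyphens, commas, ...)
    # with a space.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


def book_about_tag(title, author):
    """Hypothetical helper: build a 'book:title (author)' about tag."""
    return f"book:{normalize(title)} ({normalize(author)})"


print(book_about_tag("Nineteen Eighty-Four", "George Orwell"))
# prints: book:nineteen eighty four (george orwell)
```

The point of the sketch is only that two people typing the title with different capitalization, hyphenation or spacing should arrive at the same about tag.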

[The original version of the convention (book-1) also removed all accents from letters in an effort to reduce further the likelihood of minor variations; however, when Nicholas Tollervey (@ntoll) and Terry started publishing large volumes of book data that included some non-European names it became clear that this convention sometimes went a normalization too far, so the (so-far undocumented) book-u variant convention was born, in which letters are mapped to lower case, but accents are preserved. (This is supported in the python library, but not yet in the web app.)]
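
The difference between the two variants can be sketched in Python as follows. This is an assumed reconstruction, not the library’s code: `strip_accents` and `normalize_name` are hypothetical names, accent removal is done here by Unicode decomposition, and the author name is just an illustrative one with accented letters.

```python
import unicodedata


def strip_accents(text):
    """Remove combining accent marks, e.g. 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_name(name, keep_accents):
    """Assumed rule: always lower-case; strip accents only in book-1 style."""
    name = name.lower()
    return name if keep_accents else strip_accents(name)


author = "Gabriel García Márquez"
print(normalize_name(author, keep_accents=False))  # book-1 style: accents removed
print(normalize_name(author, keep_accents=True))   # book-u style: accents preserved
```

Stripping accents like this only makes sense for scripts in which the unaccented letter is a reasonable stand-in, which is one way of seeing why the original convention could go a normalization too far.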

These conventions for about tags for books seem to me to hit the sweet spot I was talking about. Book titles, alone, are definitely not sufficiently unique, in two different respects: first, it is not uncommon for different authors to write books with the same title; secondly, book titles (alone) are frequently shared with other (non-book) items, like films, people, places etc. However, by combining a prefix (book:) that specifies the class of object with the title and the author (all normalized), we get something that feels, for practical purposes, like pretty good uniqueness. I would be surprised if there are not examples of pairs of books that share both author and title, but I suspect those are so rare that they will cause us little trouble, and I (personally) feel quite content to do some ad hoc disambiguation to handle those cases.

Indeed, the pattern of a class prefix, a main identifier, and a disambiguator feels like a useful pattern for many kinds of Real-World™ entities to me. I’ve been discussing films, for example, with Michael Hawkes, in the comments on another blog post, and there it seems that using either film:title (year) or film:title (director) will probably work well. Again, there might be cases in which two directors sharing a name produce films of the same name, or in which two films of the same name are produced in the same year, but these seem likely to be so rare that ad hoc disambiguation of those cases might be acceptable. It is also, of course, not a coincidence that in the real world films are often identified by title and year or title and director. Michael and I both lean toward year as probably the better disambiguator, so I suspect I will soon be proposing film:title (year) as a convention; though American readers might prefer a “movie:” prefix.
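
The general prefix, identifier, disambiguator pattern can be sketched in a few lines of Python. The helper `about_tag` is hypothetical, and since no film convention has actually been settled, the details here (lower-casing and whitespace collapsing only, with no punctuation handling) are assumptions for illustration.

```python
def about_tag(prefix, identifier, disambiguator):
    """Hypothetical helper: 'prefix:identifier (disambiguator)', lower-cased."""
    # Lower-case each part and collapse any runs of whitespace.
    main = " ".join(str(identifier).lower().split())
    extra = " ".join(str(disambiguator).lower().split())
    return f"{prefix}:{main} ({extra})"


# Title plus year as the disambiguator.
print(about_tag("film", "Casablanca", 1942))
# prints: film:casablanca (1942)

# Title plus director as the disambiguator.
print(about_tag("film", "Casablanca", "Michael Curtiz"))
# prints: film:casablanca (michael curtiz)
```

Either form can be built from information printed on the film itself, and swapping the prefix to “movie” would be a one-word change.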

For me, the other great virtue of this style of about tag is that it is very easy to construct the canonical about tag using only information that the user might reasonably expect to have at hand, rather than depending on some kind of external lookup. To labour the point, if I want to tag a book, I probably know the title and author, and can certainly find that information in the book. With a film, I concede, it would be less unusual to know the title but not the year or director, but even there, this data is easily available from multiple sources, crucially including from the film itself.

Perhaps unsurprisingly, there are those who feel that the whole notion of trying to organize, specify, or guide conventions is objectionably authoritarian and/or pointless, and that it would be much better simply to see what emerges organically. (Terry has been known to accuse me of “fascist librarian” tendencies, though I’m sure he means it in the nicest possible way.) Terry and I both studied so-called genetic algorithms, in which evolutionary processes are simulated on computers to tackle search and optimization tasks, and we are both impressed with the power of evolutionary mechanisms. I, however, fear that Fluidinfo doesn’t have the luxury of evolutionary timescales in which to succeed, and therefore tend to favour trying to help evolution along a little. If you don’t, just ignore all this, do your own thing, and pay no attention to the annoying tsking from Scotland.
