27 April 2011

Terry's Query

Regular readers of this blog will know that I have a special interest in conventions for Fluidinfo tags and values (especially the about tag) that some might consider borderline obsessive. Ironically, much of this focus comes from thinking about the canonical query that Terry (@terrycojones) always used to introduce Fluidinfo with. It usually went something like this:

Find me all the books that Russell (@rustlem) has rated more than 8 that I haven’t read.

This typically gets translated into a query like

rustlem/rating > 8 except terrycojones/has-read

I would argue

  1. that this isn’t really the right query
  2. that it doesn’t necessarily bring back the information Terry would really want
  3. that it requires some more conventions to work

Let’s take these in turn.

1. The Wrong Query: Specifying the Kind of Object

At the simplest level, Terry’s English-language query specifies books, but his Fluidinfo query doesn’t. So as a bare minimum, Terry needs to add something to restrict the query to books.

My own favoured approach (as exemplified by both the book-1 and book-u conventions) is to prefix the about tag with book:. In principle, this would allow a query to pull out books only. An approximation would be to use the query:

(fluiddb/about matches "book:" and rustlem/rating > 8)
except terrycojones/has-read

though the way that the matches operator works today actually throws away punctuation, so this really just restricts it to about tags containing book, rather than those that contain “book:”. I hope that in time we’ll get some alternative string matching operators so that we could, for example, use a regular expression such as

(fluiddb/about =~ /^book:.*$/ and rustlem/rating > 8)
except terrycojones/has-read

or perhaps an operator such as starts-with:

(fluiddb/about starts-with "book:" and rustlem/rating > 8)
except terrycojones/has-read
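The difference between word matching and a true prefix test can be sketched client-side in Python. This is only an illustration: the tokenisation below is an assumed approximation of how matches discards punctuation (the post does not specify it), the helper names are hypothetical, and two of the three about values are invented for contrast.

```python
import re


def matches_word(about: str, word: str) -> bool:
    """Assumed stand-in for a word-based match: punctuation is thrown away."""
    tokens = re.findall(r"[a-z0-9]+", about.lower())
    return word.lower() in tokens


def starts_with(about: str, prefix: str) -> bool:
    """The behaviour a hypothetical starts-with operator would give."""
    return about.startswith(prefix)


abouts = [
    "book:white teeth (zadie smith)",  # about tag taken from this post
    "the book thief (film)",           # hypothetical non-book object
    "book club minutes",               # hypothetical non-book object
]

by_word = [a for a in abouts if matches_word(a, "book")]
by_prefix = [a for a in abouts if starts_with(a, "book:")]

print(len(by_word), "matched by word;", len(by_prefix), "matched by prefix")
```

Under these assumptions the word test accepts all three about values, while the prefix test keeps only the one that follows the book: convention.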

Of course, there are plenty of other ways we might identify books. These include:

  • Looking for a tag that somehow indicates the kind of object, if we know such a tag and trust it to be reasonably thorough; for example, I have started an embryonic classification project using the tag njr/index/class. It hasn’t got very far yet, but a lot of Twitter users have their njr/index/class set to “twitter-user” and a few books (ironically, not the ones starting book: yet, but some others) have a class of book. I hope, over time, to expand this to cover most objects with about tags in Fluidinfo.

    (I keep wondering whether “kind” would have been a better word than “class”. But if I had chosen “kind”, I suspect I’d be worrying that “class” would have been better. At least I avoided “type”!)

  • An interesting variation of the above would be to have some kind of group-writable tag for class. I’ve been talking to Terry and some of the other Fluidinfo people about the idea of a wiki user for Fluidinfo. Lots of people would have write access to some or all tags/namespaces under the wiki user. If this happened, the idea would be that when you want to put personal data into Fluidinfo, you would obviously use your own namespace, but if you wanted to put factual/reference data in, you might do that using the wiki user. This would potentially solve the problem of knowing which tag to look for to find core information like a book’s author and title, a film’s director and year etc. As in Wikipedia, there would be potential for spam and other abuse, but at least in Fluidinfo we have permissions and policies, so we could be permissive in allowing people to write to the wiki namespaces but harsh in removing access for abusers. I’m not totally convinced by this idea, but I can see many virtues in it.

    If there were such a wiki user, and it had a class tag, the query might become:

    (wiki/class = "book" and rustlem/rating > 8)
    except terrycojones/has-read

  • If we happen to know that Russell chooses to tag his books (only) with a particular tag such as has-read, we can add a clause like that to the query:

    (has rustlem/has-read and rustlem/rating > 8)
    except terrycojones/has-read

    though in practice, it might be more likely that has-read would also be used for things like online articles.

  • If there were a recognized authoritative book user in Fluidinfo (perhaps isbn.org or amazon.com), we might be able to use the presence of one of its tags as an indicator. For example, if isbn.org were known to tag all (or enough) books with tags including isbn.org/book/title and isbn.org/book/author then we would be able to use something like

    (has isbn.org/book/title and rustlem/rating > 8)
    except terrycojones/has-read

    But this obviously depends on the existence of such an authoritative user.

  • If we had wildcarding on namespaces, we might also accept something like

    (has */book/title and rustlem/rating > 8)
    except terrycojones/has-read

    which would certainly add to the breadth of results, but might introduce quite a lot of extra noise, especially once the system has a lot of users.
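To make the wildcard idea concrete, here is a small Python sketch that tests an object's tag paths against a pattern such as */book/title. It is only an illustration of the matching, not of anything Fluidinfo offers: the has_tag helper is hypothetical, and it borrows the shell-style rules of Python's fnmatch module, in which * also matches across / separators.

```python
from fnmatch import fnmatchcase
from typing import Iterable


def has_tag(tag_paths: Iterable[str], pattern: str) -> bool:
    """Return True if any tag path on the object matches the wildcard pattern."""
    return any(fnmatchcase(path, pattern) for path in tag_paths)


# Tag paths on two hypothetical objects, using names mentioned in this post.
tagged_by_isbn = ["fluiddb/about", "isbn.org/book/title", "njr/rating"]
tagged_by_miro = ["fluiddb/about", "miro/books/title", "njr/rating"]

print(has_tag(tagged_by_isbn, "*/book/title"))  # isbn.org/book/title matches
print(has_tag(tagged_by_miro, "*/book/title"))  # miro uses "books", so no match
```

The second object is missed only because its namespace is called books rather than book, which is a small reminder that wildcards widen a query but still depend on shared naming conventions.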

2. Pulling Back the Right Information

The second interesting issue with Terry’s query is how you get the right information back. In the original Fluidinfo API, all the query could return for you is the object ID; you’d then have to browse the object and look at its tags to understand what it is.

The current API offers richer alternatives, allowing objects to be specified by about tag and allowing particular tags, including the about tag, to be requested for each object matching a query.

If the about tag itself contains the key information (in the case of a book, the title and author, normalized) then the query can pretty much work just by specifying the objects of interest and requesting the about tag. You can actually do this today. For example, the following fdb command (in which I’ve used njr instead of rustlem, since Russell hasn’t quite got around to rating anything in Fluidinfo yet) actually works:

fdb show -q '(fluiddb/about matches "book" and njr/rating > 8) except has terrycojones/has-read' /about
4 objects matched
Object 8cd6fd0a-40e1-4889-ac3f-2b3dbb6f861d:
  /fluiddb/about = "book:through the looking glass and what alice found there (lewis carroll)"
Object 1c3b1874-0413-4607-97db-74cb9c92dcbf:
  /fluiddb/about = "book:fugitive pieces (anne michels)"
Object c64aeced-1505-4bb3-ab8a-0ce4c6a70ba3:
  /fluiddb/about = "book:white teeth (zadie smith)"
Object a78d77ce-a055-40e3-97a9-de4223858bd8:
  /fluiddb/about = "book:nineteen eighty four (george orwell)"

It should be noted, however, that it works as well as it does partly because all the books I have rated have about tags in my preferred (book-1) convention, so that the title and author are obvious and the class (book) is in the about tag.

I would argue that this is pretty good and meaningful, though of course we would also want information to be stored in separate tags for the title, author and other useful things like year etc. This is true for most, if not all, of the books matched here, but you have to know that the title and author information is under the username miro. Again, trusted authoritative users with known conventions will help here, as might wildcarding on namespaces.

The wiki user would also help. It turns out that three of the four books this query matches have title and author information under the miro/books namespace, so if you add these tags to the fdb request, you get most of what you want.

fdb show -q '(fluiddb/about matches "book" and njr/rating > 8) except has terrycojones/has-read' /about /miro/books/title /miro/books/author
4 objects matched

Object 8cd6fd0a-40e1-4889-ac3f-2b3dbb6f861d:
  /fluiddb/about = "book:through the looking glass and what alice found there (lewis carroll)"
  /miro/books/title = "Through the Looking-Glass, and What Alice Found There"
  /miro/books/author = "Lewis Carroll"

Object 1c3b1874-0413-4607-97db-74cb9c92dcbf:
  /fluiddb/about = "book:fugitive pieces (anne michels)"
  (tag /miro/books/title not present)
  (tag /miro/books/author not present)

Object c64aeced-1505-4bb3-ab8a-0ce4c6a70ba3:
  /fluiddb/about = "book:white teeth (zadie smith)"
  /miro/books/title = "White Teeth"
  /miro/books/author = "Zadie Smith"

Object a78d77ce-a055-40e3-97a9-de4223858bd8:
  /fluiddb/about = "book:nineteen eighty four (george orwell)"
  /miro/books/title = "Nineteen Eighty-four"
  /miro/books/author = "George Orwell"

If the wiki user existed, this might be even more natural:

fdb show -q '(wiki/class matches "book" and njr/rating > 8) except has terrycojones/has-read' /about /wiki/title /wiki/author
4 objects matched

Object 8cd6fd0a-40e1-4889-ac3f-2b3dbb6f861d:
  /fluiddb/about = "book:through the looking glass and what alice found there (lewis carroll)"
  /wiki/title = "Through the Looking-Glass, and What Alice Found There"
  /wiki/author = "Lewis Carroll"

Object 1c3b1874-0413-4607-97db-74cb9c92dcbf:
  /fluiddb/about = "book:fugitive pieces (anne michels)"
  /wiki/title = "Fugitive Pieces"
  /wiki/author = "Anne Michels"

Object c64aeced-1505-4bb3-ab8a-0ce4c6a70ba3:
  /fluiddb/about = "book:white teeth (zadie smith)"
  /wiki/title = "White Teeth"
  /wiki/author = "Zadie Smith"

Object a78d77ce-a055-40e3-97a9-de4223858bd8:
  /fluiddb/about = "book:nineteen eighty four (george orwell)"
  /wiki/title = "Nineteen Eighty-four"
  /wiki/author = "George Orwell"

(Note: the above query does not work; there is no wiki user at present.)
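Because the about tags shown in this section carry the title and author, a client could also recover them directly, even for an object with no separate title and author tags. The Python sketch below assumes the layout book:title (author), which is inferred only from the examples above rather than from a full statement of the book-1 convention; the function name is hypothetical. The values come back in their normalized (lower-case, punctuation-stripped) form, so this is no substitute for properly stored title and author tags.

```python
import re
from typing import Optional, Tuple

# Assumed layout, inferred from the examples in this post: book:title (author)
BOOK_ABOUT = re.compile(r"^book:(?P<title>.+) \((?P<author>[^()]+)\)$")


def parse_book_about(about: str) -> Optional[Tuple[str, str]]:
    """Split an about tag into (title, author), or return None if it doesn't fit."""
    m = BOOK_ABOUT.match(about)
    if m is None:
        return None
    return m.group("title"), m.group("author")


parsed = parse_book_about("book:fugitive pieces (anne michels)")
print(parsed)
```

An about tag that does not follow the assumed layout simply yields None, so a script could skip such objects rather than guess.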

3. Tag Conventions

As well as the considerations around object class and pulling back the desired information, there is a final requirement of knowing what tags actually to include in the query, and what they mean. I’ve written about this before, but will recap briefly.

Essentially, we need to know

  1. what the tag is called
  2. what range of values it uses and which end of the scale is better.

For a particular friend, this is not too hard to find out (though it is still much easier if that friend follows a common convention), but the real payoff comes if essentially everyone uses the same conventions. If everyone uses the same conventions we can imagine a future API might support queries like

(fluiddb/about starts-with "book:" and */rating > 8)
except terrycojones/has-read

or perhaps

(fluiddb/about starts-with "book:" and [njr, rustlem, ntoll]/rating > 8)
except terrycojones/has-read

or even

(fluiddb/about starts-with "book:" and mean(*/rating) > 8)
except terrycojones/has-read

if we simply trust that everyone is using a 0–10 scale for ratings. If we are worried about individuals skewing the system by using ratings on a 0–1,000,000 scale, by adding some query complexity we can even imagine filtering those out, though I concede this will always be somewhat painful.
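The filtering step might look something like the following client-side Python sketch. It is not a Fluidinfo feature: the mean_rating function is hypothetical, the ratings are invented, and of the usernames only njr and rustlem come from this post. It discards any rating outside the agreed 0–10 scale before taking the mean, which is roughly what a guarded version of mean(*/rating) would have to do.

```python
from statistics import mean
from typing import Dict, Optional


def mean_rating(ratings: Dict[str, float],
                low: float = 0, high: float = 10) -> Optional[float]:
    """Mean of the ratings that lie on the agreed scale; None if none do."""
    in_range = [r for r in ratings.values() if low <= r <= high]
    return mean(in_range) if in_range else None


# Hypothetical ratings for one book, keyed by the user who owns the rating tag.
ratings = {"njr": 9, "rustlem": 10, "someone-else": 1_000_000}

score = mean_rating(ratings)
print(score, score is not None and score > 8)
```

Dropping out-of-range values is the crudest possible policy; it does nothing about a user who rates on a 0–5 scale, which is exactly why a shared convention is preferable to after-the-fact filtering.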

The Challenge

It’s the same old themes, but here I’m trying to illustrate just how directly all the Fluidinfo themes I keep harping on about relate to some of the most fundamental motivations for building this system in the first place. For those who believe all the stuff about conventions is a distraction that doesn’t matter, my challenge is: how else are we going to allow Terry’s query actually to work in Fluidinfo?

One possible answer lies within applications. Clearly, if a FluidBook application appears, that has its own conventions (whatever they may be) for storing information in Fluidinfo, it may be easy to get consistency within the data from that particular application. But Terry’s real dream involves allowing users to choose which applications to use and for data to be shared seamlessly across those applications. Again, it’s obvious that a few applications can agree conventions among themselves, but the most universal way of supporting interoperability is if conventions just exist at the level of Fluidinfo itself.

A wiki user might also provide another way forward.


  1. Anonymous 29/4/11 03:59

    Please no wiki user. Attribution is key. A wiki user throws away the power of having 'facts' and 'opinions' attributed to someone/something specific that you can always decide to ignore later.

    Let the data be the data. Leave the higher order logic of what matters to applications 'above' the Fluid layer.


  2. Terrell

    Interesting. You'll hate my next blog post.

    You might be right; I'm certainly in two minds. But I think there are some merits in the scheme, though to make it work well, some extra infrastructure would help, some of which might partly address some of your concerns.

    Or might not!

  3. Anonymous 29/4/11 17:17

    Oh, bother. You're right - the next post is all wrong. :)

  4. Wow. All wrong is pretty wrong.

    So not only is a wiki user not the right solution: there isn't even a problem?

    I'm not totally convinced the wiki solution is right, but I'm pretty convinced there's a problem.

    Anyway, please do comment on the other post if you have the time/inclination; even if it is only to say "this is all wrong"...

  5. Personally, I like the idea of being able to use wild cards. They can return a lot of results, but that doesn't mean they're not useful. However, in your example above, I can see where it would return duplicate results if people are using the same conventions in namespaces and tags.

    Having an authoritative user is also an interesting idea, but then I would wonder how authoritative they actually are. It would probably take a person or group willing to find books in Fluidinfo and place the appropriate tags on them. They should also be willing to have books pointed out to them, in case they've overlooked something. This might be like the die-hard Wikipedia users.

    That said, if an object's about tag follows the book-1 or book-u convention, it should be possible to create a script to look for these objects and apply the namespace/book/title and namespace/book/author tags to them.

    I think a wiki namespace that anyone can use would become prey for spambots, unless someone kept on top of what was being done with it.

  6. Michael H

    All good points. I discuss the wiki user idea a bit more in the next post, and I definitely think some infrastructure would be needed to support it, and that the best implementation for Fluidinfo would require use of the permissions system to control spam.

    I think wildcarding namespaces will come; it's definitely under active discussion within the team.