About Tag: abouttag

15 December 2011

Fragmentation and URL Normalization

I have updated the abouttag.py library to use a new, better convention for normalizing URLs. The two main changes people will notice are:

URLs that represent directories will now include, rather than exclude, a trailing slash:
http://fluidinfo.com/
rather than
http://fluidinfo.com
There is now a dependency on the excellent urlnorm.py, by Jehiah Czebotar.

The Issue: Fragmentation

The twin evils that the abouttag.py library and this blog exist to fight are fragmentation and overloading.

Fragmentation occurs in Fluidinfo when different users store information about the same thing on different objects, while overloading occurs when people store information about different things on the same object. In general, both of these are undesirable. Fragmentation reduces data sharing and makes it harder to extract information from the system, whereas overloading creates ambiguity and confusion.

One of the more common uses for Fluidinfo is for tagging web pages, and it is very natural to use the URL as the about tag, as almost everyone does. There is not much of a problem with overloading in this case (except to the extent that URLs point to web pages that change over time), but there is definitely fragmentation.

I would distinguish between two kinds of fragmentation in the case of URLs.

Different representations of the same URL. Perhaps the most obvious example is the trailing slash on many URLs. Punctilious persons with good knowledge of web standards (and in particular RFC3986) prefer the inclusion of a trailing slash on URLs (and more generally, on URIs) where appropriate, and thus prefer
```
http://fluidinfo.com/
```
to the more colloquial
```
http://fluidinfo.com
```
Technically, these are different URLs, but web servers so routinely and uniformly redirect the latter to the former that they can be considered for all practical purposes the same. It seems highly desirable for any convention for about tags for URLs to map these two forms, along with other similar representational variants, to a common about tag.
Different URLs that may or may not represent the same web page. The most obvious example of this is the www. that used to be de rigueur and is now commonly (but not reliably) redundant. Although most right-thinking webmasters (webmistresses?) routinely redirect these to the same place, there is no general guarantee that the www. form (http://www.fluidinfo.com/) and the bare form (http://fluidinfo.com/) will produce the same page, nor even that they should both work.

Standardizing this would therefore seem to be a normalization too far.

The Old and New Behaviour of abouttag.py

Fluidinfo is far from the only system with an interest in developing a canonical or normalized form for URLs. Search engines and social bookmarking sites (such as Pinboard and Delicious) work better if different URLs representing the same resource are collapsed, and as mentioned above, there is even a standard (RFC3986) for how to perform the canonicalization.

The relevant Wikipedia page describes six normalizations that preserve URL semantics. These are:

Converting the scheme and host to lower case. (HTTP:// → http:// and FLUIDINFO.COM → fluidinfo.com).

Capitalizing letters in escape sequences (%3a → %3A)

Decoding percent-encoded octets of unreserved characters (%7E → ~)

Adding a trailing slash where appropriate (http://fluidinfo.com → http://fluidinfo.com/)

Removing the default port (http://fluidinfo.com:80/ → http://fluidinfo.com/)

Removing dot-segments (http://fluidinfo.com/accounts/./new/ → http://fluidinfo.com/accounts/new/)
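
To make the six steps concrete, here is a minimal sketch in Python 3 using only the standard library. It is not urlnorm.py and not the code in abouttag.py; the function names are made up for illustration, and it assumes an absolute URL with no user-info part and no IPv6 host.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Characters that RFC 3986 calls "unreserved".
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
DEFAULT_PORTS = {"http": 80, "https": 443}


def fix_escapes(text):
    """Decode escapes of unreserved characters; upper-case all others."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else "%" + match.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", replace, text)


def remove_dot_segments(path):
    """Resolve '.' and '..' segments in an absolute path."""
    segments = path.split("/")
    kept = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment in (".", ".."):
            if segment == ".." and len(kept) > 1:
                kept.pop()       # step back one directory
            if is_last:
                kept.append("")  # a final dot segment leaves a trailing slash
        else:
            kept.append(segment)
    return "/".join(kept)


def normalize_url(url):
    """Apply the six semantics-preserving normalizations (illustrative only)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        host = "%s:%d" % (host, port)   # keep only non-default ports
    path = remove_dot_segments(fix_escapes(parts.path)) or "/"
    return urlunsplit(
        (scheme, host, path, fix_escapes(parts.query), parts.fragment)
    )


print(normalize_url("HTTP://FLUIDINFO.com:80"))
print(normalize_url("http://fluidinfo.com/accounts/./new/"))
```

A real library handles many more corner cases (internationalized hosts, user-info, relative references), which is a good reason to depend on one rather than on a sketch like this.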

Happily, libraries to perform these normalizations already exist and are freely available for a number of programming languages, including Python. As noted above, Jehiah Czebotar’s urlnorm.py performs the task admirably in Python, so in the version of abouttag.py that I just pushed to Github (version 0.6) I have added a new convention, uri-2, corresponding to this behaviour and have made that the default. So now:

>>> from abouttag.uri import URI

>>> URI(u'http://fluidinfo.com')
u'http://fluidinfo.com/'

>>> URI(u'HTTP://FLUIDINFO.com:80')
u'http://fluidinfo.com/'

>>> URI(u'http://fluidinfo.com/a/./b/?arg=%7Ealice')
u'http://fluidinfo.com/a/b/?arg=~alice'

This is different from the old behaviour, which can be obtained by explicitly adding a convention argument of ‘uri-1’:

>>> URI(u'http://fluidinfo.com', convention=u'uri-1')
u'http://fluidinfo.com'
# note no trailing slash

>>> URI(u'HTTP://FLUIDINFO.com', convention=u'uri-1')
u'http://fluidinfo.com'
# Same downcasing, but again no trailing slash

>>> URI(u'http://fluidinfo.com:80', convention=u'uri-1')
u'http://fluidinfo.com:80'
# uri-1 didn't strip default ports

>>> URI(u'http://fluidinfo.com/a/./b/?arg=%7Ealice', convention='uri-1')
u'http://fluidinfo.com/a/./b/?arg=%7Ealice'
# nor did it undo unnecessary %-encoding or strip . & .. path segments.

Both the new and the old versions perform one additional normalization, which is to add a leading http:// if no scheme is present in the input. This is not because there is not a distinction between a domain and a URL, but rather because by calling the URI function the user is clearly indicating that this is a URI, which requires a scheme, and http:// is clearly the appropriate default scheme:

>>> URI(u'fluidinfo.com')
u'http://fluidinfo.com/'
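
The scheme-defaulting step can be sketched separately. This is a hypothetical helper in Python 3, not the library's own code, and it only looks for a `://` separator, so scheme-only forms such as mailto: are outside its scope. The trailing slash in the example above comes from the normalization step, not from this one.

```python
def with_default_scheme(text, default="http"):
    """Prefix a scheme when the input has none (hypothetical helper)."""
    text = text.strip()
    if "://" in text:
        return text          # a scheme is already present; leave it alone
    return "%s://%s" % (default, text)


print(with_default_scheme("fluidinfo.com"))
```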

Why...?

The reader may be wondering why I did not adhere to the RFC previously, and issued forth older versions of the abouttag library with the altogether inferior behaviour of uri-1. Ignorance, pure and simple.

07 July 2011

About Tags In Fish

I’ve added a new command to fish (and updated the online version, Shell-Fish, accordingly) to allow easy construction of standardized about tags using the conventions from the abouttag library. It makes use of a new abouttag function, available in the new generic.py file in the abouttag library, which takes the object type as its first parameter, and the usual parameters as a variable parameter list.
The new fish command is abouttag, though it can also be abbreviated to about, and its general form is:

fish abouttag <object type> <object specifiers>

The object type is something like book, album or fi-user and the object specifiers are the key parameters used to describe that object, in the same order as they are used in the corresponding function from the abouttag library.
The easiest way to illustrate and define these is with examples. The following examples are taken from a Unix system; on Windows, use double quotes rather than single around parameters. In the online version (Shell-Fish), and on Unix, single or double quotes work. In the online version, you don’t need the fish prefix (though it does work).
I should note that part of the motivation for adding this functionality is a desire to allow the command to be used to specify objects without knowing the exact form of their about tags. In Unix-like systems (Linux, Mac OS X, Solaris etc.), this is possible by using left quotes, which can be placed inside double quotes. Thus, the following, slightly ungainly command (using all three forms of quote) works, at least in bash:

$ fish show -F -a "`fish abouttag book 'Gödel, Escher, Bach: An Eternal Golden Braid' 'Douglas R. Hofstader'`" njr/rating
Object with about="book:gödel escher bach an eternal golden braid (douglas r hofstader)":
  njr/rating = 10

I will leave it to the reader to judge whether this is easier than using cut and paste. For those who don’t know about left quotes in Unix shells, a command enclosed in left quotes within another command is evaluated before its enclosing command; its output replaces the left-quoted phrase on the original command line. So in the case above, we first run the command

fish abouttag book 'Gödel, Escher, Bach: An Eternal Golden Braid' 'Douglas R. Hofstader'

which generates

book:gödel escher bach an eternal golden braid (douglas r hofstader)

as its output. In effect, the outer command is then transformed to

fish show -F -a "book:gödel escher bach an eternal golden braid (douglas r hofstader)" njr/rating

I hope to extend shell-fish, the on-line version of fish, to support left quotes, but that may take a little while.
The following examples are taken from the fish documentation, which is available online from http://fluiddb.fluidinfo.com/about/fish/fish/index.html.

Books and related items using the book-u convention (book, author)

$ fish abouttag book 'Gödel, Escher, Bach: An Eternal Golden Braid' 'Douglas R. Hofstader'
book:gödel escher bach an eternal golden braid (douglas r hofstader)

$ fish abouttag book 'The Feynman Lectures on Physics' 'Richard P. Feynman' 'Robert B. Leighton' 'Matthew Sands'
book:the feynman lectures on physics (richard p feynman; robert b leighton; matthew sands)

$ fish abouttag book 'The Oxford English Dictionary: second edition, volume 3', 'John Simpson', 'Edmund Weiner'
book:the oxford english dictionary second edition volume 3 (john simpson; edmund weiner)

$ fish abouttag author 'Douglas R. Hofstadter' 1945 2  15
author:douglas r hofstadter (1945-02-15)

Music-related items (track, album, artist, isrc-recording)

$ fish abouttag track 'Bamboulé' 'Bensusan and Malherbe'
track:bamboulé (bensusan and malherbe)

$ fish abouttag album 'Solilaï' 'Pierre Bensusan'
album:solilaï (pierre bensusan)

$ fish abouttag artist 'Crosby, Stills, Nash & Young'
artist:crosby stills nash & young

$ fish abouttag isrc-recording 'US-PR3-73-00012'
isrc:USPR37300012

URLs and URIs (URI, URL)

$ fish abouttag uri FluidDB.fluidinfo.com
http://fluiddb.fluidinfo.com

$ fish abouttag url https://FluidDB.fluidinfo.com/one/two/
https://fluiddb.fluidinfo.com/one/two

$ fish abouttag URI http://fluiddb.fluidinfo.com/one/two/
http://fluiddb.fluidinfo.com/one/two

$ fish abouttag URL 'http://test.com/one/two/?referrer=http://a.b/c'
http://test.com/one/two/?referrer=http://a.b/c

Fluidinfo objects (fi-user, fi-namespace, fi-tag)

$ fish abouttag fi-user njr
Object for the user named njr

$ fish abouttag fi-namespace njr/misc
Object for the namespace njr/misc

$ fish abouttag fi-ns njr/private
Object for the namespace njr/private

$ fish abouttag fi-tag terrycojones/private/rating
Object for the attribute terrycojones/private/rating

Database components (db-table, db-field)

$ fish abouttag db-table 'elements'
table:elements

$ fish abouttag db-field 'name' 'elements'
field:name in table:elements

Miscellaneous (planet, element)

$ fish abouttag planet 'Mars'
planet:Mars

$ fish abouttag element 'Helium'
element:Helium
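
The idea of a single function that takes the object type first and the specifiers as a variable parameter list can be sketched as a dispatch table. This is a hypothetical Python 3 illustration, not the contents of generic.py; it covers only the four simplest object types shown above, whose output needs no normalization.

```python
def planet(name):
    return "planet:%s" % name


def element(name):
    return "element:%s" % name


def db_table(table):
    return "table:%s" % table


def db_field(field, table):
    return "field:%s in table:%s" % (field, table)


# Map each object type to the function that builds its about tag.
BUILDERS = {
    "planet": planet,
    "element": element,
    "db-table": db_table,
    "db-field": db_field,
}


def abouttag(object_type, *specifiers):
    """Dispatch on the object type; pass the remaining arguments through."""
    try:
        builder = BUILDERS[object_type.lower()]
    except KeyError:
        raise ValueError("unknown object type: %r" % object_type)
    return builder(*specifiers)


print(abouttag("db-field", "name", "elements"))
```

Adding a new convention then means writing one builder function and registering it in the table.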

09 June 2011

The Music of Fluidinfo II

I pushed an update to the abouttag.py library last night; it now includes some conventions and normalization for some music-related items. This supports work by Eric Seidel (@gridaphobe), who is looking at importing data from MusicBrainz to Fluidinfo.

These are similar (but not identical) to the ideas I proposed previously in the post The Music of FluidDB I: Albums, Tracks and Songs.

The first two kinds of things covered are works: named albums and named tracks respectively. As with books, this conceptual work seems like the single most important level to represent in Fluidinfo. So there may be many different issues, editions and pressings of The Dark Side of the Moon by Pink Floyd, but there is only one work. Even more clearly, Billie Holiday may have recorded God Bless the Child a number of times (I’m playing through several as I type this), but there is only one conceptual work God Bless the Child sung by Billie Holiday.

Albums (as works). The convention for this, called album-u, is similar, but not identical to the convention for books, having the general form
album:name of album (artist)
The name of the album and the artist are normalized using the usual normalize function from abouttag.py, removing some punctuation, regularizing spacing and converting to lower case, but not removing accents. (I think it’s increasingly clear that removing accents was a mistake in book-1; I’m now recommending using the related book-u convention, which is like book-1 except that it preserves accents.)

The other main difference between the book-u convention and the album-u convention is that in the case of books, multiple authors are consolidated into a standard list, separated by semicolons (in fact, a semicolon followed by a space). This works less well for music, where artists take more varied forms: Diana Ross and the Supremes, Pink Floyd, John Renbourn and Stefan Grossman, Crosby, Stills, Nash & Young etc. Since MusicBrainz, in particular, has a well-defined artist field, which is supposed to be the official recording credit, Eric is just planning to standardize that.

Example usage is:
from abouttag.music import album

print album(u"Solilaï", u'Pierre Bensusan')
print album(u"Déjà   Vu", u'Crosby, Stills, Nash & Young')
producing
album:solilaï (pierre bensusan)
album:déjà vu (crosby stills nash & young)
Of course, we may also create objects for particular releases of an album, but those would use a different convention.
Named Tracks / Recorded Songs. These have a form that is identical to albums except that they use track: as the prefix. Again, two different recordings of the same song get consolidated into a single (conceptual) track (convention track-u).

Example usage is:
from abouttag.music import track

print track(u'Bamboulé', u'Bensusan and Malherbe')

print track(u'''Archie Campbell/Marjorie Campbell/Miss Lyall's '''
            u'''Strathspey/Miss Lyall's Reel/The St Kilda Wedding''',
            u'The Cast')
producing
track:bamboulé (bensusan and malherbe)
track:archie campbell marjorie campbell miss lyalls strathspey miss lyalls reel the st kilda wedding (the cast)
Recordings. In the case of conceptual tracks, it is particularly clear that the same artist may record the same track several times. Happily, there is a standard identifier for such recordings of tracks, the International Standard Recording Code (ISRC). This is a 12-character code, usually formatted for print as CC-XXX-YY-NNNNN, where CC is a registrant country code, XXX is a registrant code, YY is the last two digits of the registration year and NNNNN identifies the recording.

Despite minor misgivings on my part (I would have chosen to keep the dashes, since Fluidinfo generally favours humans over machines), we have chosen to standardize this in the form isrc:CCXXXYYNNNNN.

Examples:
from abouttag.music import isrc_recording

print isrc_recording(u'US-PR3-73-00012')
print isrc_recording(u'uspr37300012')
produces
isrc:USPR37300012
isrc:USPR37300012
Artist. The artist is simply identified by artist:name, where name is normalized as usual. Since accents are preserved, metal fans need not fear for their heavy metal umlauts (a.k.a. röck döts). For example:
from abouttag.music import artist

print artist(u"Crosby, Stills, Nash & Young")
print artist(u"Motörhead")
produces:
artist:crosby stills nash & young
artist:motörhead
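
For readers who want to see roughly what the normalization involves, here is a minimal sketch in Python 3. It is not the normalize function from abouttag.py: the rules (slashes and hyphens become spaces, other punctuation except & is dropped, case is lowered, accents are kept) are inferred from the examples in this post, and the function bodies are assumptions made for illustration.

```python
import re
import unicodedata


def normalize(text):
    """Lower-case, drop most punctuation, collapse spaces, keep accents."""
    text = unicodedata.normalize("NFC", text).lower()
    text = re.sub(r"[/\-_]", " ", text)     # separators become spaces
    text = re.sub(r"[^\w\s&]", "", text)    # drop other punctuation, keep &
    return " ".join(text.split())           # regularize spacing


def album(title, artist_name):
    return "album:%s (%s)" % (normalize(title), normalize(artist_name))


def track(title, artist_name):
    return "track:%s (%s)" % (normalize(title), normalize(artist_name))


def artist(name):
    return "artist:%s" % normalize(name)


def isrc_recording(code):
    """Strip separators and upper-case a 12-character ISRC."""
    cleaned = re.sub(r"[^0-9A-Za-z]", "", code).upper()
    if len(cleaned) != 12:
        raise ValueError("an ISRC has 12 characters: %r" % code)
    return "isrc:" + cleaned


print(album("Déjà   Vu", "Crosby, Stills, Nash & Young"))
print(isrc_recording("US-PR3-73-00012"))
```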

Disambiguation, search and the related-to tag

I’ll do a longer post on this shortly, but an interesting and useful metaconvention is emerging within Fluidinfo.

I think there is a slightly reluctant but growing consensus within the Fluidinfo community that it makes sense to use about tags that exhibit pretty good uniqueness—I’ve suggested that at worst we should probably aim for conventions that mean that when we deal with a million objects we are unlikely to get a clash in about tags, not just within the list but with anything else anyone is ever likely to put into Fluidinfo.

At the same time, there is natural desire to make it easy for humans to navigate Fluidinfo via the natural about tags. Mercury the planet and mercury the element share the natural about tag mercury, but there is clearly a problem if we just stick information (like mass, or radius) on the object with about tag mercury. The about tags planet:mercury and element:mercury are much less ambiguous. (Unfortunately, in the early days I actually used planet:Mercury and element:Mercury; in my copious spare time, I’ll probably move the data over to the lower-case versions).

The emerging meta-convention is:

Store data on the most specific, unambiguous object reasonably available

book:animal farm (george orwell)

album:dark side of the moon (pink floyd)

element:mercury etc.

Help humans by also using the more natural, ambiguous objects

animal farm

dark side of the moon

mercury

as disambiguation nodes (à la Wikipedia), adding appropriate related-to tags that point to the various more specific items. The convention for these is that they are sets of about tag strings.

As an example, MusicBrainz might add a tag

musicbrainz.org/related-to=["album:dark side of the moon (pink floyd)"]

to the object with about tag dark side of the moon.

Similarly, the Fluidinfo object with the about tag money might find itself tagged with (for example):

musicbrainz.org/related-to=["track:money (pink floyd)", "track:money (the flying lizards)",
                            "track:money (michael jackson)", "album:money (kmfdm)"]
miro/books/related-to=["book:money (martin amis)"]
imf.org/related-to=["economics:money"]

and so forth. This looks like it could have legs.
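
To show how a disambiguation node might be used, here is a small sketch in Python 3. A plain dictionary stands in for Fluidinfo (no API calls are made), and the candidates function is a hypothetical helper that collects every related-to set on an ambiguous object, whoever owns the tag.

```python
# An in-memory stand-in for Fluidinfo: about tag -> {tag path: value}.
objects = {
    "money": {
        "musicbrainz.org/related-to": {
            "track:money (pink floyd)",
            "track:money (the flying lizards)",
        },
        "miro/books/related-to": {"book:money (martin amis)"},
    },
}


def candidates(store, about):
    """Return the specific about tags that an ambiguous object points to."""
    found = set()
    for tag_path, value in store.get(about, {}).items():
        if tag_path.endswith("/related-to"):
            found |= set(value)
    return sorted(found)


for about in candidates(objects, "money"):
    print(about)
```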

The key principle is: put the data on a specific object at the right conceptual level, disambiguating as early as possible, i.e. don’t start by assuming that money could only refer to the Pink Floyd song, and use a convention that works well for all objects in the class of interest. Similarly, if you want to put data that’s only applicable to a particular recording of that song, stick it on a more specific object (in this case, perhaps the isrc: object), probably adding a related-to tag pointing from the track: object to the various isrc: objects.

My only problem with this convention is that the term related-to sounds symmetrical, whereas it’s being used in a fairly directed way. Perhaps could-refer-to would be better.

24 May 2011

A Search Engine for Fluidinfo

I wrote an extremely simple search front-end for Fluidinfo which you can access at http://abouttag.appspot.com/search.

It is extremely simple. You type one or more search terms into the box and it “searches” Fluidinfo about tags for those terms.

For example, here’s what happens if you type in solitude:

and here’s what happens if you type in marquez book:

Here’s what you need to know:

All this does is turn the search into a values query on Fluidinfo that ANDs together the search terms (after white-space stripping). So the query parts for these two searches become
fluiddb/about matches "solitude"
and
fluiddb/about matches "marquez" AND fluiddb/about matches "book"
respectively.
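
In code, the translation from search box to query is roughly the following. This is a Python 3 sketch, not the application's source; build_query is a made-up name, and it assumes the search terms contain no double quotes.

```python
def build_query(search_text):
    """AND together one 'matches' clause per whitespace-separated term."""
    terms = search_text.split()   # also strips surrounding white space
    return " AND ".join(
        'fluiddb/about matches "%s"' % term for term in terms
    )


print(build_query("solitude"))
print(build_query("marquez book"))
```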
I don’t fully understand Fluidinfo’s string matching, which is based on Lucene, but it is fairly search-engine like. I think the following is true:

case is ignored in matching

punctuation is discarded

only whole-words match

accented characters match themselves (case insensitively) and not their non-accented counterparts, and vice versa. So café matches CAFÉ but not cafe and CAFE matches cafe but not CAFÉ. (This was broken when this was originally posted, but is fixed now.)

If we’re lucky, Manuel (@ceronman) or Esteve (@esteve) might add clarification in the comments, which I will promote to here if appropriate.

Consequences of the above include:

You can’t search on prefixes like film:, because the punctuation is discarded (though you can search on film and it will match things containing film:)

There is no stemming or substring matching, so soli won’t match solitude etc.

At the moment a maximum of 100 results are returned and there is no paging implemented; I plan to add that soon.

Result order is essentially random. If I implement paging, I will probably sort them. My first thought is to sort them as shortest-to-longest, with an alphabetical subsort to break ties. (Comments?)
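
The proposed ordering is a one-liner in Python 3. This is only a sketch of the idea: sort_results is a hypothetical name, and "alphabetical" here means plain code-point order, which may not be what a human expects for accented or mixed-case tags.

```python
def sort_results(about_tags):
    """Shortest first; ties broken by ordinary string comparison."""
    return sorted(about_tags, key=lambda tag: (len(tag), tag))


print(sort_results(["mercury", "zz", "book:b", "aa"]))
```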

Various links are returned for each matching object.

The main link points to the raw Fluidinfo object, accessed through /about. This will show you its tags as a JSON dump.

The object’s ID is shown underneath, and that links to the raw object in Fluidinfo, this time through /objects.

Links to both the butterfly and daisy visualizations from http://abouttag.com are provided.

Finally, a link to the object in P A Parent’s Fluidinfo Explorer is given.

I thought about adding a curl link too, that would show the syntax for accessing the object with curl (cURL, if you prefer), but I couldn’t really think of a neat way of doing it; a link to a one-line page seems over the top and I hate pop-ups. I suppose some kind of javascript manipulation to show the curl text below would be a possibility. Let me know if you would find this useful.

Like the rest of the About Tag site, the application is built on Google’s App Engine. Unfortunately, this implements a time-out after 10 seconds on all HTTP requests, and even more unfortunately, some searches in Fluidinfo take more than 10 seconds. If you see a time-out, that’s probably what’s happening. This is usually because too many results are being returned. Unfortunately, Fluidinfo does not implement any form of paging or limiting of results at the moment, so the only way round this is to write a more specific query that will have fewer results.

For example, at the moment, when I search on book, it consistently times out; if I instead search on book orwell, it consistently works.

There’s not much I can do about this: the Fluidinfo team is working hard on making Fluidinfo faster and is (I believe) actively considering implementing some kind of paging mechanism.

At the moment, only the about tag (fluiddb/about) is searched, (which is, I suppose, appropriate for this blog/site). It would be very easy for me to provide other interfaces. One obvious thing would be to allow the user to select the tag searched, and another would be to allow a full Fluidinfo query to be typed. If there’s interest, I can do these.

If you want to jump straight to results, you can just add a ?q=terms to the end of the search URL (http://abouttag.appspot.com/search). For example, http://abouttag.appspot.com/search?q=george+orwell will reveal what Fluidinfo knows about the great man. Use + to separate search terms in the URL or, if you prefer, use percent encoding.
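
If you are generating such links from a program, the standard library will do the encoding for you. A minimal Python 3 sketch follows; search_url is a made-up helper, and only the base URL and the q parameter come from the description above.

```python
from urllib.parse import urlencode

SEARCH_URL = "http://abouttag.appspot.com/search"


def search_url(terms):
    """Join terms with spaces; urlencode turns spaces into '+'."""
    return SEARCH_URL + "?" + urlencode({"q": " ".join(terms)})


print(search_url(["george", "orwell"]))
```

Non-ASCII terms come out percent-encoded as UTF-8, which is the other form mentioned above.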

This was implemented extremely quickly, and has only been tested very briefly. Let me know if you find problems, whether you find it useful, if you’d like any of the other versions etc.

05 April 2011

Pretty Good Uniqueness

Software developers are neurotic about uniqueness: no two files may share the same path, no two users the same ID. That’s probably good: we like money and email to go to the right person.

Over in the Real World™, people are more relaxed. We tolerate quite a lot of ambiguity, relying partly on context to remove it, and partly on clarification when necessary–“Paris, France, not Paris, Texas”. We even tolerate a certain level of confusion and error as a reasonable price to pay for not always having to refer to each other by National Insurance number.

Terry Jones (not the Python, nor the Qu’ran burning pastor, but @terrycojones, the unorthodox visionary behind Fluidinfo) frequently says that he wants to make working with information in computers more like working with information in the Real World™. It’s a useful goal.

Almost from the first moment I heard about Fluidinfo, with its model of information sharing based on tagging common objects, I’ve been interested in (some might say obsessed with) the question of how to map Real-World™ objects and concepts (like Paris, Animal Farm, The Eiffel Tower, Existential Philosophy and the ring on my finger) to Fluidinfo objects, romantically identified, as they are, by 128-bit integers (hubristically so-called ‘universally unique identifiers’ [UUIDs]) such as 6387ab3f-e3d5-4ca9-bd13-ae3ffd9c1830.

Fluidinfo’s about tag (fluiddb/about, to give it its full name) was created specifically to make it easier to decide where to put information in Fluidinfo. Every object in Fluidinfo, when it’s created, can optionally have this about tag set to a unicode string and Fluidinfo guarantees that about tags are unique, i.e. that no two different objects will ever share an about tag. As a result, you can directly address objects in Fluidinfo by specifying an about tag. For example, http://fluiddb.fluidinfo.com/about/Paris is the URL for the Fluidinfo object with the about tag “Paris” (UUID 17ecdfbc-c148-41d3-b898-0b5396ebe6cc, since you ask).
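
Turning an about tag into that kind of URL is mostly a matter of escaping. Here is a minimal Python 3 sketch; about_url is a hypothetical helper, the base URL is the one quoted above, and the choice to percent-encode everything outside the unreserved set is my assumption, not a documented Fluidinfo requirement.

```python
from urllib.parse import quote

ABOUT_BASE = "http://fluiddb.fluidinfo.com/about/"


def about_url(about):
    """Build the /about URL for an about tag, escaping as UTF-8."""
    return ABOUT_BASE + quote(about, safe="")


print(about_url("Paris"))
print(about_url("animal farm"))
```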

Fluidinfo, by Terry’s very specific design, does not force anyone to use about tags in any specific way. Any Fluidinfo user can attach any information to any Fluidinfo object she likes. If user jacqui decides to attach information about Paris, Texas to the Paris object above, and gemma chooses to use it to store information about Paris, France, that is entirely fine. It’s even fine if Fluidinfo user anarchist decides to store information about Birmingham (Alabama), or existential philosophy, or her entire record collection on the same object. There will be no one from Fluidinfo complaining or banning or undoing (though it’s possible that those with acute hearing may perceive a quiet “tsk, tsk” sound emanating from the author of this blog).

I believe, however, that most Fluidinfo users will want there to be conventions for about tags that will encourage information about the same thing to be stored on a well-defined common object, and for information about different things to be stored on different objects. Of course, we won’t always get those conventions right first time, and they will evolve over time, but my feeling is that a few hours of thinking can avoid many, many hours of trial and error. The question is: what should those conventions be?

My feeling is that what we need to aim for is “pretty good uniqueness”, a concept that might be compared loosely to “pretty good privacy” or “probably approximately correct” learning. I don’t have a formal definition, nor even a very good rule of thumb, but I think we need to aim for a set of about tag conventions that are easy to use and which mean that collisions are very rare, but that we should not aim for absolute uniqueness, as to do so would lead inexorably to conventions that are much less appealing to humans. In other words, we should aim to make about tag conventions lie in a sweet spot somewhere between the computer programmer’s “absolute, guaranteed, uniqueness in all circumstances” and the Real-World™, human-style “let’s not worry about it too much and just deal with collisions when they occur”.

The nearest I have to a rule of thumb is that when you’re uploading a reasonably large quantity of data to Fluidinfo (say, some tens of thousands of objects), most of the time, you should not encounter a conflict. I’m not sure how to quantify this. If 1% of items have conflicting about tags, I’m pretty clear that this is much too high a collision rate. And I’m pretty clear that 1-in-a-billion is OK. My guess is that it is probably good enough to aim for collision rates below about 1-in-a-million. But that’s just a feeling.
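
A little arithmetic shows what these rates mean for an upload of that size. The sketch below is Python 3, takes 50,000 objects as a stand-in for "some tens of thousands", and assumes (simplistically) that each object collides independently with the same probability.

```python
def expected_conflicts(n_objects, collision_rate):
    """Average number of conflicting about tags in one upload."""
    return n_objects * collision_rate


def chance_of_any_conflict(n_objects, collision_rate):
    """Probability that at least one object in the upload collides."""
    return 1.0 - (1.0 - collision_rate) ** n_objects


N = 50000  # an assumed upload size, for illustration only
for rate in (1e-2, 1e-6, 1e-9):
    print(rate, expected_conflicts(N, rate), chance_of_any_conflict(N, rate))
```

At 1% the upload hits around 500 conflicts; at 1-in-a-million the expected number is about 0.05, so most uploads of this size see none, which matches the rule of thumb.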

This can be made more concrete with some examples. One convention I suggested that seems to be being used quite widely and successfully is for books (as works, rather than individual editions, printings etc.). The basic form of this is to combine a ‘book:’ prefix with a normalized title and author. The normalization aims to remove ambiguity with case, punctuation etc., to make it more likely that different people will arrive at the same about tag, without significantly affecting uniqueness or legibility. So an example about tag for a book is:

book:nineteen eighty four (george orwell)

Notice that the (troublesome) hyphen that we would normally include when writing “nineteen eighty-four” has been removed, as have capitals (there’s a library available to do the standardization, which can be used in Python or online).

[The original version of the convention (book-1) also removed all accents from letters in an effort to reduce further the likelihood of minor variations; however, when Nicholas Tollervey (@ntoll) and Terry started publishing large volumes of book data that included some non-European names it became clear that this convention sometimes went a normalization too far, so the (so-far undocumented) book-u variant convention was born, in which letters are mapped to lower case, but accents are preserved. (This is supported in the python library, but not yet in the web app.)]

These conventions for about tags for books seem to me to hit the sweet spot I was talking about. Book titles, alone, are definitely not sufficiently unique in two different respects: first, it is not uncommon for different authors to write books with the same title; secondly book titles (alone) are frequently shared with other (non-book) items, like films, people, places etc. However, by combining a prefix (book:) that specifies the class of object, together with the title and the author (all normalized), we get something that feels like, for practical purposes, pretty good uniqueness. I would be surprised if there are not examples of pairs of books that share both author and title, but I suspect those are so rare that they will cause us little trouble and (personally) feel quite content to do some ad hoc disambiguation to handle those cases.

Indeed, the pattern of a class prefix, a main identifier, and a disambiguator, feels like a useful pattern for many kinds of Real-World™ entities to me. I’ve been discussing films, for example, with Michael Hawkes, in the comments on another blog post, and there it seems that using either film:title (year) or film:title (director) will probably work well. Again, there might be cases in which two directors sharing a name produce films of the same name, or in which two films of the same name are produced in the same year, but these seem likely to be so rare that ad hoc disambiguation of those cases might be acceptable. It is also, of course, not a coincidence that in the real world films are often identified by title and year or title and director. Michael and I both lean toward year as probably the better disambiguator, so I suspect I will soon be proposing film:title (year) as a convention; though American readers might prefer a “movie:” prefix.

For me, the other great virtue of this style of about tag is that it is very easy to construct the canonical about tag using only information that the user might reasonably expect to have at hand, rather than depending on some kind of external lookup. To labour the point, if I want to tag a book, I probably know the title and author, and can certainly find that information in the book. With a film, I concede, it would be less unusual to know the title but not the year or director, but even there, this data is easily available from multiple sources, crucially including from the film itself.

Perhaps unsurprisingly, there are those who feel that the whole notion of trying to organize, specify, or guide conventions is objectionably authoritarian and/or pointless, and that it would be much better simply to see what emerges organically. (Terry has been known to accuse me of “fascist librarian” tendencies, though I’m sure he means it in the nicest possible way.) Terry and I both studied so-called genetic algorithms, in which evolutionary processes are simulated on computers to tackle search and optimization tasks, and we are both impressed with the power of evolutionary mechanisms. I, however, fear that Fluidinfo doesn’t have the luxury of evolutionary timescales to succeed, and therefore tend to favour trying to help evolution along a little. If you don’t, just ignore all this, do your own thing, and pay no attention to the annoying tsking from Scotland.

19 January 2011

The Music of FluidDB I: Albums, Tracks and Songs

I have been thinking for a while about what conventions to use for tagging various kinds of musical entities in FluidDB. The kinds of things I have in mind include recordings of music, pieces of music (compositions), artists and composers. My firmest conclusion so far is that it’s complicated and I can’t tackle it all in one go.

In particular, classical music feels very complicated to me, with a common situation for a classical “record” being recordings of several pieces with somewhat variable names, often by different composers, being played often by a somewhat fluid and ambiguous collection of musicians.

In this post, therefore, I’m going to try to tackle what feels like a simpler problem by restricting myself to considering non-classical music and three kinds of entities—albums, tracks and songs.

My basic suggestion is to adopt conventions very similar to those I have been championing for books, in the form of the book-1 convention.

Books (Recap)¶

Recall that the book-1 convention for about tags for books in English has the following basic components:

the prefix book:

the title of the book, normalized using NACO-like conventions, which standardize to lower case, remove most punctuation and accents and regularize spacing;

the author, again normalized in a NACO-like manner, in parentheses.

For example, Alice in Wonderland, by Lewis Carroll, uses the about tag

book:alice in wonderland (lewis carroll)

So far this convention seems to have worked quite well. Its virtues include:

it is simple to construct with only easily available information (the stuff you can see if you have the book or a normal reference to it)

it is unique for almost all books

it is clearly identified as a book (and thus disambiguated from a film, for example).

The next stage beyond a single-author book is multi-author books, and there the convention is simply to list the authors, in the order they appear on the book, separated by semicolons. For example, The Feynman Lectures on Physics, by Richard P. Feynman, Robert B. Leighton and Matthew Sands uses the about tag:

book:the feynman lectures on physics (richard p feynman; robert b leighton; matthew sands)
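The recipe above can be sketched in a few lines of Python. This is only an illustration, not the abouttag library's implementation: the function names `normalize` and `book_about_tag` are made up for this example, and the normalization is a simplified stand-in for the NACO-like rules (it lower-cases, strips accents, deletes apostrophes, turns other punctuation into spaces and collapses whitespace).

```python
import re
import unicodedata


def normalize(text):
    """Simplified, NACO-like normalization (an approximation for illustration)."""
    # Split accented characters into base letter + combining mark, then drop the marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = stripped.lower()
    # Apostrophes vanish without leaving a gap ("Doherty's" -> "dohertys").
    lowered = re.sub(r"['\u2018\u2019]", "", lowered)
    # Any other character that is not a letter, digit or space becomes a space.
    spaced = re.sub(r"[^a-z0-9 ]", " ", lowered)
    # Collapse runs of whitespace and trim the ends.
    return " ".join(spaced.split())


def book_about_tag(title, authors):
    """Build a book-1 style about tag; authors is a list in cover order."""
    author_part = "; ".join(normalize(a) for a in authors)
    return "book:%s (%s)" % (normalize(title), author_part)


print(book_about_tag("Alice in Wonderland", ["Lewis Carroll"]))
print(book_about_tag(
    "The Feynman Lectures on Physics",
    ["Richard P. Feynman", "Robert B. Leighton", "Matthew Sands"],
))
```

Passing the authors as a list, rather than as one string, keeps the question of where one author ends and the next begins out of the normalization step.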

Albums, Tracks and Songs¶

Recorded non-classical music consists primarily of albums—a named collection of tracks, normally purchased together—and individual tracks, sometimes known as singles or songs.

At the simplest level, the conventions I am going to propose for about tags for albums and tracks are very similar to those for books but using the prefixes album: and track:. So the album, The Dark Side of the Moon, by Pink Floyd, is

album:the dark side of the moon (pink floyd)

and the track The Great Gig in the Sky, from that same album, is

track:the great gig in the sky (pink floyd)

But there are a number of points to discuss.

Albums¶

The suggested about tag for albums is fairly straightforward. The main complication/ambiguity I can see concerns multi-volume sets. So, on vinyl, for example, Neil Young’s Decade has three disks; and it is a double CD. This is quite an easy case: I think we ignore the ‘disk’ number entirely and just regard double and triple albums as albums. So all of Decade is:

album:decade (neil young)

For multi-volume collections that are normally sold separately, simply include the volume number. So, for example, The Tatum Group Masterpieces Volume 1, by Art Tatum, Benny Carter, Louis Bellson, becomes

album:the tatum group masterpieces volume 1 (art tatum; benny carter; louis bellson)

The NACO-like normalization conventions were described in this post and are implemented in the abouttag library.

The handling of artists is in principle quite simple, though in practice slightly hard to automate completely. My suggestion is that whenever there is a list of musicians, as with authors, they are simply separated with semicolons (and a space); any ampersands or ands are removed. In the case of groups, the group name is simply used. The interesting and slightly troubling cases are those where a group combines with a person. The most common case of this is exemplified by Diana Ross and the Supremes. My suggestion is that such cases are left intact, other than normalization, using ‘and’ rather than ampersand (&). So the album “Reflections” becomes

album:reflections (diana ross and the supremes)

There are probably awkward corner cases, but I think this handles most.
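As a rough sketch of the artist handling just described, the following Python takes a list of artists that has already been separated by hand. The names `artist_part` and `album_about_tag` are hypothetical, and the normalization here is deliberately minimal (lower-casing and whitespace clean-up only), so it should be read as an illustration rather than as the library's behaviour.

```python
def artist_part(artists):
    """Join a hand-separated list of artists into the parenthesised part of a tag.

    Each list entry is one performer or one group name. Deciding where one
    artist ends and the next begins is left to the caller, since that is the
    step that needs human judgement.
    """
    cleaned = []
    for name in artists:
        # Inside a single name, '&' is spelled out as 'and'.
        name = name.replace("&", " and ")
        # Lower-case and collapse whitespace (fuller normalization is omitted here).
        cleaned.append(" ".join(name.lower().split()))
    return "; ".join(cleaned)


def album_about_tag(title, artists):
    """Build an album about tag from a title and a list of artists."""
    return "album:%s (%s)" % (" ".join(title.lower().split()), artist_part(artists))


# A group combined with a person stays as one entry.
print(album_about_tag("Reflections", ["Diana Ross & the Supremes"]))
# Two separate musicians are passed as two entries.
print(album_about_tag("A Matter Of Time", ["Gordon Giltrap", "Martin Taylor"]))
```

The difference between the two calls is entirely in how the caller splits the credit, which is the part that resists automation.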

The biggest problem I foresee is that it will be hard to automate the construction of the standard form of an artist from something like iTunes metadata because the input (from Gracenote) doesn’t separate out a list of artists in any remotely consistent way, so I think standardizing them will require a degree of human intervention. This is not, however, in any way particular to this suggested convention; it’s fundamentally to do with the fact that some artists are identified as a list of people, and others have a group name, and telling these apart is hard, even without complications such as the band Alice Cooper!

Here are a few examples of the sorts of album about tags I’m suggesting:

The Black Balloon, by John Renbourn album:the black balloon (john renbourn)

The Composer, by Thelonious Monk album:the composer (thelonious monk)

Fleetwood Mac, by Fleetwood Mac album:fleetwood mac (fleetwood mac)

Wu Wei, by Pierre Bensusan album:wu wei (pierre bensusan)

The Tatum Group Masterpieces Volume 1, by Art Tatum, Benny Carter, Louis Bellson album:the tatum group masterpieces volume 1 (art tatum; benny carter; louis bellson)

Ms. Right, by Duck Baker album:ms right (duck baker)

‘Round About Midnight, by The Miles Davis Quintet album:round about midnight (the miles davis quintet)

A Matter Of Time, by Gordon Giltrap & Martin Taylor album:a matter of time (gordon giltrap; martin taylor)

Musiques / Solilaï, by Pierre Bensusan album:musiques solilai (pierre bensusan)

Live Au New Morning, by Bensusan & Malherbe album:live au new morning (bensusan; malherbe)

Eye To The Telescope, by KT Tunstall album:eye to the telescope (k t tunstall)

Grace & Danger, by John Martyn album:grace & danger (john martyn)

Alas, I Cannot Swim, by Laura Marling album:alas i cannot swim (laura marling)

Lady In Autumn: The Best Of The Verve Years, by Billie Holiday album:lady in autumn the best of the verve years (billie holiday)

Tracks¶

I was originally minded to suggest using song: as the prefix for individual album tracks, notwithstanding the fact that this is slightly inappropriate for instrumental pieces. This was until I realised that we will certainly want to have entries for songs themselves (independent of artist) in FluidDB. Given this, I think we have little choice but to fall back to track, which is perhaps more appropriate anyway.

I think there are a couple of points to be made about tracks. The first is that I do not propose to tie them to albums. Thus if an artist records a track (piece/song), I suggest that in the common case we don’t distinguish between different recordings. When you talk about Billie Holiday’s recording of God Bless the Child, you actually talk about all her recordings of that song, in the general case.

track:god bless the child (billie holiday)

Similarly, if, as is quite common, a track is qualified by (live) or [live], I suggest that be omitted in the standard case.
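Dropping the qualifier can be done before the title is normalized. Here is a small Python sketch; the helper name `strip_live_qualifier` is invented for this example, and it only handles a qualifier at the very end of the title, which is an assumption about where it usually appears.

```python
import re

# Matches a trailing "(live)" or "[live]", in any letter case, with optional spaces.
_LIVE_SUFFIX = re.compile(r"\s*[\(\[]\s*live\s*[\)\]]\s*$", re.IGNORECASE)


def strip_live_qualifier(title):
    """Return the title without a trailing (live) or [live] qualifier."""
    return _LIVE_SUFFIX.sub("", title)


print(strip_live_qualifier("God Bless the Child (live)"))
print(strip_live_qualifier("God Bless the Child [Live]"))
# Titles that merely contain the word are left alone.
print(strip_live_qualifier("Live Au New Morning"))
```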

The other reasonably common complication, particularly for folk music, is the medley. In this case, my suggestion is just to hand the track name to the NACO-like normalization routine and use what it produces. In most cases, this works fine.

To try to illustrate lots of common cases, here is a fairly long list of examples:

Rhythm-a-Ning, by Thelonious Monk track:rhythm a ning (thelonious monk)

Round Midnight, by Thelonious Monk track:round midnight (thelonious monk)

Straight, No Chaser, by Thelonious Monk track:straight no chaser (thelonious monk)

Bourrée I and II, by John Renbourn track:bourree i and ii (john renbourn)

Medley: The Mist Covered Mountains of Home / The Orphan / Tarboulton, by John Renbourn track:medley the mist covered mountains of home the orphan tarboulton (john renbourn)

Monday Morning, by Fleetwood Mac track:monday morning (fleetwood mac)

Poussière d’Amants, by Pierre Bensusan track:poussiere damants (pierre bensusan)

Doherty’s - Return to Milltown - Tommy People’s, by Tony McManus track:dohertys return to milltown tommy peoples (tony mcmanus)

Jackie Coleman’s - The Milliner’s Daughter - Rakish Paddy - Connor Dunn’s, by Tony McManus track:jackie colemans the milliners daughter rakish paddy connor dunns (tony mcmanus)

Blues in C, by Art Tatum, Benny Carter, Louis Bellson track:blues in c (art tatum; benny carter; louis bellson)

S’Wonderful, by Art Tatum, Benny Carter, Louis Bellson track:swonderful (art tatum; benny carter; louis bellson)

Makin’ Whoopee, by Art Tatum, Benny Carter, Louis Bellson track:makin whoopee (art tatum; benny carter; louis bellson)

(I’m Left With the) Blues in my Heart, by Art Tatum, Benny Carter, Louis Bellson track:im left with the blues in my heart (art tatum; benny carter; louis bellson)

The Nine Maidens a. Clarsach b. The Nine Maidens c. The Fiddler, by John Renbourn track:the nine maidens a clarsach b the nine maidens c the fiddler (john renbourn)

Ms. Right, by Duck Baker track:ms right (duck baker)

‘Round Midnight, by The Miles Davis Quintet track:round midnight (the miles davis quintet)

Ah-Leu-Cha, by The Miles Davis Quintet track:ah leu cha (the miles davis quintet)

Across The Pond, by Gordon Giltrap & Martin Taylor track:across the pond (gordon giltrap; martin taylor)

G & T Blues, by Gordon Giltrap & Martin Taylor track:g & t blues (gordon giltrap; martin taylor)

Abide With Me / Old Gloryland, by Stefan Grossman & John Renbourn track:abide with me old gloryland (stefan grossman; john renbourn)

Badhra, by Anouar Brahem, John Surman, Dave Holland track:badhra (anouar brahem; john surman; dave holland)

Biodag Aig Mac Thomais/The Nine Pint Coggie/The Spike Island Lasses, by Tony McManus track:biodag aig mac thomais the nine pint coggie the spike island lasses (tony mcmanus)

Three Pieces By O’Carolan;The Lamentation Of Owen Roe O’Neill; Lord Inchiquin; Mrs Power (O’Carlan’s Concerto), by John Renbourn track:three pieces by ocarolan the lamentation of owen roe oneill lord inchiquin mrs power ocarlans concerto (john renbourn)

Heman Dubh, by Pierre Bensusan track:heman dubh (pierre bensusan)

Le Voyage pour L’Irelande, by Pierre Bensusan track:le voyage pour lirelande (pierre bensusan)

50 Ways To Leave Your Lover, by Paul Simon track:50 ways to leave your lover (paul simon)

La Danse Du Capricorne 1, by Pierre Bensusan track:la danse du capricorne 1 (pierre bensusan)

Reels - “The Pure Drop”/”The Flax In Bloom”, by Pierre Bensusan track:reels "the pure drop" "the flax in bloom" (pierre bensusan)

Mille Vallées, by Bensusan & Malherbe track:mille vallees (bensusan; malherbe)

Bamboo Shoot (Improvisation), by Bensusan & Malherbe track:bamboo shoot improvisation (bensusan; malherbe)

Black Horse And The Cherry Tree, by KT Tunstall track:black horse and the cherry tree (k t tunstall)

Universe & U, by KT Tunstall track:universe & u (k t tunstall)

Sigmund Freud’s Impersonation Of Albert Einstein In America, by Randy Newman track:sigmund freuds impersonation of albert einstein in america (randy newman)

Mr. President (Have Pity On The Working Man), by Randy Newman track:mr president have pity on the working man (randy newman)

I Love L.A., by Randy Newman track:i love l a (randy newman)

The Blues, by Randy Newman track:the blues (randy newman)

Through-Us-All, by Isaac Guillory track:through us all (isaac guillory)

A Terrible Pickle, by Dean Friedman track:a terrible pickle (dean friedman)

Money, by Pink Floyd track:money (pink floyd)

Take Five, by Dave Brubeck Quartet track:take five (dave brubeck quartet)

Pirates (So Long Lonely Avenue), by Rickie Lee Jones track:pirates so long lonely avenue (rickie lee jones)

The Returns, by Rickie Lee Jones track:the returns (rickie lee jones)

Chuck E’s In Love, by Rickie Lee Jones track:chuck es in love (rickie lee jones)

Harry’s House/Centerpiece, by Joni Mitchell track:harrys house centerpiece (joni mitchell)

I’s A Muggin’ (Rap), by Joni Mitchell track:is a muggin rap (joni mitchell)

Miles Beyond, by Mahavishnu Orchestra track:miles beyond (mahavishnu orchestra)

A Surfer Courted Me, by Martha Tilston and the Woods track:a surfer courted me (martha tilston and the woods)

Lookin’ On, by John Martyn track:lookin on (john martyn)

The Captain And The Hourglass, by Laura Marling track:the captain and the hourglass (laura marling)

Le Chien Sur Les Genoux de la Devineresse, by Anouar Brahem, Barbaros erkose, Kudsi Erguner & Lassad Hosni track:le chien sur les genoux de la devineresse (anouar brahem; barbaros erkose; kudsi erguner; lassad hosni)

A Prayer, by Madeleine Peyroux track:a prayer (madeleine peyroux)

Was I?, by Madeleine Peyroux track:was i (madeleine peyroux)

(I Got A Man Crazy For Me) He’s Funny That Way, by Billie Holiday track:i got a man crazy for me hes funny that way (billie holiday)

Lover Man (Oh, Where Can You Be?), by Billie Holiday track:lover man oh where can you be (billie holiday)

St. Louis Blues, by Billie Holiday track:st louis blues (billie holiday)

Songs¶

[UPDATE 2011/01/19: I have modified this recommendation since it was first posted, after thinking more about the lack of consistency in how composers are identified.]

I have given less thought to songs (as distinct from tracks, or recordings of songs), but the obvious convention would seem to be to use the song: prefix, followed by the normalized song title, followed by the composer or composers in brackets, again in whatever order they are normally listed. The only real complication I can see there is the fairly common case in which music and lyrics are given separate credits. In that case, I suggest simply listing the music composer ahead of the lyrics composer.

The slightly subtle question concerns how to standardize the composer’s name. In the case of artists (and authors) my normal recommendation is to start from the name as it appears on the work, so John Martyn, J. D. Salinger etc. This works well because you just have to look at the work to see how it is written; and for this reason, there’s a well-defined, standard place to look (the work).

Composers are more awkward, because it is much less clear where to look. If you own a record, the easy thing to do is to look at the sleeve, or the liner notes, or sometimes on the record (or CD) itself. But the same song can be recorded many times and the composer won’t always be displayed consistently. You could also look at the sheet music. Or in Wikipedia. In short, there is no consistency. A quick look through the first half dozen makes it clear there’s not even consistency on a single CD in many cases.

In this case, therefore, my recommendation is to use surnames only. So in a simple case, Summertime by George Gershwin, is

song:summertime (gershwin)

The Lennon/McCartney partnership would produce, for example

song:hey jude (lennon; mccartney)

A case in which lyrics and music are credited separately would be Officer Krupke, from West Side Story, by Leonard Bernstein (music) and Stephen Sondheim (lyrics). So this would be:

song:officer krupke (bernstein; sondheim)

The reason I’ve gone for surname only is that it seems to involve very little loss of precision (it will be rare indeed for two songs with the same title to have different composers with the same surname but different forenames), and to use the smallest amount of information that is commonly available. I think this is probably a fairly good convention.
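To make the surname-only rule concrete, here is a hedged Python sketch. The function name `song_about_tag` is hypothetical, the title handling is reduced to lower-casing and whitespace clean-up, and the surname is naively assumed to be the last word of each name, which will not hold for every composer.

```python
def song_about_tag(title, composers):
    """Build a song about tag.

    composers holds full names in credit order, with the music composer
    ahead of the lyricist when the two are credited separately.
    """
    # Naive assumption: the surname is the last whitespace-separated word.
    surnames = [name.split()[-1].lower() for name in composers]
    clean_title = " ".join(title.lower().split())
    return "song:%s (%s)" % (clean_title, "; ".join(surnames))


print(song_about_tag("Summertime", ["George Gershwin"]))
print(song_about_tag("Hey Jude", ["John Lennon", "Paul McCartney"]))
```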

Comments Invited¶

As ever, I’d be interested in thoughts from anyone, in the blog comments or directly. I haven’t pushed an updated version of the abouttag library containing these to github yet, but will probably do so in a few days unless there is significant push-back.

03 January 2011

100 Bestsellers in FluidDB: So What?

This evening I published another hundred books to FluidDB. This time it was a list of the 100 best-selling books of the last 12 years, as published by the ever-wonderful Guardian Data Store. I published them as a table, mostly using the conventions documented in this post. If you’re using a modern browser (almost anything other than Internet Explorer) you can see a visualization of the FluidDB object for the table at abouttag.com/butterfly/about/table:bestsellers-1998-2010 and the best-selling book (Dan Brown’s The Da Vinci Code, depressingly) at abouttag.com/butterfly/about/book:the da vinci code (dan brown).

There’s a tag on each book that hyperlinks to the next, so you could even click through all hundred if you really wanted.

So What?¶

Why should you or anyone else care that I’ve published this data to FluidDB? After all, the Guardian made the data available on Google docs, so anyone can do anything with it anyway. What’s the benefit of having it in FluidDB? I’m going to try to show a few things that might convince you there’s something interesting about putting this sort of data in FluidDB.

1. Query¶

The most obvious thing is that you can query data in FluidDB from anywhere with internet access without even having an account. For example, the query to find the best-selling book from the list in FluidDB is this:

miro/bestsellers-1998-2010/rank = 1

You can issue this query from anything that can talk to FluidDB and it should return the object corresponding to The Da Vinci Code, by Dan Brown. Here are a few ways of doing just that.

You can use the FluidDB Explorer and just paste the query into the query box. It should locate the object (identifying it by its about tag, which is book:the da vinci code (dan brown), and also by its FluidDB ID, which is e7fee95f-4dcd-458b-8893-b56352d455cf). If you then click on either the about tag or the object ID, the explorer will give you a list of tags on the object, and tell you there are too many to show (which isn’t really true). If you then click ‘Load all tag values’ it will get them and show them to you.
You can use my python library fdb, which comes with a command line tool, and type:
fdb show -q 'miro/bestsellers-1998-2010/rank = 1' /miro/bestsellers-1998-2010/title /miro/bestsellers-1998-2010/author
This produces the following output:
1 object matched
Object e7fee95f-4dcd-458b-8893-b56352d455cf:
  /miro/bestsellers-1998-2010/title = "The Da Vinci Code"
  /miro/bestsellers-1998-2010/author = "Dan Brown"
You can use curl at the command line (which is a utility installed on most systems by default) and type
curl 'http://fluiddb.fluidinfo.com/values?query=miro/bestsellers-1998-2010/rank%3D1&tag=miro/bestsellers-1998-2010/title&tag=miro/bestsellers-1998-2010/author'
which produces:
{
  "results":
  {
    "id":
    {
      "e7fee95f-4dcd-458b-8893-b56352d455cf":
      {
        "miro/bestsellers-1998-2010/author": {"value": "Dan Brown"},
        "miro/bestsellers-1998-2010/title": {"value": "The Da Vinci Code"}}
    }
  }
}
(I’ve reformatted this slightly, but otherwise this is the exact output from FluidDB.)
You could even just use the query directly in your browser’s URL bar. Pasting the following into the address bar should work in almost all browsers
http://fluiddb.fluidinfo.com/values?query=miro/bestsellers-1998-2010/rank%3D1&tag=miro/bestsellers-1998-2010/title&tag=miro/bestsellers-1998-2010/author
again, producing:
{
  "results":
  {
    "id":
    {
      "e7fee95f-4dcd-458b-8893-b56352d455cf":
      {
        "miro/bestsellers-1998-2010/author": {"value": "Dan Brown"},
        "miro/bestsellers-1998-2010/title": {"value": "The Da Vinci Code"}}
    }
  }
}
In quite a few browsers, even the following will work:
http://fluiddb.fluidinfo.com/values?query=miro/bestsellers-1998-2010/rank=1&tag=miro/bestsellers-1998-2010/title&tag=miro/bestsellers-1998-2010/author
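The same request can be assembled programmatically. The sketch below uses only Python's standard library to build the URL and to read a response; `values_url` is a name invented for this example, and the response is the JSON text quoted above rather than the result of a live request.

```python
import json
from urllib.parse import urlencode

BASE = "http://fluiddb.fluidinfo.com/values"


def values_url(query, tags):
    """Build a /values URL for a query and the list of tags to return."""
    params = [("query", query)] + [("tag", tag) for tag in tags]
    # safe="/" leaves the slashes in tag paths unescaped; '=' still becomes %3D.
    return BASE + "?" + urlencode(params, safe="/")


url = values_url(
    "miro/bestsellers-1998-2010/rank=1",
    ["miro/bestsellers-1998-2010/title", "miro/bestsellers-1998-2010/author"],
)
print(url)

# Response body copied from the output shown earlier in this post.
sample = """
{"results": {"id": {"e7fee95f-4dcd-458b-8893-b56352d455cf": {
    "miro/bestsellers-1998-2010/author": {"value": "Dan Brown"},
    "miro/bestsellers-1998-2010/title": {"value": "The Da Vinci Code"}}}}}
"""
results = json.loads(sample)["results"]["id"]
for object_id, tags in results.items():
    title = tags["miro/bestsellers-1998-2010/title"]["value"]
    author = tags["miro/bestsellers-1998-2010/author"]["value"]
    print(object_id, title, author)
```

Letting `urlencode` do the escaping avoids having to remember which characters, such as the equals sign inside the query, need percent-encoding.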

2. More interesting queries¶

Regular readers of this blog will recall that I previously published a rather larger set of 1,000 books to FluidDB. These were again originally from the Guardian (though they pre-dated the data store/data blog) and this time consisted of the Guardian’s 1,000 novels that everyone must read. (See this post and this post for details.)

So an obvious question is: which of those original Guardian 1,000 books are in the 100 bestsellers of the last 12 years? The following FluidDB query will tell you:

has miro/books/guardian-1000 and has miro/bestsellers-1998-2010/title

(If I’d picked my tags better, this query would have been even simpler, but it’s not bad.)

As an illustration, if I issue that query, again asking for author and title, I get the following (using fdb):

fdb show -q 'has miro/books/guardian-1000 and has miro/bestsellers-1998-2010/title' /about /miro/books/title /miro/books/author

7 objects matched

Object ce180ce3-29b5-4abc-a031-64015b162f6a:
  /fluiddb/about = "book:birdsong (sebastian faulks)"
  /miro/books/title = "Birdsong"
  /miro/books/author = "Sebastian Faulks"

Object a2fa68ae-d409-422f-887a-dbdb7c1b4f18:
  /fluiddb/about = "book:atonement (ian mcewan)"
  /miro/books/title = "Atonement"
  /miro/books/author = "Ian McEwan"

Object d5ff7995-2ae6-4ba8-8549-ea1d0726484c:
  /fluiddb/about = "book:the kite runner (khaled hosseini)"
  /miro/books/title = "The Kite Runner"
  /miro/books/author = "Khaled Hosseini"

Object c64aeced-1505-4bb3-ab8a-0ce4c6a70ba3:
  /fluiddb/about = "book:white teeth (zadie smith)"
  /miro/books/title = "White Teeth"
  /miro/books/author = "Zadie Smith"

Object 7e076540-3e14-4232-8c46-13863bae77ec:
  /fluiddb/about = "book:the curious incident of the dog in the night time (mark haddon)"
  /miro/books/title = "The Curious Incident of the Dog in the Night-Time"
  /miro/books/author = "Mark Haddon"

Object 5be745bd-500d-458b-b4e6-dd08972b73f6:
  /fluiddb/about = "book:to kill a mockingbird (harper lee)"
  /miro/books/title = "To Kill A Mockingbird"
  /miro/books/author = "Harper Lee"

Object 3b416fa5-51ab-4160-9820-240a0591c3a2:
  /fluiddb/about = "book:the time travelers wife (audrey niffenegger)"
  /miro/books/title = "The Time Traveler's Wife"
  /miro/books/author = "Audrey Niffenegger"

(I’ve added some blank lines, but otherwise this is the raw output from fdb.)

Or perhaps I’d like to know all the books that sold over 2,000,000 copies. For that, the relevant FluidDB query is just:

miro/bestsellers-1998-2010/volume > 2000000

Again, illustrating with fdb, and this time asking only for the about tag that FluidDB uses to identify the object, we get this (faintly depressing) list:

fdb show -q 'miro/bestsellers-1998-2010/volume > 2000000' /about

8 objects matched

Object b2ff54a0-d94e-4fe1-951f-a4bd839ba219:
  /fluiddb/about = "book:harry potter and the half blood prince childrens edition (j k rowling)"

Object e7fee95f-4dcd-458b-8893-b56352d455cf:
  /fluiddb/about = "book:the da vinci code (dan brown)"

Object 04033298-9be8-41b8-b9ef-d1b11b1adfb9:
  /fluiddb/about = "book:harry potter and the philosophers stone (j k rowling)"

Object 60c5bbea-2568-4a68-825f-ffc4cfb20f88:
  /fluiddb/about = "book:harry potter and the prisoner of azkaban (j k rowling)"

Object 9258d0da-a65a-471b-abdb-277b68ea1ea0:
  /fluiddb/about = "book:harry potter and the chamber of secrets (j k rowling)"

Object 04a4b407-7f21-450b-83c3-2d840ef6a133:
  /fluiddb/about = "book:deception point (dan brown)"

Object 23d5a20a-ba28-43b8-9265-2afd8c4019ee:
  /fluiddb/about = "book:twilight (stephenie meyer)"

Object 36bc89a5-d91c-4cdf-9389-ff2fbe833d59:
  /fluiddb/about = "book:angels and demons (dan brown)"

3. Combining Data Sources¶

One of the things that is really interesting about this example is to look at the seven books that overlap. For example, Audrey Niffenegger’s wonderful book, The Time Traveler’s Wife, is on both lists. A core idea of FluidDB is that different information comes to be associated by being placed on the same FluidDB object. The about tag (fluiddb/about) can be used to choose the object. In the case of novels, that object is identified [1] by an about tag of the form book:title (author); in this case, book:the time travelers wife (audrey niffenegger). Obviously, there’s room for ambiguity with case and punctuation etc., but there’s a library and a website that will sort most of that out for you.

When I uploaded data on the Guardian 1000 books, (as the miro user) there wasn’t all that much information—author, title, year and the fact that it was on the Guardian 1000 list is pretty much all that was there. For example, here is what Aldous Huxley’s Brave New World looks like:

(Live version here)

In the case of the best-sellers, the dataset contained a bit more information including sales volume, publisher, average selling prices and total sales value.

The marvellous thing is that books that are on both lists automatically get all the data from both sources, simply because they both chose the same FluidDB objects (e.g., the one with the about tag book:the time travelers wife (audrey niffenegger), which you can see live here in a modern browser), or as it is at the time I write:

When I published the second list, I found that it included books that I had already rated in FluidDB. For example, I had already (personally, as njr) rated Small Island, by Andrea Levy, and as a result, when (as the miro user) I published the list of bestsellers, my njr/rating was already on some of them.

I think this is a powerful example of the potential of FluidDB, one that would be even more potent if it had been someone other than I (albeit as Miró) who had previously published at least one of the two lists. But the point is, anyone following the convention about where to put data about books in FluidDB could equally easily have published the data with the same result. As usage of the system increases, we will see this more and more.
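The mechanism can be mimicked with a toy model in Python. Nothing here talks to FluidDB: `objects`, `tag_object` and `has_all` are invented for the illustration, a dictionary keyed by about tag stands in for the object store, and `None` is a placeholder wherever the post does not show a tag's value.

```python
from collections import defaultdict

# Toy stand-in for FluidDB: about tag -> {tag path: value}.
objects = defaultdict(dict)


def tag_object(about, tag, value):
    """Attach a tag to the object chosen by its about tag."""
    objects[about][tag] = value


def has_all(*tags):
    """About tags of objects carrying every named tag (like 'has a and has b')."""
    return [about for about, tagset in objects.items()
            if all(tag in tagset for tag in tags)]


wife = "book:the time travelers wife (audrey niffenegger)"
brave = "book:brave new world (aldous huxley)"

# First publisher: the Guardian 1000 list. None is a placeholder value.
tag_object(wife, "miro/books/guardian-1000", None)
tag_object(brave, "miro/books/guardian-1000", None)

# Second publisher: the bestsellers list, using the same about-tag convention.
tag_object(wife, "miro/bestsellers-1998-2010/title", "The Time Traveler's Wife")

# Only the book that both publishers tagged carries both tags.
print(has_all("miro/books/guardian-1000", "miro/bestsellers-1998-2010/title"))
```

Neither publisher needs to know about the other; agreeing on the about tag is enough for the two sets of tags to land on the same object.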

Go Explore; Go Tag¶

This post just scratched the surface, but I hope it begins to show the real and tangible benefits of publishing data to FluidDB. The data becomes capable of being queried. Multiple data sources combine, sometimes in ways that had not been foreseen. It can be accessed visually, from a command line, or programmatically. And you can add your own data, whether it be annotations, ratings, comments, associations or whatever.

So go explore; and if you like, get an account and start tagging/publishing.

[1]

Regular readers will know that I am a rather strong advocate of conventions for about tags in FluidDB in general, and for this convention for books in particular. Anyone can publish any data to FluidDB using any objects or conventions they like; but, as this post illustrates, there are real benefits when different datasets concerning overlapping things use common conventions.