26 August 2009

The Guardian 1000 Novels Everyone Must Read

I wanted to create a few objects that would begin to show some other aspects of FluidDB.

Some may remember that in January The Guardian ran a series of articles entitled 1000 Novels Everyone Must Read. They published the list in themed sections, which made a lot of sense for the paper but made it surprisingly hard to see what the 1,000 were, so I spent a while building a digital version of the list, which I blogged about and published here. (Ironically, they then did the same thing, but my work wasn’t entirely wasted, because at least I ended up with a nice, clean, structured, electronic version of their list.)

But . . .

Wouldn’t it be great if people could stick ratings on those 1,000 novels, or indicate which ones they’ve read, and also augment the list with novels they think should have been on there? And wouldn’t it be insanely great (in that strangely familiar phrase) if you could then find out things like

  • which novels your friends have read and rated highly that you haven’t read
  • which of your friends seem to like the same novels as you
  • who of your non-friends like the same novels as you
  • which novels have the highest ratings from users.

This is exactly the kind of thing FluidDB was built for.
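
To make the first three questions in that list concrete, here is a minimal Python sketch that runs over an in-memory stand-in for the data rather than over FluidDB itself. Everything in it is an assumption made for illustration: the dictionary layout, the users alice and bob and their ratings, the threshold of 8, the function names, and the shortcut of treating "has rated" as "has read". None of it is FluidDB's API.

```python
# In-memory stand-in for "username/rating" tags: user -> {about value -> rating}.
# The users, ratings and helper names are hypothetical illustrations.
ratings = {
    "njr":   {"isbn:978-0330258647": 9, "isbn:978-0140621198": 6},
    "alice": {"isbn:978-0330258647": 10, "isbn:978-0141023380": 9},
    "bob":   {"isbn:978-0141023380": 8, "isbn:978-0140621198": 3},
}


def friends_recommend(me, friends, threshold=8):
    """Books my friends rated at or above threshold that I have not rated."""
    mine = ratings.get(me, {})
    found = set()
    for friend in friends:
        for about, score in ratings.get(friend, {}).items():
            if score >= threshold and about not in mine:
                found.add(about)
    return sorted(found)


def mean_difference(user_a, user_b):
    """Mean absolute rating gap over books both users rated (None if no overlap)."""
    a, b = ratings.get(user_a, {}), ratings.get(user_b, {})
    shared = a.keys() & b.keys()
    if not shared:
        return None
    return sum(abs(a[k] - b[k]) for k in shared) / len(shared)


def most_similar(me, candidates):
    """Candidates sharing at least one rated book with me, closest taste first."""
    scored = [(mean_difference(me, c), c) for c in candidates if c != me]
    return [c for _, c in sorted(s for s in scored if s[0] is not None)]
```

With this sample data, friends_recommend("njr", ["alice", "bob"]) returns ['isbn:978-0141023380'], and most_similar("njr", ["alice", "bob"]) puts alice first. The same most_similar call works for friends and non-friends alike; only the list of candidates changes.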

So, I’ve made a start by creating FluidDB objects for some [1] of the novels.

The Hitchhiker’s Guide to FluidDB

In order to load information into FluidDB in a useful way, we need to decide how we want to represent it. FluidDB is fantastically relaxed about this, and it is not necessary for everyone to agree. But at the same time, for users to share data meaningfully, and even for a single user to be able to find information systematically, fixing a structure is really useful.

This is what I am using so far.

http://StochasticSolutions.com/fluiddb/image/hitchhiker.png

In principle, it seems to me that the ISBN number is the natural way to identify a book (despite some limitations), so I’ve chosen that as the about tag (even though they’re painful for me to look up).

In a perfect world, the tags in grey would have some namespace with more authority than njr (ideally isbn.org:), and in time that may come, quite possibly using the very same object (edfaaf70-b9b5-42f1-ad4c-30601b70a2ac). But for now, I’ve stuck the basic metadata in my namespace (njr), and that should be enough to get things rolling. These tags were put on programmatically using a few lines of Python.

The social tag, of course, is the black one—njr/rating = 9. Unlike the others, which were generated automatically, I put that one on myself, by hand. There are lots of ways of doing this, but I used FDB, with the command

$ fdb tag -av isbn:978-0330258647 rating=9
Tagged object with about="isbn:978-0330258647" with rating = 9

So: 10 down, 990 to go. I will add the rest as soon as I can find a reasonable way of getting their ISBN numbers.

Tag Conventions

While it is fundamental to FluidDB that each user and application can make their own choices, I believe good conventions are useful and will nurture the growth of FluidDB. (Terry refers to my desire for conventions as my fascist librarian tendencies, but I’m sure he means it kindly.) I will maintain my personal suggested conventions in a post on this blog, which I’ll make soon. The first ones will be these:

The About Tag

  • Books: I suggest a primary convention that matches about="isbn:978-0330258647". [2] That is, the four lower-case letters ‘isbn’, followed by a colon, followed by the 13-digit ISBN number.
  • Books without an ISBN number: I am going to use about="book:Name of Book/Author" for now, but that’s not very precise. (Maybe eventually FluidDB IDs will replace ISBN numbers.)
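
As a rough illustration of the ISBN convention, the Python sketch below builds an about value in the form used in this post from an ISBN typed with or without separators. The function names are hypothetical. The checksum rules are the standard ISBN-10 and ISBN-13 ones, and the single hyphen after the three-digit prefix simply copies the form shown above; it is not an official ISBN layout.

```python
import re


def isbn13_check_digit(first12):
    """Check digit for the first 12 digits of an ISBN-13 (weights alternate 1, 3)."""
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def isbn10_is_valid(isbn10):
    """ISBN-10 checksum: weights 10 down to 1, total divisible by 11; 'X' means 10."""
    values = [10 if c == "X" else int(c) for c in isbn10]
    return sum(w * v for w, v in zip(range(10, 0, -1), values)) % 11 == 0


def about_for_isbn(raw):
    """Return an about value such as 'isbn:978-0330258647', or raise ValueError."""
    compact = re.sub(r"[\s-]", "", raw).upper()
    if re.fullmatch(r"[0-9]{9}[0-9X]", compact):
        if not isbn10_is_valid(compact):
            raise ValueError(f"bad ISBN-10 checksum: {raw!r}")
        # ISBN-10 -> ISBN-13: prefix 978, drop the old check digit, recompute.
        first12 = "978" + compact[:9]
        compact = first12 + isbn13_check_digit(first12)
    elif not re.fullmatch(r"[0-9]{13}", compact):
        raise ValueError(f"not an ISBN-10 or ISBN-13: {raw!r}")
    elif isbn13_check_digit(compact[:12]) != compact[12]:
        raise ValueError(f"bad ISBN-13 checksum: {raw!r}")
    # One hyphen after the three-digit prefix, matching the form used in this post.
    return f"isbn:{compact[:3]}-{compact[3:]}"
```

Both about_for_isbn("978-0-330-25864-7") and about_for_isbn("0330258648") give 'isbn:978-0330258647', so two people who type the same ISBN differently would still arrive at the same about value. Different editions of the same novel would, of course, still get different values.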

Other Tags

  • Ratings: username/rating. Mostly because Terry Jones (the father of FluidDB) has effectively promulgated this with his every description of FluidDB, I suggest ratings on a scale of 0 (worst) to 10 (best). Feel free to use decimals if 11 ratings aren’t enough. Obviously, anyone can go higher (or lower) if they want, but when I’m calculating means and things, I’ll ignore out-of-bounds values.
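
The rule about ignoring out-of-bounds values when calculating means can be sketched in a few lines of Python. The function name and the choice to return None when nothing qualifies are assumptions for illustration; the 0 to 10 bounds come from the convention above.

```python
def mean_rating(values, low=0, high=10):
    """Mean of the ratings that fall within [low, high]; None if none qualify."""
    in_bounds = [
        v for v in values
        # Accept ints and floats (decimals are allowed), but not booleans.
        if isinstance(v, (int, float)) and not isinstance(v, bool)
        and low <= v <= high
    ]
    if not in_bounds:
        return None
    return sum(in_bounds) / len(in_bounds)
```

Out-of-range ratings are dropped rather than clamped to the nearest bound, so a rating of 11 has no effect on the mean instead of counting as a 10.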

The Guardian 10 Novels Everyone Must Read

As noted before, I’ve currently only added objects for the first ten. This is so tiny that I thought I might as well show the FDB output illustrating what’s there. If anyone does have access, do tag these with your rating, or a has_read tag or whatever (as if you need my permission!).

At the time of writing, The Hitchhiker’s Guide to the Galaxy has three ratings; but I bet that increases really fast.

$ fdb show -q 'has njr/guardian-1000' /about title publication-year author-forename author-surname guardian-1000
10 objects matched
Object edfaaf70-b9b5-42f1-ad4c-30601b70a2ac:
  /fluiddb/about = "isbn:978-0330258647"
  /njr/title = "The Hitchhiker's Guide to the Galaxy"
  /njr/publication-year = 1979
  /njr/author-forename = "Douglas"
  /njr/author-surname = "Adams"
  /njr/guardian-1000 = 1

Object 9d49f0b5-fb0b-49c8-b4fc-d82eb35d90e1:
  /fluiddb/about = "isbn:978-1569470039"
  /njr/title = "Silver Stallion"
  /njr/publication-year = 1990
  /njr/author-forename = "Junghyo"
  /njr/author-surname = "Ahn"
  /njr/guardian-1000 = 1

Object 1be19311-1f81-4f70-ba7a-4075d9b06b4d:
  /fluiddb/about = "isbn:978-9774246036"
  /njr/title = "al-Bab al-Maftouh"
  /njr/publication-year = 1960
  /njr/author-forename = "Latifa"
  /njr/author-surname = "al-Zayyat"
  /njr/guardian-1000 = 1

Object f0130f4f-2bc2-4ea4-8e52-d6d4922969a2:
  /fluiddb/about = "isbn:978-0701206048"
  /njr/title = "Death of a Hero"
  /njr/publication-year = 1929
  /njr/author-forename = "Richard"
  /njr/author-surname = "Aldington"
  /njr/guardian-1000 = 1

Object e65b1540-1fe4-46b5-b14d-da32ae592dfe:
  /fluiddb/about = "isbn:978-0141188539"
  /njr/title = "The Face of Another"
  /njr/publication-year = 1964
  /njr/author-forename = "Kobo"
  /njr/author-surname = "Abe"
  /njr/guardian-1000 = 1

Object 9cdfb63a-38e2-4c27-a739-5107ec24d151:
  /fluiddb/about = "isbn:978-1857989984"
  /njr/title = "Non-Stop"
  /njr/publication-year = 1958
  /njr/author-forename = "Brian W"
  /njr/author-surname = "Aldiss"
  /njr/guardian-1000 = 1

Object 14371df6-73d9-43e0-96e5-2008e852d67f:
  /fluiddb/about = "isbn:978-0140621198"
  /njr/title = "Little Women"
  /njr/publication-year = 1868
  /njr/author-forename = "Louisa May"
  /njr/author-surname = "Alcott"
  /njr/guardian-1000 = 1

Object 97401884-3590-4c22-b9dc-0cd2744270eb:
  /fluiddb/about = "isbn:978-1841955612"
  /njr/title = "The Man with the Golden Arm"
  /njr/publication-year = 1949
  /njr/author-forename = "Nelson"
  /njr/author-surname = "Algren"
  /njr/guardian-1000 = 1

Object 4a96c86b-1117-4c93-a815-1494c13fc2cf:
  /fluiddb/about = "isbn:978-0141186900"
  /njr/title = "Anthills of the Savannah"
  /njr/publication-year = 1987
  /njr/author-forename = "Chinua"
  /njr/author-surname = "Achebe"
  /njr/guardian-1000 = 1

Object f95a950e-1b1a-425d-b4ac-b8fc82c99cb3:
  /fluiddb/about = "isbn:978-0141023380"
  /njr/title = "Things Fall Apart"
  /njr/publication-year = 1958
  /njr/author-forename = "Chinua"
  /njr/author-surname = "Achebe"
  /njr/guardian-1000 = 1
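
If you want to work with a listing like this in a script, a small parser is enough. The Python sketch below assumes the output looks exactly like the listing above (an "Object <id>:" header followed by "/tag = value" lines); the function name is hypothetical, and this is not part of FDB itself.

```python
import ast


def parse_fdb_show(text):
    """Turn 'Object <id>:' blocks of '/tag = value' lines into {id: {tag: value}}."""
    objects = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Object ") and line.endswith(":"):
            current = objects.setdefault(line[len("Object "):-1], {})
        elif line.startswith("/") and " = " in line and current is not None:
            tag, raw = line.split(" = ", 1)
            try:
                # Handles quoted strings and plain numbers as printed above.
                current[tag] = ast.literal_eval(raw)
            except (ValueError, SyntaxError):
                current[tag] = raw  # keep anything unexpected as raw text
    return objects


sample = '''10 objects matched
Object edfaaf70-b9b5-42f1-ad4c-30601b70a2ac:
  /fluiddb/about = "isbn:978-0330258647"
  /njr/title = "The Hitchhiker's Guide to the Galaxy"
  /njr/publication-year = 1979
'''
books = parse_fdb_show(sample)
print(books["edfaaf70-b9b5-42f1-ad4c-30601b70a2ac"]["/njr/title"])
```

Lines that are neither an object header nor a tag line, such as the "10 objects matched" summary, are simply skipped.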

[1] OK, only ten so far, but there’s a reason. Doing them all would be really easy if I knew a good way of looking up an ISBN number programmatically. If anyone knows one, please let me know (by leaving a comment, or any other way). Otherwise I’ll have to resort to screen-scraping, which is completely doable, but dull and requires manual intervention when it goes wrong.

[2] When I say the about tag and write about="...", this is shorthand for fluiddb/about. This tag is special in that it is unique (only one object can ever exist with a given value of the about tag) and immutable. This is a basic property of FluidDB, and means that when a rating is attached to the object with about="isbn:978-0330258647", we can be confident that the object will forever more be about that ISBN number.

15 comments:

  1. Thanks for that Nick, very useful. The most interesting part for me was your attempt to set down some conventions for tag semantics and value ranges. I sense a disturbance in the (evolutionist) force, a glitch in the biological matrix, etc, etc. As the speaker said: order, order, order!

  2. Thank you for this article describing your view of tagging the universe.

    Unfortunately this example also shows the difficulties of this endeavour, namely the discoverability of objects via their about tag.
    In this case you used the ISBN number which, while I cannot offer any better attribute of books, is not very suitable.
    Firstly, your formatting included one hyphen. Why one? ISBN numbers are highly structured (read up on it on Wikipedia) and hyphens are usually used as separators between the parts of an ISBN but not everywhere and not consistently. So maybe it would have been better to just drop all hyphens and make it easier for others to discover the object even if they don't know how exactly you entered it.

    Secondly, there are two kinds of ISBN -- 10 digit and 13 digit. IIRC I have seen books that had both printed on them. Now, as I understand it, you can convert a 10-digit version to a 13-digit by prefixing it with "978" and recalculating the check digit, but will users trying to find the object for their book know that?

    Thirdly and most importantly ISBNs identify a publisher's version of the work. I hope we can agree that Mr Adams's classic work is basically the same (at least for the purpose of general review that you outlined), no matter if you have the British version, the US version, a version part of an omnibus edition, etc. But all these will have different ISBNs (case in point: my HHGTTG copy's number is 0-671-52721-5).

    So this begs the question if you should have made the ISBN a regular tag (list valued so one can associate more than one ISBN with a work)?

  3. Holger:

    These are all good points. I'll take them in turn.

    1. I agree (now) that the ISBN number isn't as good as I thought, for exactly the reason you state, namely that there is a many-to-one mapping from isbn numbers to (conceptual) books. I hadn't really realised this; at least not clearly enough. That's a huge problem with my scheme.

    2. I know about ISBN-10 vs. ISBN-13, but that worries me less. But you're right, it's an issue.

    3. Formatting. Mea culpa. The truth is, I took the format from Amazon and assumed it was standard (though if I'd thought about it, I'd have realised it wasn't). In general, I'm a big believer in software accepting separators in numbers because it makes them so much easier for humans. But I should have found the standard. You're right, there's definitely a case for making the tag without any separators, though in fact I think I'm more likely to go for separators in standard places, assuming there is a universal standard for ISBN-13 formatting. (I don't mean universally used; just universally applicable).

    Lots to think about and all good points. It's early days, and fortunately (since these aren't global standards and it's so early) it's easy for me to edit my recommendations.

    The fundamental problem you point out (the non-uniqueness of ISBN numbers) is a big deal, and I may decide to try to find something better. Of course, it could be that this "something better" is a FluidDB object ID, and that the about tag really should be set to some title/author combination in standard form. (And yes, I realise all the problems with that "standard form"...)

    Thanks for the input. I'll think a little before putting up the other 990, which will take me a couple of days anyway, I think.

    What I am unlikely to be deflected from is the attempt to find a meaningful about tag that identifies books, because in my view, that is pretty much necessary in order for cross-user queries to work reliably and take off.

  4. Incidentally, I hope people have noticed that FluidDB has allocated an ID for HHGTTG that includes a 42 hex pair, as well as an (admittedly non-byte-aligned) 2a, which is 42 in hex.

    It's scary.

  5. Libraries and associated bodies have been dealing with managing book data for quite a while! The current thinking in this area is something called 'FRBR' - which defines different levels of description in terms of the following entities:

    Work
    Expression
    Manifestation
    Item

    (see http://en.wikipedia.org/wiki/Functional_Requirements_for_Bibliographic_Records)

    In this model, the Guardian list is a list of works, whereas ISBNs are part of a specific manifestation (as would be other details specific to a published edition).

    However, the problem is that there is no agreed identifier for works.

    You could look at:

    LibraryThing - this has an API, and groups books into 'works' bringing together the various editions
    FictionFinder - from OCLC (a library organisation), which presents 'work' level records for books (user interface at http://fictionfinder.oclc.org/ and some more information at http://www.oclc.org/research/projects/frbr/fictionfinder.htm)
    Open Library - http://openlibrary.org/about/frbrization

    To be honest it's tricky stuff, but there are people around doing a lot of work on it

  6. Owen. Yes; it's more of a minefield than I realised. Sanghyeon Seo pointed me at FRBR too. I'll take a look.

    I wonder whether it's too fanciful to think that FluidDB could actually help with its 128-bit immutable object IDs.

    Maybe we do just establish a convention for describing a work and then use the FluidDB ID as the reference code. Clearly there'd be a major problem defining the canonical form for a work, but starting with title and author in an agreed (algorithmic) encoding would be a start. Obviously, accents, punctuation, editions etc. would all be issues.

    But that's probably simplistic.

    I stepped into a minefield!

    Thanks for the input.

  7. Oh - forgot the LibraryThing URL http://www.librarything.com/

    Also, you might be interested in this paper from Rob Styles et al on RDF representations of book data. Specifically the bits on creating URIs as identifiers might possibly be of relevance:

    http://dynamicorange.com/uploads/Semantic%20Marcup.pdf
    (Rob Styles works for http://www.talis.com who do lots of semantic web/linked data stuff but also do library systems)

  8. Final comment! (I promise) Obviously you weren't setting out to solve all these problems, but rather do some stuff with FluidDB and add some interesting social functions to this list of books!

    You asked about automatic grabbing of ISBNs. Most library systems support an m2m interface. You could look at the WorldCat API (http://www.oclc.org/productworks/worldcatapi.htm), or possibly the LibraryThing API (http://www.librarything.com/services/webservices.php), or finally COPAC (catalogue from major research Unis in the UK), which can return records in an XML format - http://copac.ac.uk/development-blog/tag/api/

    I think you'd find LibraryThing is the best match to the kind of thing you want to do, although not sure about the T&C on their API (generally the guy behind LibraryThing seems pretty open to doing interesting stuff, but he has got a business to run!)

  9. Comment as much as you like Owen: it's all good stuff.

    But detailed replies will have to wait: I need to work.

    I'll check out Library Thing though.

    Leaning towards

    about="The Name of the Book//The Book's Author"

    as a temporary convention till the librarians of the world get their act together :-)

    Thanks for all the input.

  10. Though I'm sure this isn't novel, I'm compelled to highlight that the challenge discussed in these comments -- the lack of (and need for) a truly canonical identifier for books -- applies to just about every kind of entity you might want to annotate in a db. This has been the source of my primary skepticism about the fluiddb concept since I first started thinking about it a few weeks ago: how can an open data-store be socially useful without standardized formulae for identifying the records?

    In fact, this same problem arises for ALL fields in the DB: both in the tag names (why "rating" instead of "score?") and in their values (I might pick a rating scale 1-5, like many popular review websites, and pollute your rating data).

    I anticipate that the social value offered by something like fluiddb will depend heavily on the availability (and adoption) of a supporting system for specifying and referring to SCHEMA descriptions. Something like XSD, allowing applications to describe the conventions employed for storing specific types of data. This may also require an additional data axis in the db, to allow a specific tag to be annotated with its datatype from a given schema.

    Without some sort of metadata store like this, I fear your "open db" may quickly devolve into an unmanageable, unstructured mess with no more interoperability than today's "open web."

  11. bitsucker: I agree with most of this. I'm not sure it requires an architectural change though. There is an object corresponding to each "abstract tag" in FluidDB (an abstract tag, in my terminology, being roughly the set of tags sharing a name). So if you use a tag bitsucker/rating, there is an object corresponding to the (abstract) tag bitsucker/rating, and you could attach a schema specifier on there.

    There's a broad range of views on the importance of taxonomies in FluidDB. Personally, I veer towards the "librarian end", but others are more relaxed.

    Anyway, thanks for the comment...we shall see how it evolves.

  12. Anonymous, 31/8/09 02:42

    Yup, I encountered that "abstract tag" construct right after posting my previous comment (I'm bitsucker, btw). I agree that this would be a natural place to identify the datatype for a tag.

    But I still think this just begs the question: if different apps design their own mechanisms for describing their data-types, third-party mashups will be forced to develop and maintain "interop glue" for each new convention.

    Perhaps this is just me leaning towards the "librarian end" myself, but my intuition is that some sort of standard for schema metadata would be hugely beneficial. Whether it is designed into the DB itself, or adopted by convention "after market," someone will need to establish a canonical way to encode datatypes -- to say, for example, "this is a value from 0-10 indicating the owner's opinion of the quality of the tagged object." Without this, the data will necessarily be unstructured soup.

    If you agree with that, then I further suggest that it'd be preferable to at least offer a suggested "canonical encoding" for this metadata. Even an RFC with a starting proposal for this could help to herd the community towards an eventual standard. You have the opportunity now, in the beginning, to guide the users in how they construct and annotate their data. If you wait and hope that the community will converge on its own standard, it may be much less likely to succeed.


    I should confess, of course, that I'm just getting started in reading about your project and its community; I may be unaware of some discussion that's already been shared in this area. If so, please set me straight! And congrats on coming this far with a very exciting project; I hope it lives up to its full potential.

  13. How does FluidDB help with the recommendation side of this ("which of your friends seem to like the same novels as you")? The article's great as far as it goes, but I'm intrigued how "this is exactly the kind of thing FluidDB was built for" is the case, and how it could help me to answer similar questions.

  14. Anonymous, 28/1/10 21:13

    The "canonical" (persistent) identifier in the publishing industry is increasingly the Digital Object Identifier (DOI); see http://doi.org

    The fascinating thing for me about FluidDB is that a few of us active in the DOI/Handle System community over the last decade++ have imagined associating metadata with objects in this fashion, but the infrastructure has been too heavy. With FluidDB, it isn't!

