29 March 2010

The Guardian 1,000 Books Electronically Tagged

Books have held a special place in the affections and history of the FluidDB team from the very beginning, and featured in a remarkably high proportion of Terry’s early motivational examples for FluidDB. Perhaps coincidentally, but perhaps not, at least five of the people involved also took a personal interest in a list of 1,000 novels that everyone must read, which The Guardian newspaper published in January 2009.

So it is with some pleasure that I last night finally published a dataset to FluidDB containing those 1,000 novels, with an object for each. Finally, @terrycojones, @barshirtcliff, @rustlem, @anamosterin and I have objects on which to hang our various terrycojones/has-read and barshirtcliff/rating tags.

The first book in the Guardian 1,000 (sorted alphabetically by author) is The Face of Another by Kobo Abe. This is tagged as follows:

> fdb tags -a "book:the face of another (kobo abe)"
Object with about="book:the face of another (kobo abe)" (id 0fe6c95d-9e6b-45c4-b228-b8fef5c42bff):
  /fluiddb/about="book:the face of another (kobo abe)"
  /miro/books/title="The Face of Another"
  /miro/books/author="Kobo Abe"

or, graphically:


and the last is The Debacle by Émile Zola:

> fdb tags -a "book:the debacle (emile zola)"
Object with about="book:the debacle (emile zola)" (id 49cee61c-85e2-4bce-9380-0c07dc58dc86):
  /fluiddb/about="book:the debacle (emile zola)"
  /miro/books/title="The Debacle"
  /miro/books/author="Émile Zola"

[UPDATE: I should probably have mentioned when I originally posted this that the tags command is not present in the version of fdb on github. (Apologies for this.) I will add it, but it will take a little while. I have two different implementations of fdb: the standalone one on github and one that is a fully integrated part of my Miró software. Command lines starting with > in this posting come from the Miró version; command lines starting with $ are from the standalone version on github. There is actually some method to this madness, but also some scope for confusion, for which apologies. The tags command has since been implemented in 1.26, now on github.]


The convention for about tags that I’ve used is the one described in this previous posting. For what it’s worth, I’m warming to this convention, not least because (unlike the ISBN) if you know the title and author of the book, it is fairly trivial to construct the almost-always unique about tag. The Python library used to generate the about tags is available at github.
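To make the convention concrete, here is a minimal Python sketch of how such an about tag might be built. It is inferred only from the examples shown in this post (lower-casing, stripping accents, and the "book:title (author)" layout); the function name book_about_tag is hypothetical, and the real library on github may well handle more cases, such as punctuation, than this does.

```python
import unicodedata


def book_about_tag(title: str, author: str) -> str:
    """Build a 'book:title (author)' about tag (hypothetical helper).

    Assumption: normalization means removing accents, lower-casing and
    collapsing whitespace, as suggested by the examples in this post.
    """
    def normalize(text: str) -> str:
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", text)
        unaccented = "".join(
            ch for ch in decomposed if not unicodedata.combining(ch)
        )
        # Lower-case and collapse any runs of whitespace to single spaces.
        return " ".join(unaccented.lower().split())

    return f"book:{normalize(title)} ({normalize(author)})"


if __name__ == "__main__":
    print(book_about_tag("La Bête Humaine", "Émile Zola"))
    # book:la bete humaine (emile zola)
```

The point of normalizing this way is that two people who know only the title and author should arrive at the same about tag, whether or not they type the accents.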

Finding books in this list is fairly easy if you have any kind of access to FluidDB. Using fdb, there are various options:

  • To find specifically the books in the Guardian 1,000, simply look for objects with a miro/books/guardian-1000 tag (which is set to True).

    $ fdb count -q 'has miro/books/guardian-1000'
    1000 objects matched
    Total: 1000 objects
  • More generally, to find any book in the published books table, look for objects with (for example) a miro/books/title or miro/books/author tag. At the moment, this is the same set, since I haven’t added anything else; but I will!

    $ fdb count -q 'has miro/books/author'
    1000 objects matched
    Total: 1000 objects
  • To find a specific book, you can obviously query by title, by author, or by both, or use the about tag; because of the normalization on the about tag, that will often be the easiest and most reliable way of retrieving a particular book. Here, for example, are three queries, all of which find (inter alia) Zola’s La Bête Humaine:

    $ fdb show -a 'book:la bete humaine (emile zola)' /about /miro/books/title /miro/books/author
    Object with about="book:la bete humaine (emile zola)":
      /fluiddb/about = "book:la bete humaine (emile zola)"
      /miro/books/title = "La Bête Humaine"
      /miro/books/author = "Émile Zola"
    $ fdb show -q 'miro/books/title="La Bête Humaine"' \
            /about /miro/books/title /miro/books/author
    1 object matched
    Object 58560935-d600-4921-a7d4-389e7bd068b5:
      /fluiddb/about = "book:la bete humaine (emile zola)"
      /miro/books/title = "La Bête Humaine"
      /miro/books/author = "Émile Zola"
    $ fdb show -q 'miro/books/surname="Zola"' /about /miro/books/title /miro/books/author
    4 objects matched
    Object 58560935-d600-4921-a7d4-389e7bd068b5:
      /fluiddb/about = "book:la bete humaine (emile zola)"
      /miro/books/title = "La Bête Humaine"
      /miro/books/author = "Émile Zola"
    Object 49cee61c-85e2-4bce-9380-0c07dc58dc86:
      /fluiddb/about = "book:the debacle (emile zola)"
      /miro/books/title = "The Debacle"
      /miro/books/author = "Émile Zola"
    Object 57590e71-beff-4fab-9a4c-b23b9574dbb3:
      /fluiddb/about = "book:germinal (emile zola)"
      /miro/books/title = "Germinal"
      /miro/books/author = "Émile Zola"
    Object c5bba025-9c4a-4c3b-81df-1b7e1bed4653:
      /fluiddb/about = "book:therese raquin (emile zola)"
      /miro/books/title = "Therese Raquin"
      /miro/books/author = "Émile Zola"

The Table Structure

Like the elements from the periodic table and the planets datasets that I published previously, this dataset was published as a table straight from Miró. Unlike those datasets, though, I haven’t added any record numbers, about-tag links or id links, as it seems to me that record order is completely immaterial here and I expect to update the books dataset regularly. (I’ll probably add the Orange Prize winners next, and perhaps Booker winners; adding the Orange Prize winners will at least get Anne Michaels in.)

Other Notes

The raw data was scraped from the Guardian websites at the time of publication, as described here, though they have more recently produced a definitive list themselves. I corrected a few errors, but largely the data is just as it appeared at the time.

In compiling this list, I have converted the publication year into a number, i.e. the miro/books/year tag is numeric. This required me to decide on an approach for the small number of cases in which the Guardian-supplied year was actually a range of dates; I simply took the earliest year, which obviously leads to a small loss of information. The 25 records affected were:

title                                 author             year
The New York Trilogy                  Paul Auster        1985-86
La Comédie Humaine                    Honoré de Balzac   1830-1848
Epileptic                             David B            1996-2003
Bleak House                           Charles Dickens    1852-53
Little Dorrit                         Charles Dickens    1855-57
The Count of Monte Cristo             Alexandre Dumas    1844-45
Parade’s End                          Ford Madox Ford    1924-28
To the Ends of the Earth trilogy      William Golding    1980-89
The Earthsea series                   Ursula K Le Guin   1968-1990
L’Histoire de Gil Blas de Santillane  Alain-René Lesage  1715-1735
The Chronicles of Narnia              CS Lewis           1950-56
Cairo trilogy                         Naguib Mahfouz     1956-57
The Fortunes of War novels            Olivia Manning     1960-80
The Man Without Qualities             Robert Musil       1930-32
U.S.A.                                John Dos Passos    1930-36
A Dance to the Music of Time          Anthony Powell     1951-75
The Discworld series                  Terry Pratchett    1983-
Remembrance of Things Past            Marcel Proust      1913-27
His Dark Materials                    Philip Pullman     1995-2000
Gargantua and Pantagruel              François Rabelais  1532-34
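
For anyone wanting to reproduce the year conversion, the rule of taking the earliest year can be sketched in a few lines of Python. This is an illustration only, not the code actually used to build the dataset; the function name earliest_year is hypothetical, and it assumes every Guardian-supplied value starts with a four-digit year, as all the values in the table above do.

```python
import re


def earliest_year(raw: str) -> int:
    """Return the leading four-digit year of a value such as '1852-53'.

    Assumption: the value starts with a four-digit year; anything after
    it (the end of a range, or a trailing hyphen as in '1983-') is
    discarded, which is where the small loss of information comes from.
    """
    match = re.match(r"\s*(\d{4})", raw)
    if match is None:
        raise ValueError(f"no leading four-digit year in {raw!r}")
    return int(match.group(1))


if __name__ == "__main__":
    for value in ["1985-86", "1830-1848", "1983-", "1957"]:
        print(value, "->", earliest_year(value))
```

A plain single year passes through unchanged, so the same function could be applied to every record rather than only to the ranges.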

Happy tagging!
