27 December 2011

The British Library Catalogue / British National Bibliography

I have added to Fluidinfo information on approximately 2.5 million books drawn from the roughly 3 million records in the British National Bibliography, which documents the British Library’s Catalogue.

As ever, I have used the book-u convention (implemented using the Python abouttag library) to select about tags for the objects, and have tagged the books in Fluidinfo under the book user. Data specific to the British National Bibliography (BNB) is stored in the namespace book/bnb, while more generic data (derived from the information contained in the Bibliography) is stored directly in the book namespace.

Here is an example of a book that has been augmented with data from the British National Bibliography. The book is George Orwell’s Animal Farm, and it is illustrated using the About Tag visualizer. (If you can’t see the picture below, upgrade to the latest version of your browser or see here for information on why you might be having trouble.) The green tags are the new ones.

fluidinfo 1529c459-f3f2-45e1-90f4-3ff3040ad6df
    alice/comment="So disappointing."
    alice/has-read
    alice/likes=False
    alice/rating=2
    bert/comment="What a book: I love it!"
    bert/has-read
    bert/rating=8
    book/author="George Orwell"
    book/bnb/contributors={}
    book/bnb/creator="Orwell, George, 1903-1950"
    book/bnb/id={"GB9689279", "GBB005647", "GB8416414", "GBA0Y6010", "GB9330497", "GB7301513"}
    book/dewey={"823.912", "823/.912", "823/.9/1"}
    book/isbn={"070898200X", "185715150X", "0582275245", "0582434475", "0435121650", "978141...}
    book/r=0.170613118849
    book/source={"BNBrdfdc13.xml-201011150#088316", "BNBrdfdc13.xml-201011150#029879", "BNBrdf...}
    book/title="Animal farm"
    fluiddb/about="book:animal farm (george orwell)"
    girafind/books/author={"George Orwell"}
    girafind/books/language="["$_english"]"
    girafind/books/title="Animal Farm"
    miro/books/author="George Orwell"
    miro/books/forename="George"
    miro/books/guardian-1000
    miro/books/surname="Orwell"
    miro/books/title="Animal Farm"
    miro/books/year=1945
    miro/class="record"
    njr/index/about
    njr/rating=10
    otoburb/has-read
    otoburb/rating=8

Notice that, because of the careful normalization inherent in the book-u convention, where the book is already in Fluidinfo, the new data has generally been added to the existing object corresponding to that book, as in the case above.

The core data that should almost always be present is:

  • the about tag fluiddb/about, normalized using the book-u convention:

    book:animal farm (george orwell)

  • the book/author tag, containing the best author information I was able to extract, in this case

    George Orwell

    Where there is more than one author, they are generally shown separated by commas, with the last joined with an and (with no Oxford Comma). For example, The Feynman Lectures on Physics, by Feynman, Leighton and Sands has

    $ fish show -a 'book:the feynman lectures on physics (richard p feynman; robert b leighton; matthew l sands)'
    Object with about="book:the feynman lectures on physics (richard p feynman; robert b leighton; matthew l sands)":
      /book/author = "Richard P. Feynman, Robert B. Leighton and Matthew L. Sands"

    or, graphically:

    fluidinfo aeaa654c-35b0-4b00-866b-c7deda8959c4
        book/author="Richard P. Feynman, Robert B. Leighton and Matthew L. Sands"
        book/bnb/contributors={"Leighton, Robert B.", "Sands, Matthew L. (Matthew Linzee)"}
        book/bnb/creator="Feynman, Richard P. (Richard Phillips), 1918-1988."
        book/bnb/id={"GBA901036", "GBA645628"}
        book/dewey={"530"}
        book/isbn={"0805390499", "0805390669"}
        book/r=0.893236319082
        book/source={"BNBrdfdc16.xml-201011150#096269", "BNBrdfdc14.xml-201011150#132062"}
        book/title="The Feynman lectures on physics"
        fluiddb/about="book:the feynman lectures on physics (richard p feynman; robert b leighton; m..."

    The book/author tag has had a lot of processing done to it, as described below.

  • the book/title field, which is usually almost identical to that in the BNB data. In this case it is:

    Animal farm

    I have not altered the capitalization, which is therefore generally consistent with some entry in the BNB database (though I would really prefer it were in Title Case).

  • the book/source tag shows where the base data was taken from. This tag’s value is a set of strings, each of which corresponds to an entry in one of the 17 files from which the BNB data was extracted. The entries consist of

    • the name of the file (always BNBrdfdcNN.xml) where NN runs from 01 to 17
    • a dash -
    • the datestamp on that file (always 20101115 at present)
    • the digit zero (0) and a # sign
    • the record number in the file, starting from 1, with six digits.

    Since multiple bibliographic entries can correspond to the same work, there is sometimes more than one of these.

  • the book/r tag is a pseudo-random floating point value with 0.0 ≤ book/r < 1.0.
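
Two of the conventions described above (the layout of a book/source entry and the way multiple authors are joined in book/author) can be sketched in Python. This is only an illustration based on the description and examples in this post, not the code used for the import; the names SourceRef, parse_source and join_authors are hypothetical.

```python
import re
from typing import List, NamedTuple


class SourceRef(NamedTuple):
    filename: str   # e.g. "BNBrdfdc13.xml"
    datestamp: str  # e.g. "20101115"
    record: int     # record number within the file, starting from 1


# File name, dash, 8-digit datestamp, the digit zero and a # sign,
# then a six-digit record number (layout as described in the post).
SOURCE_RE = re.compile(r"^(BNBrdfdc\d{2}\.xml)-(\d{8})0#(\d{6})$")


def parse_source(entry: str) -> SourceRef:
    """Split one book/source string into its parts (hypothetical helper)."""
    match = SOURCE_RE.match(entry)
    if match is None:
        raise ValueError(f"not a book/source entry: {entry!r}")
    filename, datestamp, record = match.groups()
    return SourceRef(filename, datestamp, int(record))


def join_authors(authors: List[str]) -> str:
    """Comma-separate authors, joining the last with 'and' (no Oxford comma)."""
    if len(authors) <= 1:
        return "".join(authors)
    return ", ".join(authors[:-1]) + " and " + authors[-1]


print(parse_source("BNBrdfdc13.xml-201011150#088316"))
print(join_authors(["Richard P. Feynman", "Robert B. Leighton", "Matthew L. Sands"]))
```

The regular expression is deliberately strict, so any entry that departs from the documented layout raises an error rather than being silently misread.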

Some of the raw data has also been added, with almost no cleaning up, under the book/bnb namespace. The BNB data uses the Dublin Core metadata standard, and includes:

  • bnb/creator, which is the person or organization primarily responsible for the creation of the work. This is sometimes blank, and is stored as a single string value.
  • bnb/contributors, which is a list of contributors, sometimes including the creator and sometimes not.
  • bnb/dewey is the set of Dewey Decimal classifications found on the records corresponding to this book.
  • bnb/isbn is the set of international standard book numbers found on the records corresponding to this book.
  • bnb/id is the set of British Library IDs found on the records corresponding to this book. (I’m not entirely clear what this identifier is, but it appears to be important and well populated.)

Other information is available in the data (including classification information), and I will probably extract this and add it at a later time.

Finding, Inspecting and Tagging Books in Fluidinfo

There are multiple ways of retrieving book data from Fluidinfo and of tagging it.

  • Probably the easiest and most general method is to go to http://artoftagging.com and do a search that involves a book and some keywords from the title and/or author. A list of results should come back and you can see a visualization of any of them by clicking the link. If you have a Fluidinfo account, you can create an account at artoftagging.com and then save your Fluidinfo details there. Once logged in, you will then be able to add your own tags to any object you find.

  • If you just want to construct the about tag for a book, you can do that using the online version of the Fluidinfo Shell, Fish. Once there, type, for example:

    fish> about book "Animal Farm" "George Orwell"
    book:animal farm (george orwell)
    fish> about book "The Feynman Lectures on Physics" "Richard P. Feynman" "Robert B. Leighton" "Matthew L. Sands"
    book:the feynman lectures on physics (richard p feynman; robert b leighton; matthew l sands)

    (The quotes tell Fish that “Animal Farm” is the title and “George Orwell” a single author.) Alternatively, you can download and install Fish on your own machine. (It is available from GitHub.) You can then type the same commands, after fish, e.g.:

    $ fish about book "Animal Farm" "George Orwell"
    book:animal farm (george orwell)

    You can then use any Fluidinfo tool, including the new Object Browser, to work with that object, signing in with Twitter if you like.

  • Another easy way of finding an about tag for a book is to find it on Amazon (US or UK, for now) and use the az-fish bookmarklet available at the top of the online Fish (drag it to your browser’s toolbar). The bookmarklet will take the item on the current Amazon page and issue the appropriate Fish command to find the about tag. (You don’t need to log into Fish or Fluidinfo to do this.)
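
To give a feel for what the about command is doing, here is a deliberately simplified Python sketch that reproduces the two Fish outputs shown above. It is not the abouttag library and does not implement the full book-u convention, which has to cope with many cases this ignores; the function names normalize and book_about are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lower-case, drop punctuation, and collapse runs of whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def book_about(title: str, *authors: str) -> str:
    """Build an about tag shaped like the Fish output shown above.

    Simplified sketch only: authors are normalized individually and
    joined with "; " inside the parentheses.
    """
    names = "; ".join(normalize(author) for author in authors)
    return f"book:{normalize(title)} ({names})"


print(book_about("Animal Farm", "George Orwell"))
print(book_about("The Feynman Lectures on Physics",
                 "Richard P. Feynman", "Robert B. Leighton", "Matthew L. Sands"))
```

For real work, use Fish or the abouttag library itself, so that the tags you generate agree with those already in Fluidinfo.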

The Hierarchy of Books: Works and Manifestations

The International Federation of Library Associations (IFLA) describes a hierarchy of four kinds of “book” entities in its report Functional Requirements for Bibliographic Records. These are:

  • works
  • expressions
  • manifestations
  • items.

Quoting from that report:

“The entities defined as work (a distinct intellectual or artistic creation) and expression (the intellectual or artistic realization of a work) reflect intellectual or artistic content. The entities defined as manifestation (the physical embodiment of an expression of a work) and item (a single exemplar of a manifestation), on the other hand, reflect physical form.”

Loosely, a work is the conceptual book, usually described by the combination of a title and author—Animal Farm by George Orwell.

The report describes an expression of a work as “the intellectual or artistic realization of a work in the form of alpha-numeric, musical, or choreographic notation, sound, image, object, movement, etc., or any combination of such forms.” Thus George Orwell’s Animal Farm can be translated into different languages, laid out differently, typeset on pages, or in digital form, or recorded as spoken words, and these correspond to different expressions of that same book. There may also be different editions, printings etc., which may have slightly different content. Again, these are different expressions of the same conceptual work. (Occasionally, expressions may encompass several works, such as in the case of compendia.)

Moving down the hierarchy, a manifestation is a particular rendering of a work into physical form — “the physical embodiment of an expression of a work.” Note that “[A]s an entity, manifestation represents all the physical objects that bear the same characteristics, in respect to both intellectual content and physical form.” Thus, all the copies of the same printing of the same edition of Animal Farm that are essentially indistinguishable collectively correspond to a manifestation of George Orwell’s Animal Farm.

Finally, an item is an individual copy of a book: “a single exemplar of a manifestation.”

The entries in the British Library’s catalogue correspond literally to items and conceptually to manifestations, whereas the objects to which I have attached the data in Fluidinfo correspond to works. This is why the c. 3 million records reduce to c. 2.5 million Fluidinfo objects, and why some of the objects have multiple ISBNs etc. It is entirely possible to create further objects at the level of manifestations (and even items, if someone really wants to do so), and even more so at the level of expressions, but I have not done this yet.

The reason I have concentrated on works rather than manifestations is that this seems much the most important level to represent in a system like Fluidinfo: with important exceptions, when people want to rate or comment on a book, it is most often the work, rather than the manifestation, that they are interested in. Moreover, collecting together information about the different ISBNs associated with a single work is positively helpful. That is not to say that there isn’t a case for creating other objects at the level of expressions or manifestations.
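
The collapse from manifestation-level records to work-level objects can be pictured as a grouping by about tag. The sketch below is illustrative only and is not the import code; the record layout (an about tag paired with an ISBN) and the name group_by_work are assumptions, while the ISBNs are taken from the two example objects earlier in the post.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

FEYNMAN = ("book:the feynman lectures on physics "
           "(richard p feynman; robert b leighton; matthew l sands)")


def group_by_work(records: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each about tag (one per work) to the set of ISBNs seen for it."""
    works: Dict[str, Set[str]] = defaultdict(set)
    for about, isbn in records:
        works[about].add(isbn)
    return dict(works)


# Illustrative manifestation-level records: (about tag, ISBN).
records = [
    ("book:animal farm (george orwell)", "070898200X"),
    ("book:animal farm (george orwell)", "185715150X"),
    ("book:animal farm (george orwell)", "0582275245"),
    (FEYNMAN, "0805390499"),
    (FEYNMAN, "0805390669"),
]

works = group_by_work(records)
print(len(records), "records ->", len(works), "works")
```

Because the about tag is the grouping key, any two records whose titles and authors normalize to the same string land on the same object, which is why set-valued tags such as book/isbn are the natural fit.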

Further Work

There is a great deal more that can be usefully done with the fabulous data from the British Library. While I am not committing to doing these, tasks on my list include:

  • Authors. Creating an object corresponding to each creator/author/contributor. I plan to use about tags of the form author:normalized name (birth-year) for these, e.g. author:George Orwell (1903). The required data is largely available in the BNB dataset. I would then plan to add a book/related-authors tag to each book, pointing to its authors’ objects and, on the author objects, corresponding sets of book/related-books tags pointing back to their works.

  • Upload Checking. Checking that everything uploaded OK. I count 2,558,738 unique books (as works) in the BNB dataset, and I appeared to upload all of these successfully (getting HTTP 204 statuses back from Fluidinfo). However, when I count objects having a book/r tag, I get only 2,468,661, a shortfall of 90,077.

    Whether this indicates a problem is unclear. If I count the books that have a book/source but no book/r, using the query

    has book/source except has book/r

    Fluidinfo reports 18,921 such books. As far as I can tell, however, every object it returns does in fact have a book/r, so it appears that Fluidinfo is having some difficulty executing some queries correctly at the moment.

  • About Tag Checking. I had to use some fairly hairy code to coerce the BNB data into the correct form to generate canonical about tags in the book-u convention, and it has definitely failed in some cases. For example, I have seen at least one example where the surname of an author in the BNB data preceded the forename but without a comma, so that forename and surname will have been reversed. To the extent that I can detect these problems, I will try to fix them.

  • Recent additions. I believe the British Library has issued updates with recent additions (since November 2010); I certainly plan to get that data and import it in a similar fashion, and then to set up a cron job to do that regularly. In this way, I hope the dataset will be living and always current.

  • Categorizations. The BNB data includes subject categories for the records, which I have not imported thus far. I will do so.

  • Year information. There is information about publication dates in the BNB data, but it is not in a very structured form. If I am able to extract it with a satisfactory degree of reliability, I will get this too. Obviously, different manifestations will have different publication dates, so this will probably be a set-valued tag.
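
As a rough illustration of the Authors item above, a creator string such as those shown in book/bnb/creator might be turned into an author about tag along the following lines. This is a sketch under stated assumptions, not a finished design: the regular expression only covers the "Surname, Forenames, birth-death" shapes seen in the two examples in this post (plus an open-ended "birth-" form, which is an assumption), no further name normalization is attempted, and the names CREATOR_RE and author_about are hypothetical.

```python
import re
from typing import Optional

# "Surname, Forenames", an optional "(expanded forenames)", an optional
# ", birth-death" (death year may be absent), and an optional final full stop.
CREATOR_RE = re.compile(
    r"^(?P<surname>[^,]+),\s*(?P<forenames>[^,(]+?)\s*"
    r"(?:\([^)]*\))?\s*"
    r"(?:,\s*(?P<birth>\d{4})-(?:\d{4})?)?\.?$"
)


def author_about(creator: str) -> Optional[str]:
    """Return an 'author:Name (birth-year)' tag, or None if we cannot tell.

    Strings without a comma (e.g. surname and forename in an unknown
    order) or without a birth year are rejected rather than guessed at.
    """
    match = CREATOR_RE.match(creator.strip())
    if match is None or match["birth"] is None:
        return None
    name = f"{match['forenames']} {match['surname']}"
    return f"author:{name} ({match['birth']})"


print(author_about("Orwell, George, 1903-1950"))
print(author_about("Feynman, Richard P. (Richard Phillips), 1918-1988."))
```

Returning None for anything that does not match is intentional: it is the comma-less and otherwise irregular entries, like the one mentioned under About Tag Checking, that are most likely to produce a wrong tag if guessed at.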

Enjoy the data, and let me know if you find problems.

I expect I will write a number of other posts on issues associated with this data.

1 comment:

  1. I've been gone for a while doing other projects and managed to find my way back here. I'm glad to see girafind noted :-) though I think it could be described as being on indefinite hiatus. I'm trying to keep to simpler projects for the time being... working on my programming skills and whatnot.