22 March 2010

The Perfect About Tag: Books in FluidDB (redux)

[UPDATE 28th March 2010: I have now added a books submodule to the python abouttag module, and updated the examples below to reflect the syntax for that.

Simple usage is:

from abouttag.books import book
print book(u'Fugitive Pieces', u'Anne Michaels')

book:fugitive pieces (anne michaels)

Add extra authors as extra arguments (see examples below).]

I want to tag my favourite book. I want to proclaim my enduring love of Fugitive Pieces, by Anne Michaels, so that no man, woman or child can ever have any doubt that this, for me, is the finest novel ever written. I want to give this book a rating of a perfect 10.

njr/rating = 10

So where do I put it?

I previously suggested (fool that I am) that a good place to put it might be on an object whose about tag is “isbn:0 7475 3282 6”, for that is the ISBN number of the well-thumbed copy in front of me, and what better way of identifying a book could there possibly be than an International Standard Book Number?

Well, several, it transpires. The problem is actually apparent as soon as you look for the ISBN number. My copy actually says:

ISBN 0 7475 2939 6 (hardback)
ISBN 0 7475 3282 6 (paperback)

How fantastic is that? Not one unique International Standard Book Number, but two.

It gets worse.

If I go to http://amazon.co.uk I find

ISBN-10: 0747534969
ISBN-13: 978-0747534969

on one edition,

ISBN-10: 0747529396
ISBN-13: 978-0747529392

on another

ISBN-10: 0747599254
ISBN-13: 978-0747599258

on another, and

ISBN-10: 0747590095
ISBN-13: 978-0747590095

on another. If I go to http://amazon.com, I find

ISBN-10: 0679776591
ISBN-13: 978-0679776598

What about http://amazon.ca? After all, Anne Michaels is Canadian.

ISBN-10: 0771058829
ISBN-13: 978-0771058820

This illustrates a number of interesting points.

  • ISBN numbers are (it transpires) at the wrong level of the book hierarchy. There has been some excellent work under the monicker of FRBR (Functional Requirements for Bibliographic Records [1]). Among other things, this distinguishes between four levels in the ‘book’ hierarchy:

    • The work — In this case, Fugitive Pieces by Anne Michaels. (This is the level we probably want in most cases for FluidDB.)
    • An expression — Something like a rendering or translation, or concrete form of a work. In the case of a book, typically a text, in a particular language.
    • A manifestation — an edition, by a publisher; it transpires that ISBN numbers identify manifestations (editions), not works or expressions.
    • An item — the 294-odd pages between soft covers sitting on the desk in front of me is an item — a physical book; an instance of a manifestation, in geek-speak.
  • Even if we were interested in tagging particular editions (manifestations) of a book, there would still be some normalization questions.

    • Inside my book (‘item’), the ISBN number for the paperback edition (which mine is) is listed as

      ISBN 0 7475 3282 6

    • On the back cover, above the bar code, it is listed as

      ISBN 0-7475-3282-6

    • On amazon.co.uk, this edition would presumably be listed as:

      ISBN-10: 0747532826
      ISBN-13: 978-0747532828

      if it were there at all.
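
The bare ISBN-10 and ISBN-13 forms quoted above are related by a simple rule: prefix 978, drop the old check digit and compute a new one. Here is a minimal sketch of that rule in Python 3 (the other snippets in this post are Python 2). The function name isbn10_to_isbn13 is made up for illustration and is not part of abouttag. Because the check digit is recomputed, the last digit of the two forms need not be the same.

```python
def isbn10_to_isbn13(isbn10):
    """Convert an ISBN-10, however it is spaced or hyphenated, to 13 bare digits."""
    # Keep only digits and a possible 'X' check character; this also
    # discards a leading 'ISBN' label, spaces and hyphens.
    chars = [c for c in isbn10.upper() if c.isdigit() or c == 'X']
    if len(chars) != 10:
        raise ValueError('expected 10 ISBN characters, got %d' % len(chars))
    # Drop the ISBN-10 check character and add the 978 prefix.
    body = '978' + ''.join(chars[:9])
    if not body.isdigit():
        raise ValueError('X is only valid as the final check character')
    # ISBN-13 check digit: weights alternate 1, 3 over the first 12 digits.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(body))
    return body + str((10 - total % 10) % 10)


# The hardback ISBN printed inside the book, as Amazon lists it:
print(isbn10_to_isbn13('ISBN 0 7475 2939 6'))   # 9780747529392
```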

So even here, if people want to tag the same object, some normalization is required. It wouldn’t really matter which form we adopted, as long as we chose one. For the sake of argument,

ISBN 0-7475-3282-6

Requirements for the Perfect About Tag

So if we are constructing a convention for about tags for some category of items, the following might be desirable:

  • The tag should sit at the right level of relevant hierarchies — ideally, there should be a one-to-one correspondence between the different items in the category and about tags.
  • Trivial formatting differences should be removed by normalization: for example, in the case of ISBN numbers, adopting a convention such as ISBN (in capitals) followed by SPACE followed by SINGLE DIGIT followed by HYPHEN followed by FOUR DIGITS followed by HYPHEN followed by FOUR DIGITS followed by HYPHEN followed by SINGLE DIGIT.
  • It should be easy to determine the relevant about tag for a given object. This applies equally to tagging (information creation) and finding (information retrieval): in both cases, if we know what it is we’re talking about, it should be easy to figure out which FluidDB GET will retrieve it, or which PUT transaction will tag it.
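
As an illustration of the normalization requirement, here is a rough Python 3 sketch that maps the three ways my paperback's ISBN is written onto the one convention suggested above. The name canonical_isbn10 is hypothetical, and the fixed 1-4-4-1 grouping is simply the convention from this post: real ISBN hyphenation varies with the publisher, so this is not a general-purpose formatter.

```python
import re


def canonical_isbn10(text):
    """Render a common writing of an ISBN-10 as 'ISBN d-dddd-dddd-d'."""
    # Drop an optional leading 'ISBN' or 'ISBN-10:' label.
    stripped = re.sub(r'(?i)^\s*ISBN(-10)?:?', '', text)
    # Remove the spaces and hyphens used as separators.
    chars = re.sub(r'[\s-]', '', stripped).upper()
    # Nine digits followed by a digit or 'X' check character.
    if not re.fullmatch(r'\d{9}[\dX]', chars):
        raise ValueError('not an ISBN-10: %r' % text)
    return 'ISBN %s-%s-%s-%s' % (chars[0], chars[1:5], chars[5:9], chars[9])


for form in ('ISBN 0 7475 3282 6', 'ISBN 0-7475-3282-6', 'ISBN-10: 0747532826'):
    print(canonical_isbn10(form))   # ISBN 0-7475-3282-6 each time
```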

Books

If the ISBN number is not really a suitable basis for an about tag for books, what might be? After quite a lot of reading, it’s not obvious to me that anything exists today that really comes close to satisfying the requirements above.

However, if we narrow our scope a little, say to western-language books, we can start to think about the following:

  • A book, at the top (work) level of the hierarchy, seems to me to be identified by a title and an author. So perhaps basing the about tag on those two directly would make sense.
  • It seems certain that a degree of normalization will be helpful, as a minimum reducing ambiguity of punctuation, different kinds of dashes etc. And (at the risk of trampling all over cultural sensitivities) there are clearly some pros and cons associated with more extreme normalization, such as adopting a standard case, possibly removing accents (in a well-defined way) etc. (Note, this is purely for the purpose of defining a normalized about tag; there is no suggestion here that either author or title should be stored in anything but their full unicode glory.)
  • Clearly, there are various ways in which clashes and ambiguities could occur with such a scheme. (How are subtitles handled? What about editions? What about books with the same name and author? Or even the same name and author after normalization?) But we have to start somewhere.

Starting from these two simple ideas, we might construct a conceptual about tag of the general form:

book:title (author)

If we initially assume no normalization and that the author will appear as on the cover of the book, Fugitive Pieces would then become

book:Fugitive Pieces (Anne Michaels)

Normalization

Sadden me though it does, on some reflection I think that the benefits of fairly severe normalization outweigh the disadvantages. From my research to date, it appears that the nearest there is to a standard for normalizing bibliographic information is work done by NACO, the Name Authority Cooperative Program of the Program for Cooperative Cataloging [2]. There appears to be a relatively well-defined process for ‘NACO normalizing’ a bibliographic record. This is described in some detail in a paper by Thomas Hickey et al. [3]. There is also sample code available in both Python and Java [4], though it appears to be subject to somewhat restrictive license terms.

For details of how NACO normalization works, the reader is referred to Hickey et al.’s paper, but the core ideas are:

  1. all text is converted to lower case;
  2. most punctuation is removed;
  3. all diacritics (accents) are removed, together with the character they modify;
  4. multiple leading and trailing whitespace is removed.

To me, 1, 2 and 4 all seem, while ugly, to confer some obvious benefits. I can also see a reasonably strong case for removing diacritics (accents), in terms of increasing the likelihood of correct matches. I am, however, amazed and horrified at the idea that not only should diacritics be removed, but the letter(s) they modify should be discarded also. This seems to me to be not so much of questionable benefit as positively harmful. I should perhaps have added to the embryonic requirements for the perfect about tag:

  • Comprehensibility: as well as being easy to work out what the appropriate about tag for a given object is, it should be fairly easy to work out the identity of the object to which a given about tag corresponds.

I therefore propose a NACO-like normalization that attempts to follow the same rules as NACO, except that it retains wherever possible the letters modified by accents. So é, è and ê all become e, æ becomes ae, ø becomes o and so forth. I realise this will be painful to those whose languages make more extensive use of diacritics than does English, and for whom an ø is no more reducible to an o than is a ≠ to an =, and can only plead base pragmatism in my defence. (I have an implementation of this which I will add to the abouttag package previously mentioned when I start publishing to FluidDB with it; but part of the point of this post is to see whether anyone has different ideas; so I’m not pushing it yet.)

With this, our normalized form for Fugitive Pieces becomes

Fugitive Pieces, by Anne Michaels:
book:fugitive pieces (anne michaels)

(Note that the title and author are normalized, but the entire tag is not, so the colon and parentheses survive.)

To list a few others, we get:

La Bête Humaine, by Émile Zola:
book:la bete humaine (emile zola)
Z, by Vassilis Vassilikos:
book:z (vassilis vassilikos)
Pudd’nhead Wilson, by Mark Twain:
book:puddnhead wilson (mark twain)
One Hundred Years of Solitude, by Gabriel García Márquez:
book:one hundred years of solitude (gabriel garcia marquez)
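
To make the proposal concrete, here is a rough Python 3 sketch of a NACO-like normalization that keeps the letters under the accents, together with the assembly of the about tag. It is not the abouttag implementation and not NACO's reference code; the names normalize, book_tag and SPECIAL are made up for illustration. It assumes that apostrophes are deleted outright and that all other punctuation turns into a space, which is what the examples in this post suggest but is not checked against the NACO rules. The SPECIAL table, for letters that do not decompose into a base letter plus an accent, is deliberately tiny and incomplete.

```python
import re
import unicodedata

# Letters with no combining-mark decomposition need an explicit mapping.
# Illustrative only; a real table would be much longer.
SPECIAL = {'æ': 'ae', 'œ': 'oe', 'ø': 'o', 'ß': 'ss', 'đ': 'd', 'ł': 'l'}


def normalize(text):
    """Lower-case, fold accents onto base letters, strip punctuation."""
    text = text.lower()
    text = ''.join(SPECIAL.get(c, c) for c in text)
    # Decompose accented letters, then drop only the combining marks,
    # so that the base letter survives.
    text = unicodedata.normalize('NFKD', text)
    text = ''.join(c for c in text if not unicodedata.combining(c))
    # Assumption: apostrophes (straight or curly) vanish without a trace...
    text = re.sub("['\u2018\u2019]", '', text)
    # ...while any other punctuation becomes a space.
    text = re.sub(r'[^\w\s]', ' ', text)
    # Trim the ends and collapse runs of whitespace.
    return ' '.join(text.split())


def book_tag(title, *authors):
    """Build 'book:title (author; author)' from normalized parts."""
    names = '; '.join(normalize(a) for a in authors)
    return 'book:%s (%s)' % (normalize(title), names)


print(book_tag('La Bête Humaine', 'Émile Zola'))
print(book_tag('Pudd\u2019nhead Wilson', 'Mark Twain'))
```

Only the title and the author names pass through normalize; the colon, parentheses and semi-colons are added afterwards, which is why they survive in the tag.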

Other Issues: Multiple Authors, Subtitles, Editions, Editors etc.

This proposal has been far too quickly thrown together to be remotely definitive, but I can make a few suggestions regarding some of the obvious issues that will come up.

Multiple Authors

While it is extremely rare for novels to have more than one author, non-fiction frequently does. So I simply propose to list them in order, separated by semi-colons, with a space. Thus:

The Feynman Lectures on Physics, by Richard P. Feynman, Robert B. Leighton and Matthew Sands:

from abouttag.books import book
print book(u'The Feynman Lectures on Physics',
           u'Richard P. Feynman', u'Robert B. Leighton',
           u'Matthew Sands')

book:the feynman lectures on physics (richard p feynman; robert b leighton; matthew sands)

Initials vs. Names and Forename Surname vs. Surname, Forename

I propose quite simply that we use whatever the authors use on their books; it seems that almost all authors have a standard form for their name in publications, and even where they don’t (e.g. Iain M. Banks vs. Iain Banks) the distinction is careful and meaningful. I further propose that we put forename or initials before surname, and that initials are separated by spaces. So:

Lady Chatterley’s Lover, by D. H. Lawrence:

print book(u"Lady Chatterley's Lover", u'D.H. Lawrence')
book:lady chatterleys lover (d h lawrence)

Subtitles

Subtitles are part of the title, so should just be included, in my view. Unfortunately, the normalization will strip out any colon, but in doing so it will standardize. So:

Gödel, Escher, Bach: An Eternal Golden Braid, by Douglas R. Hofstadter:
print book(u"Gödel, Escher, Bach: An Eternal Golden Braid",
           u'Douglas R. Hofstadter')

book:godel escher bach an eternal golden braid (douglas r hofstadter)

Editions

It seems to me that the usual state of affairs is that we don’t care so much about editions; but sometimes we do. Once again, if you want to be more specific, add in the edition. I suggest only that if you plan to use both, edition should precede volume (q.v.).

The Oxford English Dictionary, (second edition), editors: John Simpson & Edmund Weiner

print book(u'The Oxford English Dictionary: second edition',
           u'John Simpson', u'Edmund Weiner')
book:the oxford english dictionary second edition (john simpson; edmund weiner)

Volumes

Most commonly, when I rave about The Feynman Lectures on Physics, it is the entire opus that I am discussing; but sometimes I actually want to refer to a particular volume. This works perfectly in the scheme: in the usual course of events, omit the volume; if you specifically want to refer to volume I, put it in:

The Feynman Lectures on Physics, (Volume I) by Richard P. Feynman, Robert B. Leighton and Matthew Sands:

print book(u'The Feynman Lectures on Physics, Volume I',
           u'Richard P. Feynman', u'Robert B. Leighton',
           u'Matthew Sands')

book:the feynman lectures on physics volume i (richard p feynman; robert b leighton; matthew sands)

or, combining edition and volume:

The Oxford English Dictionary, (second edition) Volume 3, editors: John Simpson & Edmund Weiner

print book(u'The Oxford English Dictionary: second edition, volume 3',
           u'John Simpson', u'Edmund Weiner')

book:the oxford english dictionary second edition volume 3 (john simpson; edmund weiner)

Editors

Rather deliberately, the parenthetical authors don’t actually specify that they are authors. So I would just use them identically for editors, as with the OED above.

4 comments:

  1. looks good... a couple tweaks...

    "An manifestation — ad edition, by a publisher; it transpires that ISBN numbers identify expressions (editions), not works or expressions.

    A item — the 294-odd pages between soft covers sitting on the desk in front of me is an item — a physical book; an instance of a manifestation, in geek-speak."

    i think should read...

    "A manifestation — an edition, by a publisher; it transpires that ISBN numbers identify manifestations (editions), not works or expressions.

    An item — the 294-odd pages between soft covers sitting on the desk in front of me is an item — a physical book; an instance of a manifestation, in geek-speak."



    And I applaud the digging into FRBR.

    I'm not convinced of the normalization of book title though - I suspect there is something that will require less parsing/shaping... not unlike APA or Chicago style for citations...

    More directly, I suspect you'll see ISBNs used and then some meta information laid atop them (same as, see also, etc) to aggregate. ISBN is the low hanging fruit here and ripe for use - even if (and I agree), not the 'right' answer.

  2. Thanks, Terrell. Spot on with the corrections, of course. Fixed now, I think.

    As you can probably tell, I'm not sure about the normalization either. I think I am fairly convinced that *some* normalized version of title plus author is quite a good scheme, though. At first, I liked ISBN, and of course, I agree, it's there, people will use it, and I'm sure you're right about same-as tags or whatever. But since learning just how specific an ISBN reference is, I've gone off them quite a lot. There's also a practical problem, which is that they're not that easy to find. I have 1,000 books (metadata) in digital form, but I haven't a good programmatic service to let me find the ISBNs yet. So while ISBN is shorter, I'm not sure it's easier. I'd quite like to put up a service that produced normalized about tags, which people could just use through a web service or a web page...might actually be easier than ISBN.

    It'll be interesting to see how it plays out.

    Thanks again.

  3. Good stuff. ISBNs, while useful in some instances, pretty much suck for anything else. As you've no doubt determined, they're designed and optimized for the identification of _saleable_ items. The idea being that if a customer asks for a specific ISBN, the item you order for them will be the exact same one they're expecting: same cover art, same binding, et cetera.

    As a side effect, I think ISBNs were also invented in order to create a mechanism through which a small cadre of central players can exercise significant control over the publishing industry as a whole, and extract fees thereby. Did you know, for instance, that ISBNs have to be _renewed_ every year for something like $25 a pop? It's a non-trivial expense for a small publisher who produces--purely hypothetically you understand--ePub, Kindle, and paperback versions of their titles.

    And you're right, it's totally at the wrong level of hierarchy for any other purposes. I love what you've outlined here, if for no other reason than you've obviously put considerable thought into getting it right (which I respect) and because you've saved me the bother of doing so.

    What we need now, IMHO, is a campaign to get FluidDB identifiers recognized within the industry as a better (free; more flexible; etc) identifier than the ubiquitous ISBN. We should find someone inside Amazon who can programmatically create well-formed FluidDB entries for everything in their catalog. Hm...

  4. Thanks, Jason. I didn't know about ISBN renewal issues on top of everything else.

    Believe me, there's nothing we'd like more than to get Amazon to push their catalogue into FluidDB. The idea would be they'd use an amazon.com namespace (for example) and could add pointers, price etc. It would be fabulous. Maybe it will even happen.

    On a smaller scale, Nicholas Tollervey (@ntoll) imported all of BoingBoing's articles the other day and we're hoping BoingBoing might start routinely dumping everything into FluidDB too. If a few people start doing that and we gain some momentum, it could go anywhere...
