02 January 2011

Update to Abouttag (app and library)

I’ve updated both the abouttag library and the Abouttag app to do a bit more normalization of book titles and author names.

The changes are as follows:

Relocation of Articles (The, A)

It is quite common for titles to be presented with articles at the end, after a comma, to facilitate alphabetical sorting. For example, The Catcher in the Rye would often be written as Catcher in the Rye, The. Although slightly less common, this also happens with the indefinite article, so that A Stitch in Time would become Stitch in Time, A.

The library now has a function, move_article, that moves such articles to the front, so that all of:

  • The Catcher in the Rye
  • Catcher in the Rye, The
  • Catcher in the Rye,The

become

  • The Catcher in the Rye

Similarly, all of

  • A Stitch in Time
  • Stitch in Time, A
  • Stitch in Time,A

become

  • A Stitch in Time

NOTE: At the moment, only the English articles ‘a’ and ‘the’ are relocated, but the code is quite general and uses a list. At some point, I will probably extend the list, and if you download the library it will be trivial for you to do so. The main thing that stopped me from at least adding French was the case of l’ (e.g. l’Alchimiste), which might require fractionally more thought than I want to give it immediately.
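To make the idea concrete, here is a minimal sketch of list-driven article relocation. It is not the library's actual code: the name move_article_sketch, the ARTICLES list and the regular expression are my own assumptions for illustration, and the real move_article may well differ in its details.

```python
import re

# Assumed article list for illustration; extend it to cover other articles.
ARTICLES = ['the', 'a']


def move_article_sketch(title, articles=ARTICLES):
    """Move a trailing ', Article' (comma, optional space) to the front."""
    for article in articles:
        # "<rest>,<optional space><article>" at the very end of the title.
        pattern = r'^(.*\S)\s*,\s*(%s)\s*$' % re.escape(article)
        m = re.match(pattern, title, re.IGNORECASE)
        if m:
            # Keep the article's original capitalization.
            return '%s %s' % (m.group(2), m.group(1))
    # No trailing article found: return the title unchanged (bar whitespace).
    return title.strip()
```

With this sketch, 'Catcher in the Rye,The' comes back as 'The Catcher in the Rye', and a title that already starts with its article passes through untouched.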

Authors

There is also a function, move_surname_to_end, that moves surnames to the end of names (where detectable) and also regularizes initials. So the following variations of J. D. Salinger

  • J. D. Salinger
  • J.D.Salinger
  • J.D. Salinger
  • JD Salinger
  • Salinger,J.D.
  • Salinger, J.D.
  • Salinger, J. D.
  • Salinger, JD

all map to

  • J. D. Salinger

NOTE: Initials with accents are not standardized at present. This would be a fairly simple change, which I expect I will make, but would require a slightly different approach. Relocation should work fine, even with accents.
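For the curious, the following is a rough sketch of how this kind of mapping could be done. It is only an illustration under my own assumptions, not the library's implementation: the name move_surname_to_end_sketch is made up, a comma is the only cue used to detect a leading surname, and a run of one to three unaccented capitals before the final word is treated as initials.

```python
import re


def move_surname_to_end_sketch(name):
    """Map 'Surname, Initials' style names to 'I. N. Surname'."""
    # "Surname, Forenames" -> "Forenames Surname" (only when a comma is present).
    if ',' in name:
        surname, forenames = [part.strip() for part in name.split(',', 1)]
        name = '%s %s' % (forenames, surname)
    # Put a space after every full stop so "J.D.Salinger" splits into tokens.
    tokens = name.replace('.', '. ').split()
    result = []
    for i, token in enumerate(tokens):
        core = token.rstrip('.')
        is_last = (i == len(tokens) - 1)
        # Assumption: 1-3 unaccented capitals before the surname are initials.
        if not is_last and re.match(r'^[A-Z]{1,3}$', core):
            result.extend(letter + '.' for letter in core)
        else:
            result.append(token)
    return ' '.join(result)
```

Because the pattern only matches A to Z, accented initials pass through unregularized in this sketch, although the relocation step still applies to them.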

About Tags

The about tag construction function, book, uses both of these mappings, so if you have the latest version of the library, the following will now work:

$ python
Python 2.6.5 (r265:79359, Mar 24 2010, 01:32:55)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from abouttag.books import book
>>> print book(u'Catcher in the Rye,The', u'Salinger,J.D.')

book:the catcher in the rye (j d salinger)
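As a rough illustration of the final step, here is a sketch that builds a tag of the same shape from a title and author that are already in natural order. The names normalize_sketch and book_tag_sketch are hypothetical, and the rule used (lower-case, replace punctuation with spaces, collapse whitespace) is only my guess at a normalization that reproduces the output shown; the library's actual rules for punctuation such as apostrophes may differ.

```python
import re


def normalize_sketch(text):
    """Lower-case, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r'[^\w\s]', ' ', text.lower())
    return ' '.join(text.split())


def book_tag_sketch(title, author):
    """Build a 'book:title (author)' tag from already-relocated strings."""
    return 'book:%s (%s)' % (normalize_sketch(title), normalize_sketch(author))


tag = book_tag_sketch('The Catcher in the Rye', 'J. D. Salinger')
# tag == 'book:the catcher in the rye (j d salinger)'
```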

These changes seem to me like general improvements to the library, making it more likely people will converge on the same object for a book. I made the changes today, specifically, because the Guardian has just published a list of the 100 best-selling books of the last 12 years (1998–2010). As you might guess, the list presents all titles with articles at the end and authors with surnames before forenames/initials. Depressing though the list is in many respects, I will probably upload the data to FluidDB later; this will be easier with the new version of the library.
