30 January 2011

fdb.py version 1.28

I just pushed a minor revision of fdb.py to github.

One of the tests was failing as a result of an API change (deleting a tag that isn't on an object now returns a 204 (NO CONTENT) rather than a 404 error).

All tests should pass again now. Thanks to Joseph Marques for pointing out the problem.

19 January 2011

The Music of FluidDB I: Albums, Tracks and Songs

I have been thinking for a while about what conventions to use for tagging various kinds of musical entities in FluidDB. The kinds of things I have in mind include recordings of music, pieces of music (compositions), artists and composers. My firmest conclusion so far is that it’s complicated and I can’t tackle it all in one go.

In particular, classical music feels very complicated to me, with a common situation for a classical “record” being recordings of several pieces with somewhat variable names, often by different composers, being played often by a somewhat fluid and ambiguous collection of musicians.

In this post, therefore, I’m going to try to tackle what feels like a simpler problem by restricting myself to considering non-classical music and three kinds of entities—albums, tracks and songs.

My basic suggestion is to adopt conventions very similar to those I have been championing for books, in the form of the book-1 convention.

Books (Recap)

Recall that the book-1 convention for about tags for books in English has the following basic components:

  • the prefix book:
  • the title of the book, normalized using NACO-like conventions, which standardize to lower case, remove most punctuation and accents and regularize spacing;
  • the author, again normalized in a NACO-like manner, in parentheses.

For example, Alice in Wonderland, by Lewis Carroll, uses the about tag

book:alice in wonderland (lewis carroll)

So far this convention seems to have worked quite well. Its virtues include:

  • it is simple to construct with only easily available information (the stuff you can see if you have the book or a normal reference to it)
  • it is unique for almost all books
  • it is clearly identified as a book (and thus disambiguated from a film, for example).

The next stage beyond a single-author book is multi-author books, and there the convention is simply to list the authors, in the order they appear on the book, separated by semicolons. For example, The Feynman Lectures on Physics, by Richard P. Feynman, Robert B. Leighton and Matthew Sands uses the about tag:

book:the feynman lectures on physics (richard p feynman; robert b leighton; matthew sands)
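The recap above can be sketched in a few lines of Python. This is only an approximation written for illustration: the function names `normalize` and `book_about` are hypothetical, the normalization is a simplified stand-in for the NACO-like rules, and it is not the code in the abouttag library. It keeps ampersands (as some of the examples later in this post do) and does not attempt finer points such as splitting initials.

```python
import re
import unicodedata


def normalize(text):
    """Simplified NACO-like normalization (an approximation, not abouttag's code)."""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c)).lower()
    # Delete straight and curly apostrophes outright.
    text = re.sub("['\u2018\u2019]", "", text)
    # Turn other punctuation (except ampersands) into spaces.
    text = re.sub(r"[^\w\s&]", " ", text)
    # Regularize spacing.
    return " ".join(text.split())


def book_about(title, authors):
    """Build a book-1 style about tag; authors is a list in cover order."""
    names = "; ".join(normalize(a) for a in authors)
    return "book:%s (%s)" % (normalize(title), names)


print(book_about("Alice in Wonderland", ["Lewis Carroll"]))
# book:alice in wonderland (lewis carroll)
print(book_about("The Feynman Lectures on Physics",
                 ["Richard P. Feynman", "Robert B. Leighton", "Matthew Sands"]))
```

Passing the authors as a list keeps the question of who the authors are separate from the question of how each name is normalized.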

Albums, Tracks and Songs

Recorded non-classical music consists primarily of albums—a named collection of tracks, normally purchased together—and individual tracks, sometimes known as singles or songs.

At the simplest level, the conventions I am going to propose for about tags for albums and tracks are very similar to those for books but using the prefixes album: and track:. So the album, The Dark Side of the Moon, by Pink Floyd, is

album:the dark side of the moon (pink floyd)

and the track The Great Gig in the Sky, from that same album, is

track:the great gig in the sky (pink floyd)

But there are a number of points to discuss.

Albums

The suggested about tag for albums is fairly straightforward. The main complication/ambiguity I can see concerns multi-volume sets. So, on vinyl, for example, Neil Young’s Decade has three disks; and it is a double CD. This is quite an easy case: I think we ignore the ‘disk’ number entirely and just regard double and triple albums as albums. So all of Decade is:

album:decade (neil young)

For multi-volume collections that are normally sold separately, simply include the volume number. So, for example, The Tatum Group Masterpieces Volume 1, by Art Tatum, Benny Carter, Louis Bellson, becomes

album:the tatum group masterpieces volume 1 (art tatum; benny carter; louis bellson)

The NACO-like normalization conventions were described in this post and are implemented in the abouttag library.

The handling of artists is in principle quite simple, though in practice slightly hard to automate completely. My suggestion is that whenever there is a list of musicians, as with authors, they are simply separated with semicolons (and a space); any ampersands or ands are removed. In the case of groups, the group name is simply used. The interesting and slightly troubling cases are those where a group combines with a person. The most common case of this is exemplified by Diana Ross and the Supremes. My suggestion is that such cases are left intact, other than normalization, using ‘and’ rather than ampersand (&). So the album “Reflections” becomes

album:reflections (diana ross and the supremes)

There are probably awkward corner cases, but I think this handles most.

The biggest problem I foresee is that it will be hard to automate the construction of the standard form of an artist from something like iTunes metadata because the input (from Gracenote) doesn’t separate out a list of artists in any remotely consistent way, so I think standardizing them will require a degree of human intervention. This is not, however, in any way particular to this suggested convention; it’s fundamentally to do with the fact that some artists are identified as a list of people, and others have a group name, and telling these apart is hard, even without complications such as the band Alice Cooper!

Here are a few examples of the sorts of album about tags I’m suggesting:

  • The Black Balloon, by John Renbourn album:the black balloon (john renbourn)
  • The Composer, by Thelonious Monk album:the composer (thelonious monk)
  • Fleetwood Mac, by Fleetwood Mac album:fleetwood mac (fleetwood mac)
  • Wu Wei, by Pierre Bensusan album:wu wei (pierre bensusan)
  • The Tatum Group Masterpieces Volume 1, by Art Tatum, Benny Carter, Louis Bellson album:the tatum group masterpieces volume 1 (art tatum; benny carter; louis bellson)
  • Ms. Right, by Duck Baker album:ms right (duck baker)
  • ‘Round About Midnight, by The Miles Davis Quintet album:round about midnight (the miles davis quintet)
  • A Matter Of Time, by Gordon Giltrap & Martin Taylor album:a matter of time (gordon giltrap; martin taylor)
  • Musiques / Solilaï, by Pierre Bensusan album:musiques solilai (pierre bensusan)
  • Live Au New Morning, by Bensusan & Malherbe album:live au new morning (bensusan; malherbe)
  • Eye To The Telescope, by KT Tunstall album:eye to the telescope (k t tunstall)
  • Grace & Danger, by John Martyn album:grace & danger (john martyn)
  • Alas, I Cannot Swim, by Laura Marling album:alas i cannot swim (laura marling)
  • Lady In Autumn: The Best Of The Verve Years, by Billie Holiday album:lady in autumn the best of the verve years (billie holiday)
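As a rough illustration of the album conventions, here is a minimal Python sketch. The names `normalize`, `artist_part` and `album_about` are hypothetical and the normalizer is a simplified approximation rather than the abouttag library's implementation. Note that the caller has to decide whether the artist is a list of people or a single group name; as discussed above, that decision is the part that is hard to automate.

```python
import re
import unicodedata


def normalize(text):
    """Simplified NACO-like normalization (an approximation only)."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c)).lower()
    text = re.sub("['\u2018\u2019]", "", text)   # delete apostrophes
    text = re.sub(r"[^\w\s&]", " ", text)        # other punctuation -> space
    return " ".join(text.split())


def artist_part(artists):
    """artists: a list of individual musicians, or a one-item list holding
    a group name such as "Diana Ross & the Supremes"."""
    cleaned = []
    for name in artists:
        # Within a single name, spell the ampersand out as 'and'.
        cleaned.append(normalize(name.replace("&", " and ")))
    # Separate individual musicians with a semicolon and a space.
    return "; ".join(cleaned)


def album_about(title, artists):
    return "album:%s (%s)" % (normalize(title), artist_part(artists))


print(album_about("Reflections", ["Diana Ross & the Supremes"]))
# album:reflections (diana ross and the supremes)
print(album_about("A Matter Of Time", ["Gordon Giltrap", "Martin Taylor"]))
# album:a matter of time (gordon giltrap; martin taylor)
```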

Tracks

I was originally minded to suggest using song: as the prefix for individual album tracks, notwithstanding the fact that this is slightly inappropriate for instrumental pieces. This was until I realised that we will certainly want to have entries for songs themselves (independent of artist) in FluidDB. Given this, I think we have little choice but to fall back to track, which is perhaps more appropriate anyway.

I think there are a couple of points to be made about tracks. The first is that I do not propose to tie them to albums. Thus if an artist records a track (piece/song), I suggest that in the common case we don’t distinguish between different records. When you talk about Billie Holiday’s recording of God Bless the Child, you actually talk about all her records of that song, in the general case.

track:god bless the child (billie holiday)

Similarly, if, as is quite common, a track is qualified by (live) or [live], I suggest that be omitted in the standard case.

The other reasonably common complication, particularly for folk music, is the medley. In this case, my suggestion is just to hand the track name to the NACO-like normalization routine and use what it produces. In most cases, this works fine.

To try to illustrate lots of common cases, here is a fairly long list of examples:

  • Rhythm-a-Ning, by Thelonious Monk track:rhythm a ning (thelonious monk)
  • Round Midnight, by Thelonious Monk track:round midnight (thelonious monk)
  • Straight, No Chaser, by Thelonious Monk track:straight no chaser (thelonious monk)
  • Bourrée I and II, by John Renbourn track:bourree i and ii (john renbourn)
  • Medley: The Mist Covered Mountains of Home / The Orphan / Tarboulton, by John Renbourn track:medley the mist covered mountains of home the orphan tarboulton (john renbourn)
  • Monday Morning, by Fleetwood Mac track:monday morning (fleetwood mac)
  • Poussière d’Amants, by Pierre Bensusan track:poussiere damants (pierre bensusan)
  • Doherty’s - Return to Milltown - Tommy People’s, by Tony McManus track:dohertys return to milltown tommy peoples (tony mcmanus)
  • Jackie Coleman’s - The Milliner’s Daughter - Rakish Paddy - Connor Dunn’s, by Tony McManus track:jackie colemans the milliners daughter rakish paddy connor dunns (tony mcmanus)
  • Blues in C, by Art Tatum, Benny Carter, Louis Bellson track:blues in c (art tatum; benny carter; louis bellson)
  • S’Wonderful, by Art Tatum, Benny Carter, Louis Bellson track:swonderful (art tatum; benny carter; louis bellson)
  • Makin’ Whoopee, by Art Tatum, Benny Carter, Louis Bellson track:makin whoopee (art tatum; benny carter; louis bellson)
  • (I’m Left With the) Blues in my Heart, by Art Tatum, Benny Carter, Louis Bellson track:im left with the blues in my heart (art tatum; benny carter; louis bellson)
  • The Nine Maidens a. Clarsach b. The Nine Maidens c. The Fiddler, by John Renbourn track:the nine maidens a clarsach b the nine maidens c the fiddler (john renbourn)
  • Ms. Right, by Duck Baker track:ms right (duck baker)
  • ‘Round Midnight, by The Miles Davis Quintet track:round midnight (the miles davis quintet)
  • Ah-Leu-Cha, by The Miles Davis Quintet track:ah leu cha (the miles davis quintet)
  • Across The Pond, by Gordon Giltrap & Martin Taylor track:across the pond (gordon giltrap; martin taylor)
  • G & T Blues, by Gordon Giltrap & Martin Taylor track:g & t blues (gordon giltrap; martin taylor)
  • Abide With Me / Old Gloryland, by Stefan Grossman & John Renbourn track:abide with me old gloryland (stefan grossman; john renbourn)
  • Badhra, by Anouar Brahem, John Surman, Dave Holland track:badhra (anouar brahem; john surman; dave holland)
  • Biodag Aig Mac Thomais/The Nine Pint Coggie/The Spike Island Lasses, by Tony McManus track:biodag aig mac thomais the nine pint coggie the spike island lasses (tony mcmanus)
  • Three Pieces By O’Carolan;The Lamentation Of Owen Roe O’Neill; Lord Inchiquin; Mrs Power (O’Carlan’s Concerto), by John Renbourn track:three pieces by ocarolan the lamentation of owen roe oneill lord inchiquin mrs power ocarlans concerto (john renbourn)
  • Heman Dubh, by Pierre Bensusan track:heman dubh (pierre bensusan)
  • Le Voyage pour L’Irelande, by Pierre Bensusan track:le voyage pour lirelande (pierre bensusan)
  • 50 Ways To Leave Your Lover, by Paul Simon track:50 ways to leave your lover (paul simon)
  • La Danse Du Capricorne 1, by Pierre Bensusan track:la danse du capricorne 1 (pierre bensusan)
  • Reels - “The Pure Drop”/”The Flax In Bloom”, by Pierre Bensusan track:reels "the pure drop" "the flax in bloom" (pierre bensusan)
  • Mille Vallées, by Bensusan & Malherbe track:mille vallees (bensusan; malherbe)
  • Bamboo Shoot (Improvisation), by Bensusan & Malherbe track:bamboo shoot improvisation (bensusan; malherbe)
  • Black Horse And The Cherry Tree, by KT Tunstall track:black horse and the cherry tree (k t tunstall)
  • Universe & U, by KT Tunstall track:universe & u (k t tunstall)
  • Sigmund Freud’s Impersonation Of Albert Einstein In America, by Randy Newman track:sigmund freuds impersonation of albert einstein in america (randy newman)
  • Mr. President (Have Pity On The Working Man), by Randy Newman track:mr president have pity on the working man (randy newman)
  • I Love L.A., by Randy Newman track:i love l a (randy newman)
  • The Blues, by Randy Newman track:the blues (randy newman)
  • Through-Us-All, by Isaac Guillory track:through us all (isaac guillory)
  • A Terrible Pickle, by Dean Friedman track:a terrible pickle (dean friedman)
  • Money, by Pink Floyd track:money (pink floyd)
  • Take Five, by Dave Brubeck Quartet track:take five (dave brubeck quartet)
  • Pirates (So Long Lonely Avenue), by Rickie Lee Jones track:pirates so long lonely avenue (rickie lee jones)
  • The Returns, by Rickie Lee Jones track:the returns (rickie lee jones)
  • Chuck E’s In Love, by Rickie Lee Jones track:chuck es in love (rickie lee jones)
  • Harry’s House/Centerpiece, by Joni Mitchell track:harrys house centerpiece (joni mitchell)
  • I’s A Muggin’ (Rap), by Joni Mitchell track:is a muggin rap (joni mitchell)
  • Miles Beyond, by Mahavishnu Orchestra track:miles beyond (mahavishnu orchestra)
  • A Surfer Courted Me, by Martha Tilston and the Woods track:a surfer courted me (martha tilston and the woods)
  • Lookin’ On, by John Martyn track:lookin on (john martyn)
  • The Captain And The Hourglass, by Laura Marling track:the captain and the hourglass (laura marling)
  • Le Chien Sur Les Genoux de la Devineresse, by Anouar Brahem, Barbaros erkose, Kudsi Erguner & Lassad Hosni track:le chien sur les genoux de la devineresse (anouar brahem; barbaros erkose; kudsi erguner; lassad hosni)
  • A Prayer, by Madeleine Peyroux track:a prayer (madeleine peyroux)
  • Was I?, by Madeleine Peyroux track:was i (madeleine peyroux)
  • (I Got A Man Crazy For Me) He’s Funny That Way, by Billie Holiday track:i got a man crazy for me hes funny that way (billie holiday)
  • Lover Man (Oh, Where Can You Be?), by Billie Holiday track:lover man oh where can you be (billie holiday)
  • St. Louis Blues, by Billie Holiday track:st louis blues (billie holiday)
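The track rules (drop a trailing (live) or [live] qualifier, then hand the whole title, medleys included, to the normalizer) might be sketched as follows. Again, `normalize` and `track_about` are hypothetical names, the normalizer is a simplified approximation of the NACO-like rules, and the regular expression for spotting the live qualifier is my own assumption about how such titles are usually written.

```python
import re
import unicodedata

# Assumed pattern: a trailing "(live)" or "[live]", in any letter case.
LIVE = re.compile(r"\s*[\(\[]\s*live\s*[\)\]]\s*$", re.IGNORECASE)


def normalize(text):
    """Simplified NACO-like normalization (an approximation only)."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c)).lower()
    text = re.sub("['\u2018\u2019]", "", text)   # delete apostrophes
    text = re.sub(r"[^\w\s&]", " ", text)        # other punctuation -> space
    return " ".join(text.split())


def track_about(title, artists):
    """Build a track about tag that is not tied to any particular album."""
    title = LIVE.sub("", title)                  # omit the live qualifier
    names = "; ".join(normalize(a) for a in artists)
    return "track:%s (%s)" % (normalize(title), names)


print(track_about("God Bless the Child (Live)", ["Billie Holiday"]))
# track:god bless the child (billie holiday)
print(track_about("Rhythm-a-Ning", ["Thelonious Monk"]))
# track:rhythm a ning (thelonious monk)
```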

Songs

[UPDATE 2011/01/19: I have modified this recommendation since it was first posted, after thinking more about the lack of consistency in how composers are identified.]

I have given less thought to songs (as distinct from tracks, or recordings of songs), but the obvious convention would seem to be to use the song: prefix, followed by the normalized song title, followed by the composer or composers in brackets, again in whatever order they are normally listed. The only real complication I can see there is the fairly common case in which music and lyrics are given separate credits. In that case, I think I suggest simply listing the music composer ahead of the lyrics composer.

The slightly subtle question concerns how to standardize the composer’s name. In the case of artists (and authors) my normal recommendation is to start from the name as it appears on the work, so John Martyn, J. D. Salinger etc. This works well because you just have to look at the work to see how it is written; and for this reason, there’s a well-defined, standard place to look (the work).

Composers are more awkward, because it is much less clear where to look. If you own a record, the easy thing to do is to look at the sleeve, or the liner notes, or sometimes on the record (or CD) itself. But the same song can be recorded many times and the composer won’t always be displayed consistently. You could also look at the sheet music. Or in Wikipedia. In short, there is no consistency. A quick look through the first half dozen makes it clear there’s not even consistency on a single CD in many cases.

In this case, therefore, my recommendation is to use surnames only. So in a simple case, Summertime by George Gershwin, is

song:summertime (gershwin)

The Lennon/McCartney partnership would produce, for example

song:hey jude (lennon; mccartney)

A case in which lyrics and music are credited separately would be Officer Krupke, from West Side Story, by Leonard Bernstein (music) and Stephen Sondheim (lyrics). So this would be:

song:officer krupke (bernstein; sondheim)

The reason I’ve gone for surname only is that it seems to involve very little loss of precision (it will be rare indeed for two songs with the same title to have different composers with the same surname but different forenames), and to use the smallest amount of information that is commonly available. I think this is probably a fairly good convention.
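A minimal Python sketch of the surname-only song convention follows. The names `normalize`, `surname` and `song_about` are hypothetical, the normalizer is a simplified approximation, and taking the last word of a name as the surname is an assumption that will not hold for every composer (multi-word surnames, for example, would need handling by hand).

```python
import re
import unicodedata


def normalize(text):
    """Simplified NACO-like normalization (an approximation only)."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c)).lower()
    text = re.sub("['\u2018\u2019]", "", text)   # delete apostrophes
    text = re.sub(r"[^\w\s&]", " ", text)        # other punctuation -> space
    return " ".join(text.split())


def surname(name):
    # Assumption: the surname is the last word of the normalized name.
    return normalize(name).split()[-1]


def song_about(title, composers):
    """composers: list in credited order, music composer before lyricist."""
    names = "; ".join(surname(c) for c in composers)
    return "song:%s (%s)" % (normalize(title), names)


print(song_about("Summertime", ["George Gershwin"]))
# song:summertime (gershwin)
print(song_about("Hey Jude", ["John Lennon", "Paul McCartney"]))
# song:hey jude (lennon; mccartney)
```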

Comments Invited

As ever, I’d be interested in thoughts from anyone, in the blog comments or directly. I haven’t pushed an updated version of the abouttag library containing these to github yet, but will probably do so in a few days unless there is significant push-back.

15 January 2011

DRM, Readability and the Ownership of Bits

Summary

It is not only pirates and thieves who should be deeply worried by DRM.

We need legal protection of our bits (our digital information).

As we rush (inevitably) towards digital everything, there are real risks of losing everything. As a society, we should be putting in place disaster recovery plans for our culture, using safer and longer-established technologies such as paper.

DRM

The debate on DRM too often seems to be a private argument between two kinds of parasites. In the red corner, we hear endlessly from the consumer parasites, people who think they have some God-given right to unpaid access to anything that anyone creates. In the blue corner, we hear equal and opposite special pleading from parasitic dinosaur executives whose only real wish is to keep suckering the public into paying them yet again, as the gatekeepers, for works they’ve already bought in several different formats without ever actually gaining any kind of right to content. But DRM concerns us all; and should be of concern to all of us.

Officially, DRM stands for Digital Rights Management. But this is a peculiarly Orwellian inversion. From the majority perspective, as Richard Stallman points out, the term is more easily understood if the word “rights” is replaced with “restrictions”. For DRM systems are not munificent helpers designed to protect the ability of the purchaser to use that which she has purchased. They are not helpful gofers that will search out a suitable decoder to allow you to use your legally purchased content on a different computer. DRM systems are not dedicated to ensuring that your DVD will play in any DVD player in the world if you have legally purchased the right to watch the DVD. They are not like escrows, designed to ensure that even if the company that sold you the content goes out of business, your ability to utilize that content will not be compromised. If DRM were any of these things, the word “rights” might be appropriate. In reality, the only rights DRM systems have any concern for at all are the “rights” of the “rights” holders (the record companies, film studios and publishers, rather than the artists, in most cases) to restrict, disable, and destroy.

If you doubt the malice of DRM systems and the umbrella of related technologies they exemplify, Kindlegate lays bare the toxic nature of the systems that are increasingly invading our lives.

Kindlegate and the Unselling of Orwell

AllSalesAreFinal.png

In the summer of 2009, Amazon Kindle customers who had purchased George Orwell’s Nineteen Eighty-Four and Animal Farm found, one day, that these works had disappeared from their Kindles. This was not the result of some horrible software glitch. Amazon deleted them. Amazon deliberately deleted them. Amazon, having taken these people’s money, and sold them a “Kindle Edition” e-book, decided to unsell them. [1] It simply reached into these readers’ Kindles and erased the relevant bits. Clearly there is no “all sales are final” notice on amazon.com. Amazon did not ask its Kindle customers to delete them for a refund (though it did refund them). Amazon did not produce a warrant and come round with police officers. It just reached a digital arm into its customers’ property and took them. The irony could only have been more complete if Fahrenheit 451 had also been deleted.

In “justifying” this action, Amazon said that the works by Orwell had been pulled because the Kindle publisher did not own the rights.

“When we were notified of this by the rights holder, we removed the illegal copies from our systems and from customers’ devices, and refunded customers.”

— Drew Herdener, a spokesperson with Amazon.com, quoted by Bobbie Johnson in The Guardian, 17th July 2009

Amazon’s CEO, Jeff Bezos, is no fool, and sounds as if he was almost as horrified as I and others were by this action by Amazon. He wrote this, which is better than I would expect from almost anyone else in his position.

This is an apology for the way we previously handled illegally sold copies of ‘1984’ and other novels on Kindle. Our ‘solution’ to the problem was stupid, thoughtless and painfully out of line with our principles. It is wholly self-inflicted, and we deserve the criticism we’ve received. We will use the scar tissue from this painful mistake to help make better decisions going forward, ones that match our mission.

With deep apology to our customers,

— Jeff Bezos, CEO & Founder, Amazon.com, Kindle Community, 23rd July 2009.

It’s a pretty good statement and rings true to me. But it doesn’t say what Amazon should have done. It doesn’t say what Amazon will do next time a similar situation arises. More importantly, it doesn’t promise never again to reach into people’s devices and reset their bits without their permission. The statement is pretty good for a CEO in Bezos’s position; but it still falls woefully short of providing any real reassurance.

The Threats to our Bits, our Culture, our Knowledge

I do not believe there is any future in opposing or resisting the migration of almost all forms of creative, cultural and intellectual products to digital form. This is partly because I think the benefits of digital works on balance outweigh the downsides. But it is also because I think the transition to digital forms is a historical inevitability. Just as we cannot uninvent the bomb, we cannot uninvent digital music, text or images, and happily, we have much less reason to do so.

We can, however, think carefully about how we maximize the benefits from digital technologies and ameliorate the concomitant risks. We can use legal, financial and societal measures to try to ensure our digital future is bright.

As we move from storing books and music on paper and vinyl to storing them as configurations of bits on various kinds of computer memory, we face various significant risks, both as individuals and as societies—even as a species.

Broadly, the threats to our information come from disasters and accidents, from (illegal) sabotage, from legalized sabotage, from technical obsolescence and from insouciance and the inability or unwillingness of our law-makers to challenge vested interests.

Owning our Bits

As ever more of that which we value migrates from the domain of matter—whose properties, durability and character we have learned over millennia—to the digital domain of ones and zeroes, with which we have only short experience, we need to start treating bits rather more seriously. We have learned over centuries to increase the durability of non-digital information, both through physical measures (acid-free paper, stable inks, reprints and so on) and legal protections (freedom of speech, public libraries, central libraries, fair use doctrines and so forth).

Whatever the unread, unreasonable and unjust licence agreement that Kindle owners presumably assent to may say, it is outrageous that Amazon could have any kind of right willfully to destroy information on its customers’ devices. It is wrong. It would not happen in the domain of matter. Had I purchased a paperback copy of Nineteen Eighty-Four from Amazon, and the company had subsequently discovered that Penguin Books had in fact infringed copyright by printing and selling the book to Amazon, this would not have given Amazon the right to come into my house, retrieve the book and leave a fiver in the hall on the way out by way of compensation. In the domain of atoms and matter, we have long established ways of dealing with these sorts of situations, and they tend to involve the police, courts, warrants, discussion and notions of natural justice. I can think of nothing in the established world of atoms even vaguely similar to Amazon’s casual unselling of its Kindle editions.

So the first thing we need is for it simply to be illegal for Amazon or anyone else to alter information on our devices (or, for that matter, in our “cloud” storage) without either our explicit, meaningful and uncoerced consent, or the benefit of some suitable instrument of law (by which, for avoidance of doubt, I do not mean a clickthrough licence agreement).

DRM: Tame it or Kill It

The state of DRM systems is a kind of Wild West where companies can and do try almost anything. Sony has installed rootkits (a form of malware) on computers as part of its DRM efforts; DVDs are deliberately “region encoded” so that they can only be played on certain players; music is tied to a particular computer and can only be transferred to a limited number (if any) of others; printers will only use ink from cartridges with the right code on a chip; some e-books from Amazon cannot be read aloud by the dalek-voiced Kindle because the audio right is not included with the ebook.

Mummy, will you read me a story please?

Certainly dear. What story would you like?

Thomas the Tank Engine! Thomas the Tank Engine!

Oh, I’m sorry, but your Thomas the Tank Engine books don’t include the audio right so I can’t read that one to you. How about the King James edition bible? There’s no DRM on that. Or the GNU General Public License, perhaps? I’m sure there wouldn’t be a problem reading that aloud!

Even as basic an operation as backing up a DVD is prohibited, as is ripping it to play it on an iPad or a phone. Legally purchased DVDs are so painful to use that on the rare occasions I have bought them I have sometimes given up before getting to the film. Some others I know purchase DVDs and then watch the painless torrent. This widely shared imgur image summarizes the absurdity of today’s situation well.

http://i.imgur.com/GxzeV.jpg

Microsoft’s “Genuine Advantage” program so annoyed customers who had purchased legitimate licenses for Microsoft products but found themselves unable to use them that Microsoft has now quietly dropped the scheme.

Where will it end?

Lessons from Music

As everyone knows, DRM didn’t go to plan for the music industry, and interestingly now seems to be disappearing. It was always odd. The industry was terrified by the threat of perfect digital copies proliferating without money flowing back to them, so for years resisted selling music “digitally”. Except that it didn’t. The record industry happily sold high-quality, non-DRM infected music all the time they claimed the skies would fall in if they did just that. It’s just they sold them in the reassuringly familiar and solid form of matter. They sold their bits as CDs.

We will never know what would have happened if instead of burying its head in the ground, adopting DRM, dragging its heels, suing its customers and making up numbers about lost sales that not even a fool in a hurry would be taken in by, the record industry had embraced the inevitable and started doing something more like Apple did with iTunes. My guess is that it would probably be in considerably better shape than it is now.

It amazes me that the book industry, having watched what happened in music, seems to have drawn the conclusion that DRM is the way forward. I think it is deluded, and that things will probably play out much as they did in music, i.e. in a few years’ time the DRM will get dropped, and the sky will not fall in. There will be some piracy, probably more than there was in print, just as there is in music. But there will also be savings and long-tail benefits, and as with music, eventually the book industry will realise that, in general, people understand that writers and their support infrastructures need to be paid; as long as e-books are convenient and reasonably priced and of high quality, most people will be happy to pay for them.

I have the highest respect and regard for those who produce the music, the books, the films and the art that form the bedrock of our culture, and very much want to see them properly rewarded. One of the exciting possibilities arising from new technologies is that artists can do more with less, and can retain more direct control with smaller support structures. It is quite possible for the digital revolution to provide more income for creators and their aides while lowering prices overall or tolerating a degree of piracy. The DRM systems being peddled, mostly not by creators, but by those who end up distributing their works, are wholly disproportionate; their insidious effects and potential for abuses even more severe than we have seen thus far are too serious for us to ignore. They also carry long-term risks, such as information becoming unusable as hardware and software changes and as companies go out of business. This danger also exists with open formats, but is far smaller and dramatically easier to protect against.

For our own sakes, our children’s sakes, and our culture’s sake, we need, as a society, to face up to the serious dangers that are intrinsic to DRM and its fellow travellers.

[1] There’s no word “unsell” you say? There is now.

03 January 2011

100 Bestsellers in FluidDB: So What?

tdvc.png

This evening I published another hundred books to FluidDB. This time it was a list of the 100 best-selling books of the last 12 years, as published by the ever-wonderful Guardian Data Store. I published them as a table, mostly using the conventions documented in this post. If you’re using a modern browser (almost anything other than Internet Explorer) you can see a visualization of the FluidDB object for the table at abouttag.com/butterfly/about/table:bestsellers-1998-2010 and the best-selling book (Dan Brown’s The Da Vinci Code, depressingly) at abouttag.com/butterfly/about/book:the da vinci code (dan brown).

There’s a tag on each book that hyperlinks to the next, so you could even click through all hundred if you really wanted.

So What?

Why should you or anyone else care that I’ve published this data to FluidDB? After all, the Guardian made the data available on Google docs, so anyone can do anything with it anyway. What’s the benefit of having it in FluidDB? I’m going to try to show a few things that might convince you there’s something interesting about putting this sort of data in FluidDB.

1. Query

The most obvious thing is that you can query data in FluidDB from anywhere with internet access without even having an account. For example, the query to find the best-selling book from the list in FluidDB is this:

miro/bestsellers-1998-2010/rank = 1

You can issue this query from anything that can talk to FluidDB and it should return the object corresponding to The Da Vinci Code, by Dan Brown. Here are a few ways of doing just that.

  • You can use the FluidDB Explorer and just paste the query into the query box. It should locate the object (identifying it by its about tag, which is book:the da vinci code (dan brown), and also by its FluidDB ID, which is e7fee95f-4dcd-458b-8893-b56352d455cf). If you then click on either the about tag or the object ID, the explorer will give you a list of tags on the object, and tell you there are too many to show (which isn’t really true). If you then click ‘Load all tag values’ it will get them and show them to you.

  • You can use my Python library fdb, which comes with a command-line tool, and type:

    fdb show -q 'miro/bestsellers-1998-2010/rank = 1' /miro/bestsellers-1998-2010/title /miro/bestsellers-1998-2010/author

    This produces the following output:

    1 object matched
    Object e7fee95f-4dcd-458b-8893-b56352d455cf:
      /miro/bestsellers-1998-2010/title = "The Da Vinci Code"
      /miro/bestsellers-1998-2010/author = "Dan Brown"

  • You can use curl at the command line (which is a utility installed on most systems by default) and type

    curl 'http://fluiddb.fluidinfo.com/values?query=miro/bestsellers-1998-2010/rank%3D1&tag=miro/bestsellers-1998-2010/title&tag=miro/bestsellers-1998-2010/author'

    which produces:

    {
      "results":
      {
        "id":
        {
          "e7fee95f-4dcd-458b-8893-b56352d455cf":
          {
            "miro/bestsellers-1998-2010/author": {"value": "Dan Brown"},
            "miro/bestsellers-1998-2010/title": {"value": "The Da Vinci Code"}}
        }
      }
    }
    

    (I’ve reformatted this slightly, but otherwise this is the exact output from FluidDB.)

  • You could even just use the query directly in your browser’s URL bar. Pasting the following into the address bar should work in almost all browsers:

    http://fluiddb.fluidinfo.com/values?query=miro/bestsellers-1998-2010/rank%3D1&tag=miro/bestsellers-1998-2010/title&tag=miro/bestsellers-1998-2010/author

    again, producing:

    {
      "results":
      {
        "id":
        {
          "e7fee95f-4dcd-458b-8893-b56352d455cf":
          {
            "miro/bestsellers-1998-2010/author": {"value": "Dan Brown"},
            "miro/bestsellers-1998-2010/title": {"value": "The Da Vinci Code"}}
        }
      }
    }
    

    In quite a few browsers, even the following will work:

    http://fluiddb.fluidinfo.com/values?query=miro/bestsellers-1998-2010/rank=1&tag=miro/bestsellers-1998-2010/title&tag=miro/bestsellers-1998-2010/author
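The same request can, of course, be put together programmatically. Here is a minimal Python sketch, using only the standard library, that builds the URL used above and unpacks a response of the shape shown. The function names (values_url and unpack) are my own, for illustration, and are not part of fdb; the sample response is simply the one quoted above, so the sketch runs without network access.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the curl example above
BASE = 'http://fluiddb.fluidinfo.com/values'


def values_url(query, tags):
    """Build a /values URL for a query and the tags to return (illustrative helper)."""
    params = [('query', query)] + [('tag', tag) for tag in tags]
    # safe='/' leaves tag paths readable; '=' in the query still becomes %3D
    return BASE + '?' + urlencode(params, safe='/')


def unpack(response_text):
    """Map each object ID in a /values response to a {tag: value} dictionary."""
    by_id = json.loads(response_text)['results']['id']
    return {obj_id: {tag: cell['value'] for tag, cell in tags.items()}
            for obj_id, tags in by_id.items()}


NS = 'miro/bestsellers-1998-2010'
url = values_url(NS + '/rank=1', [NS + '/title', NS + '/author'])
print(url)

# The response quoted above, pasted in as a string
sample = '''{"results": {"id": {"e7fee95f-4dcd-458b-8893-b56352d455cf": {
  "miro/bestsellers-1998-2010/author": {"value": "Dan Brown"},
  "miro/bestsellers-1998-2010/title": {"value": "The Da Vinci Code"}}}}}'''
book = unpack(sample)['e7fee95f-4dcd-458b-8893-b56352d455cf']
print(book[NS + '/title'], 'by', book[NS + '/author'])
```

To run it against the live service, you would fetch the URL (for example with urllib.request.urlopen) and hand the decoded body to unpack.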

2. More interesting queries

Regular readers of this blog will recall that I previously published a rather larger set of 1,000 books to FluidDB. These were again originally from the Guardian (though they pre-dated the data store/data blog) and this time consisted of the Guardian’s 1,000 novels that everyone must read. (See this post and this post for details.)

So an obvious question is: which of those original Guardian 1,000 books are in the 100 bestsellers of the last 12 years? The following FluidDB query will tell you:

has miro/books/guardian-1000 and has miro/bestsellers-1998-2010/title

(If I’d picked my tags better, this query would have been even simpler, but it’s not bad.)

As an illustration, if I issue that query, again asking for author and title, I get the following (using fdb):

fdb show -q 'has miro/books/guardian-1000 and has miro/bestsellers-1998-2010/title' /about /miro/books/title /miro/books/author

7 objects matched

Object ce180ce3-29b5-4abc-a031-64015b162f6a:
  /fluiddb/about = "book:birdsong (sebastian faulks)"
  /miro/books/title = "Birdsong"
  /miro/books/author = "Sebastian Faulks"

Object a2fa68ae-d409-422f-887a-dbdb7c1b4f18:
  /fluiddb/about = "book:atonement (ian mcewan)"
  /miro/books/title = "Atonement"
  /miro/books/author = "Ian McEwan"

Object d5ff7995-2ae6-4ba8-8549-ea1d0726484c:
  /fluiddb/about = "book:the kite runner (khaled hosseini)"
  /miro/books/title = "The Kite Runner"
  /miro/books/author = "Khaled Hosseini"

Object c64aeced-1505-4bb3-ab8a-0ce4c6a70ba3:
  /fluiddb/about = "book:white teeth (zadie smith)"
  /miro/books/title = "White Teeth"
  /miro/books/author = "Zadie Smith"

Object 7e076540-3e14-4232-8c46-13863bae77ec:
  /fluiddb/about = "book:the curious incident of the dog in the night time (mark haddon)"
  /miro/books/title = "The Curious Incident of the Dog in the Night-Time"
  /miro/books/author = "Mark Haddon"

Object 5be745bd-500d-458b-b4e6-dd08972b73f6:
  /fluiddb/about = "book:to kill a mockingbird (harper lee)"
  /miro/books/title = "To Kill A Mockingbird"
  /miro/books/author = "Harper Lee"

Object 3b416fa5-51ab-4160-9820-240a0591c3a2:
  /fluiddb/about = "book:the time travelers wife (audrey niffenegger)"
  /miro/books/title = "The Time Traveler's Wife"
  /miro/books/author = "Audrey Niffenegger"

(I’ve added some blank lines, but otherwise this is the raw output from fdb.)

Or perhaps I’d like to know all the books that sold over 2,000,000 copies. For that, the relevant FluidDB query is just:

miro/bestsellers-1998-2010/volume > 2000000

Again, illustrating with fdb, and this time asking only for the about tag that FluidDB uses to identify the object, we get this (faintly depressing) list:

fdb show -q 'miro/bestsellers-1998-2010/volume > 2000000' /about

8 objects matched

Object b2ff54a0-d94e-4fe1-951f-a4bd839ba219:
  /fluiddb/about = "book:harry potter and the half blood prince childrens edition (j k rowling)"

Object e7fee95f-4dcd-458b-8893-b56352d455cf:
  /fluiddb/about = "book:the da vinci code (dan brown)"

Object 04033298-9be8-41b8-b9ef-d1b11b1adfb9:
  /fluiddb/about = "book:harry potter and the philosophers stone (j k rowling)"

Object 60c5bbea-2568-4a68-825f-ffc4cfb20f88:
  /fluiddb/about = "book:harry potter and the prisoner of azkaban (j k rowling)"

Object 9258d0da-a65a-471b-abdb-277b68ea1ea0:
  /fluiddb/about = "book:harry potter and the chamber of secrets (j k rowling)"

Object 04a4b407-7f21-450b-83c3-2d840ef6a133:
  /fluiddb/about = "book:deception point (dan brown)"

Object 23d5a20a-ba28-43b8-9265-2afd8c4019ee:
  /fluiddb/about = "book:twilight (stephenie meyer)"

Object 36bc89a5-d91c-4cdf-9389-ff2fbe833d59:
  /fluiddb/about = "book:angels and demons (dan brown)"

3. Combining Data Sources

One of the really interesting things about this example is the seven books that overlap. For example, Audrey Niffenegger’s wonderful book, The Time Traveler’s Wife, is on both lists. A core idea of FluidDB is that different information comes to be associated by being placed on the same FluidDB object. The about tag (fluiddb/about) can be used to choose the object. In the case of novels, that object is identified [1] by an about tag of the form book:title (author); in this case, book:the time travelers wife (audrey niffenegger). Obviously, there’s room for ambiguity with case, punctuation etc., but there’s a library and a website that will sort most of that out for you.

When I uploaded data on the Guardian 1000 books (as the miro user), there wasn’t all that much information: author, title, year and the fact that it was on the Guardian 1000 list is pretty much all that was there. For example, here is what Aldous Huxley’s Brave New World looks like:

bnw.png

(Live version here)

In the case of the best-sellers, the dataset contained a bit more information, including sales volume, publisher, average selling price and total sales value.

The marvellous thing is that books that are on both lists automatically get all the data from both sources, simply because both uploads chose the same FluidDB objects (e.g., the one with the about tag book:the time travelers wife (audrey niffenegger)). You can see that object live here in a modern browser, or as it is at the time I write:

tttw.png

When I published the second list, I found that it included books that I had already rated in FluidDB. For example, I had already (personally, as njr) rated Small Island, by Andrea Levy, and as a result, when (as the miro user) I published the list of bestsellers, my njr/rating was already on some of them.

I think this is a powerful example of the potential of FluidDB, one that would be even more potent if someone other than me (albeit as Miró) had previously published at least one of the two lists. But the point is that anyone following the convention about where to put data about books in FluidDB could equally easily have published the data, with the same result. As usage of the system increases, we will see this more and more.
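To make the mechanism concrete, here is a toy in-memory model in Python. It is not FluidDB and not fdb, just a dictionary keyed by about tag; the helper names (tag and has) are invented for the illustration, and I have assumed for simplicity that the guardian-1000 tag carries no value. The point is only that two datasets published at different times meet on the same object when they agree on the about tag.

```python
from collections import defaultdict

# about tag -> {tag path: value}; exactly one shared object per about tag
objects = defaultdict(dict)


def tag(about, path, value=None):
    """Attach a tag (optionally with a value) to the object for this about tag."""
    objects[about][path] = value


def has(*paths):
    """About tags of the objects that carry every one of the given tags."""
    return sorted(about for about, tags in objects.items()
                  if all(path in tags for path in paths))


# First dataset: the Guardian 1000 novels (value-less tag is an assumption)
tag('book:brave new world (aldous huxley)', 'miro/books/guardian-1000')
tag('book:the time travelers wife (audrey niffenegger)', 'miro/books/guardian-1000')

# Second dataset, published later: the bestsellers
tag('book:the time travelers wife (audrey niffenegger)',
    'miro/bestsellers-1998-2010/title', "The Time Traveler's Wife")
tag('book:the da vinci code (dan brown)',
    'miro/bestsellers-1998-2010/title', 'The Da Vinci Code')

# Equivalent of: has miro/books/guardian-1000 and has miro/bestsellers-1998-2010/title
overlap = has('miro/books/guardian-1000', 'miro/bestsellers-1998-2010/title')
print(overlap)
```

Only the book tagged by both uploads comes back, and its object now holds the tags from both sources without either upload having known about the other.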

Go Explore; Go Tag

This post just scratched the surface, but I hope it begins to show the real and tangible benefits of publishing data to FluidDB. The data becomes capable of being queried. Multiple data sources combine, sometimes in ways that had not been foreseen. It can be accessed visually, from a command line, or programmatically. And you can add your own data, whether it be annotations, ratings, comments, associations or whatever.

So go explore; and if you like, get an account and start tagging/publishing.

[1] Regular readers will know that I am a rather strong advocate of conventions for about tags in FluidDB in general, and for this convention for books in particular. Anyone can publish any data to FluidDB using any objects or conventions they like; but, as this post illustrates, there are real benefits when different datasets concerning overlapping things use common conventions.

02 January 2011

Update to Abouttag (app and library)

I’ve updated both the abouttag library and the Abouttag app to do a bit more normalization of books.

The changes are as follows:

Relocation of Articles (The, A)

It is quite common for titles to be presented with articles at the end, after a comma, to facilitate alphabetical sorting. For example, The Catcher in the Rye would often be written as Catcher in the Rye, The. Although slightly less common, this also happens with the indefinite article, so that A Stitch in Time would become Stitch in Time, A.

The library now has a function, move_article that will move such articles to the front, so that all of:

  • The Catcher in the Rye
  • Catcher in the Rye, The
  • Catcher in the Rye,The

become

  • The Catcher in the Rye

Similarly, all of

  • A Stitch in Time
  • Stitch in Time, A
  • Stitch in Time,A

become

  • A Stitch in Time.

NOTE: At the moment, only the English articles ‘a’ and ‘the’ are relocated, but the code is quite general and uses a list. At some point, I will probably extend the list, and if you download the library it will be trivial for you to do so. The main thing that stopped me from at least adding French was the case of l’ (e.g. l’Alchimiste), which might require fractionally more thought than I want to give it immediately.
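For anyone curious what this sort of relocation involves, here is a minimal sketch in Python. It is not the code of move_article in the library; the function name relocate_article and the regular expression are my own illustration of the behaviour described above, driven by a list of articles in the same spirit.

```python
import re

ARTICLES = ['the', 'a']   # extend this list to relocate other articles


def relocate_article(title, articles=ARTICLES):
    """Move a trailing ', The' or ', A' to the front of a title (illustrative)."""
    alternatives = '|'.join(re.escape(article) for article in articles)
    # Everything up to a final comma, then one of the articles at the very end
    pattern = r'^(.*\S)\s*,\s*(%s)$' % alternatives
    match = re.match(pattern, title.strip(), flags=re.IGNORECASE)
    if match is None:
        return title.strip()          # nothing to relocate
    return '%s %s' % (match.group(2), match.group(1))


for title in ['The Catcher in the Rye',
              'Catcher in the Rye, The',
              'Catcher in the Rye,The',
              'Stitch in Time,A']:
    print(relocate_article(title))
```

A title that merely contains a comma is left alone, because the pattern only fires when the text after the last comma is exactly one of the listed articles.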

Authors

There is also a function move_surname_to_end that will move surnames to the end of names (where detectable) and also regularize initials. So the following variations of J. D. Salinger

  • J. D. Salinger
  • J.D.Salinger
  • J.D. Salinger
  • JD Salinger
  • Salinger,J.D.
  • Salinger, J.D.
  • Salinger, J. D.
  • Salinger, JD

all map to

  • J. D. Salinger

NOTE: Initials with accents are not standardized at present. This would be a fairly simple change, which I expect I will make, but would require a slightly different approach. Relocation should work fine, even with accents.
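Again as a rough illustration only, the following Python sketch handles the eight variations listed above. It is not the library’s move_surname_to_end; the name surname_last and the rule for spotting initials (a run of up to three capital letters with no full stop) are assumptions of mine, and that rule would mangle a short surname written entirely in capitals.

```python
def surname_last(name):
    """Put the surname last and write initials as 'J. D.' (illustrative sketch)."""
    name = name.strip()
    # 'Surname, Forenames' -> 'Forenames Surname'
    if ',' in name:
        surname, _, forenames = name.partition(',')
        name = '%s %s' % (forenames.strip(), surname.strip())
    # Make sure every full stop is followed by a space, then split into tokens
    tokens = name.replace('.', '. ').split()
    parts = []
    for token in tokens:
        bare = token.rstrip('.')
        if token.endswith('.') and len(bare) == 1:
            parts.append(bare.upper() + '.')             # 'J.' stays 'J.'
        elif (not token.endswith('.') and bare.isalpha()
              and bare.isupper() and len(bare) <= 3):
            parts.extend(c + '.' for c in bare)          # 'JD' -> 'J.', 'D.'
        else:
            parts.append(token)                          # ordinary name
    return ' '.join(parts)


variants = ['J. D. Salinger', 'J.D.Salinger', 'J.D. Salinger', 'JD Salinger',
            'Salinger,J.D.', 'Salinger, J.D.', 'Salinger, J. D.', 'Salinger, JD']
print(sorted(set(surname_last(v) for v in variants)))
```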

About Tags

The about tag construction function book uses both of these mappings, so that now, for example, if you have the latest version of the library, the following will work.

$ python
Python 2.6.5 (r265:79359, Mar 24 2010, 01:32:55)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from abouttag.books import book
>>> print book(u'Catcher in the Rye,The', u'Salinger,J.D.')
book:the catcher in the rye (j d salinger)
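The final normalization step can be sketched along the following lines. This is not the abouttag library’s code: the function names are invented, and the rules (delete apostrophes, turn other punctuation into spaces, strip accents, lower-case, collapse spacing) are simply what I would infer from the about tags quoted in these posts, so the real library may differ in detail. The sketch assumes the title and author are already in their natural order.

```python
import re
import unicodedata


def normalize(text):
    """Lower-case, strip accents and punctuation, and regularize spacing."""
    # Decompose accented characters and drop the combining marks
    text = unicodedata.normalize('NFKD', text)
    text = ''.join(c for c in text if not unicodedata.combining(c))
    text = re.sub(r"['\u2019]", '', text.lower())   # apostrophes simply vanish
    text = re.sub(r'[^\w\s]|_', ' ', text)          # other punctuation -> space
    return ' '.join(text.split())                   # collapse runs of spaces


def book_about(title, author):
    """Build an about tag of the form book:title (author) (illustrative)."""
    return 'book:%s (%s)' % (normalize(title), normalize(author))


print(book_about('The Catcher in the Rye', 'J. D. Salinger'))
print(book_about("The Time Traveler's Wife", 'Audrey Niffenegger'))
```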

These changes seem to me like general improvements to the library, making it more likely people will converge on the same object for a book. I made the changes today, specifically, because the Guardian has just published a list of the 100 best-selling books of the last 12 years (1998–2010). As you might guess, the list presents all titles with articles at the end and authors with surnames before forenames/initials. Depressing though the list is in many respects, I will probably upload the data to FluidDB later; this will be easier with the new version of the library.

Labels