30 December 2011

Lies, Damned Lies and Progress Indicators

Here are some you may have met. (Click start, in a modern non-Microsoft* browser, on each one.)

The perfect progress indicator (rarely seen in the wild).

START DONE predicted actual

The software installer progress bar

START DONE predicted actual

The software uninstaller progress bar

START DONE predicted actual

The media-download progress bar.

START DONE predicted actual

The software update progress bar.

START DONE predicted actual

No wonder so many cop out and use one of these:

http://fluiddb.fluidinfo.com/about/abouttag/njr/image/red-spinner.gif

* [Why non-Microsoft? Before going all HTML5 and SVG on this blog I made a point of testing things with Internet Explorer 9 and was delighted to find that everything seemed to work fine there. Unfortunately, I didn’t think to test SMIL-based SVG animation, which works (you guessed it) on Chrome, Safari, Firefox, Opera, iPhone, iPad and even newer Androids, but not, in fact, Internet Explorer 9.]

27 December 2011

The British Library Catalogue / British National Bibliography

I have added to Fluidinfo information on approximately 2.5 million books drawn from the roughly 3 million records in the British National Bibliography, which documents the British Library’s Catalogue.

As ever, I have used the book-u convention (implemented using the Python abouttag library) to select about tags for the objects, and have tagged the books in Fluidinfo under the book user. Data specific to the British National Bibliography (BNB) is stored in the namespace book/bnb, while more generic data (derived from the information contained in the Bibliography) is stored directly in the book namespace.

Here is an example of a book that has been augmented with data from the British National Bibliography. The book is George Orwell’s Animal Farm, and it is illustrated using the About Tag visualizer. (If you can’t see the picture below, upgrade to the latest version of your browser or see here for information on why you might be having trouble.) The green tags are the new ones.

Fluidinfo object 1529c459-f3f2-45e1-90f4-3ff3040ad6df

    alice/comment="So disappointing."
    alice/has-read
    alice/likes=False
    alice/rating=2
    bert/comment="What a book: I love it!"
    bert/has-read
    bert/rating=8
    book/author="George Orwell"
    book/bnb/contributors={}
    book/bnb/creator="Orwell, George, 1903-1950"
    book/bnb/id={"GB9689279", "GBB005647", "GB8416414", "GBA0Y6010", "GB9330497", "GB7301513"}
    book/dewey={"823.912", "823/.912", "823/.9/1"}
    book/isbn={"070898200X", "185715150X", "0582275245", "0582434475", "0435121650", "978141...}
    book/r=0.170613118849
    book/source={"BNBrdfdc13.xml-201011150#088316", "BNBrdfdc13.xml-201011150#029879", "BNBrdf...}
    book/title="Animal farm"
    fluiddb/about="book:animal farm (george orwell)"
    girafind/books/author={"George Orwell"}
    girafind/books/language="["$_english"]"
    girafind/books/title="Animal Farm"
    miro/books/author="George Orwell"
    miro/books/forename="George"
    miro/books/guardian-1000
    miro/books/surname="Orwell"
    miro/books/title="Animal Farm"
    miro/books/year=1945
    miro/class="record"
    njr/index/about
    njr/rating=10
    otoburb/has-read
    otoburb/rating=8

Notice that, because of the careful normalization inherent in the book-u convention, where the book is already in Fluidinfo, the new data has generally been added to the existing object corresponding to that book, as in the case above.

The core data that should almost always be present is:

  • the about tag fluiddb/about, normalized using the book-u convention:

    book:animal farm (george orwell)

  • the book/author tag, containing the best author information I was able to extract, in this case

    George Orwell

    Where there is more than one author, they are generally shown separated by commas, with the last joined with an and (with no Oxford Comma). For example, The Feynman Lectures on Physics, by Feynman, Leighton and Sands has

    $ fish show -a 'book:the feynman lectures on physics (richard p feynman; robert b leighton; matthew l sands)'
    /book/author
    
    Object with about="book:the feynman lectures on physics (richard p feynman; robert b leighton; matthew l sands)":
      /book/author = "Richard P. Feynman, Robert B. Leighton and Matthew L. Sands"

    or, graphically:

    Fluidinfo object aeaa654c-35b0-4b00-866b-c7deda8959c4

      book/author="Richard P. Feynman, Robert B. Leighton and Matthew L. Sands"
      book/bnb/contributors={"Leighton, Robert B.", "Sands, Matthew L. (Matthew Linzee)"}
      book/bnb/creator="Feynman, Richard P. (Richard Phillips), 1918-1988."
      book/bnb/id={"GBA901036", "GBA645628"}
      book/dewey={"530"}
      book/isbn={"0805390499", "0805390669"}
      book/r=0.893236319082
      book/source={"BNBrdfdc16.xml-201011150#096269", "BNBrdfdc14.xml-201011150#132062"}
      book/title="The Feynman lectures on physics"
      fluiddb/about="book:the feynman lectures on physics (richard p feynman; robert b leighton; m..."

    The book/author tag has had a lot of processing done to it, as described below.

  • the book/title field, which is usually almost identical to that in the BNB data. In this case it is:

    Animal farm

    I have not altered the capitalization, which is therefore generally consistent with some entry in the BNB database (though I would really prefer it were in Title Case).

  • the book/source tag shows where the base data was taken from. This tag’s value is a set of strings, each of which corresponds to an entry in one of the 17 files from which the BNB data was extracted. The entries consist of

    • the name of the file (always BNBrdfdcNN.xml) where NN runs from 01 to 17
    • a dash -
    • the datestamp on that file (always 20101115 at present)
    • the digit zero (0) and a # sign
    • the record number in the file, starting from 1, with six digits.

    Since multiple bibliographic entries can correspond to the same work, there is sometimes more than one of these.

  • the book/r tag is a pseudo-random floating point value with 0.0 ≤ book/r < 1.0.
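As a rough illustration of the book/source format described above, here is a minimal Python sketch that splits an entry into its parts. The names SourceRef, SOURCE_PATTERN and parse_book_source are hypothetical (they are not part of abouttag or Fish), and the pattern assumes the layout is exactly as listed: file name, dash, eight-digit datestamp, a zero, a # sign and a six-digit record number.

```python
import re
from collections import namedtuple

# Hypothetical container for the parts of one book/source entry.
SourceRef = namedtuple("SourceRef", "filename file_number datestamp record_number")

# Assumed layout: BNBrdfdcNN.xml - YYYYMMDD 0 # RRRRRR
SOURCE_PATTERN = re.compile(r"(BNBrdfdc(\d{2})\.xml)-(\d{8})0#(\d{6})")


def parse_book_source(entry):
    """Split one book/source string into file, datestamp and record number."""
    match = SOURCE_PATTERN.fullmatch(entry)
    if match is None:
        raise ValueError("unrecognized book/source entry: %r" % entry)
    filename, file_number, datestamp, record = match.groups()
    return SourceRef(filename, int(file_number), datestamp, int(record))


ref = parse_book_source("BNBrdfdc13.xml-201011150#088316")
print(ref.filename, ref.datestamp, ref.record_number)
# BNBrdfdc13.xml 20101115 88316
```

Applied to every member of a book/source set, this shows which of the 17 files (and which records within them) contributed to a given work.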

Some of the raw data has also been added, with almost no cleaning up, under the book/bnb namespace. The BNB data uses the Dublin Core metadata standard, and includes:

  • bnb/creator, which is the person or organization primarily responsible for the creation of the work. This is sometimes blank, and is stored as a single string value.
  • bnb/contributors, which is a list of contributors, sometimes including the creator and sometimes not.
  • bnb/dewey is the set of Dewey Decimal classifications found on the records corresponding to this book.
  • bnb/isbn is the set of international standard book numbers found on the records corresponding to this book.
  • bnb/id is the set of British Library IDs found on the records corresponding to this book. (I’m not entirely clear what this identifier is, but it appears to be important and well populated.)

Other information is available in the data (including classification information), and I will probably extract this and add it at a later time.

Finding, Inspecting and Tagging Books in Fluidinfo

There are multiple ways of retrieving book data from Fluidinfo and of tagging it.

  • Probably the easiest and most general method is to go to http://artoftagging.com and do a search that involves a book and some keywords from the title and/or author. A list of results should come back and you can see a visualization of any of them by clicking the link. If you have a Fluidinfo account, you can create an account at artoftagging.com and then save your Fluidinfo details there. Once logged in, you will then be able to add your own tags to any object you find.

  • If you just want to construct the about tag for a book, you can do that using the online version of the Fluidinfo Shell, Fish. Once there, type, for example:

    fish> about book "Animal Farm" "George Orwell"
    book:animal farm (george orwell)
    
    fish> about book "The Feynman Lectures on Physics" "Richard P. Feynman"
    "Robert B. Leighton" "Matthew L. Sands"
    book:the feynman lectures on physics (richard p feynman; robert b leighton; matthew l sands)

    (The quotes tell Fish that “Animal Farm” is the title and “George Orwell” a single author.) Alternatively, you can download and install Fish on your own machine. (It is available from Github.) You can then type the same commands, after fish, e.g.:

    $ fish about book "Animal Farm" "George Orwell"
    book:animal farm (george orwell)

    You can then use any Fluidinfo tool, including the new Object Browser, to work with that object, signing in with Twitter if you like.

  • Another easy way of finding an about tag for a book is to find it on Amazon (US or UK, for now) and use the az-fish bookmarklet available at the top of the online Fish (drag it to your browser’s toolbar). The bookmarklet will take the item on the current Amazon page and issue the appropriate Fish command to find the about tag. (You don’t need to log into Fish or Fluidinfo to do this.)
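To give a feel for what the about command is doing, here is a deliberately simplified Python sketch that reproduces the two about tags shown above. It is not the abouttag library's implementation: the function book_about and its helper are hypothetical, and the normalization is assumed to be no more than lower-casing, dropping punctuation and collapsing whitespace, which is almost certainly less careful than the real book-u convention.

```python
import re


def _normalize(text):
    """Lower-case, drop punctuation and collapse runs of whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def book_about(title, *authors):
    """Build a book-u style about tag: book:title (author; author; ...)."""
    names = "; ".join(_normalize(author) for author in authors)
    return "book:%s (%s)" % (_normalize(title), names)


print(book_about("Animal Farm", "George Orwell"))
# book:animal farm (george orwell)

print(book_about("The Feynman Lectures on Physics",
                 "Richard P. Feynman", "Robert B. Leighton", "Matthew L. Sands"))
# book:the feynman lectures on physics (richard p feynman; robert b leighton; matthew l sands)
```

For real use, the about command in Fish (or the abouttag library itself) is the thing to rely on, since only that guarantees the same about tag that other people's tools will generate.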

The Hierarchy of Books: Works and Manifestations

The International Federation of Library Associations (IFLA) describes a hierarchy of four kinds of “book” entities in its report Functional Requirements for Bibliographic Records. These are:

  • works
  • expressions
  • manifestations
  • items.

Quoting from that report:

“The entities defined as work (a distinct intellectual or artistic creation) and expression (the intellectual or artistic realization of a work) reflect intellectual or artistic content. The entities defined as manifestation (the physical embodiment of an expression of a work) and item (a single exemplar of a manifestation), on the other hand, reflect physical form.”

Loosely, a work is the conceptual book, usually described by the combination of a title and author—Animal Farm by George Orwell.

The report describes an expression of a work as “the intellectual or artistic realization of a work in the form of alpha-numeric, musical, or choreographic notation, sound, image, object, movement, etc., or any combination of such forms.” Thus George Orwell’s Animal Farm can be translated into different languages, laid out differently, typeset on pages, or in digital form, or recorded as spoken words, and these correspond to different expressions of that same book. There may also be different editions, printings etc., which may have slightly different content. Again, these are different expressions of the same conceptual work. (Occasionally, expressions may encompass several works, such as in the case of compendia.)

Moving down the hierarchy, a manifestation is a particular rendering of a work into physical form — “the physical embodiment of an expression of a work.” Note that “[A]s an entity, manifestation represents all the physical objects that bear the same characteristics, in respect to both intellectual content and physical form.” Thus, all the copies of the same printing of the same edition of Animal Farm that are essentially indistinguishable collectively correspond to a manifestation of George Orwell’s Animal Farm.

Finally, an item is an individual copy of a book: “a single exemplar of a manifestation.”

The entries in the British Library’s catalogue correspond literally to items but conceptually to manifestations, whereas the objects to which I have attached the data in Fluidinfo correspond to works. This is why the c. 3 million records reduce to c. 2.5 million Fluidinfo objects, and why some of the objects have multiple ISBNs etc. It is entirely possible to create further objects at the level of manifestations (and even items, if someone really wants to do so), and even more so at the level of expressions, but I have not done this yet.

The reason I have concentrated on works rather than manifestations is that this seems much the most important level to represent in a system like Fluidinfo: with important exceptions, when people want to rate or comment on a book, it is most often the work, rather than the manifestation, that they are interested in. Moreover, collecting together information about the different ISBNs associated with a single work is positively helpful. That is not to say that there isn’t a case for creating other objects at the level of expressions or manifestations.

Further Work

There is a great deal more that can be usefully done with the fabulous data from the British Library. While I am not committing to doing these, tasks on my list include:

  • Authors. Creating an object corresponding to each creator/author/contributor. I plan to use about tags of the form author:normalized name (birth-year) for these, e.g. author:George Orwell (1903). The required data is largely available in the BNB dataset. I would then plan to add a book/related-authors tag to each book, pointing to its authors’ objects and, on the author objects, corresponding sets of book/related-books tags pointing back to their works.

  • Upload Checking. Checking that everything uploaded OK. I count 2,558,738 unique books (as works) in the BNB dataset, and I appeared to upload all of these successfully (getting HTTP 204 statuses back from Fluidinfo). However, when I count objects having a book/r tag, I get only 2,468,661, a shortfall of 90,077.

    Whether this indicates a problem or not is unclear, as if I count the number of books with a book/source but no book/r, with the query

    has book/source except has book/r

    it reports 18,921 such books, but as far as I can tell, all those it finds in fact have a book/r, so it appears that Fluidinfo is having some difficulty executing some queries correctly at the moment.

  • About Tag Checking. I had to use some fairly hairy code to coerce the BNB data into the correct form to generate canonical about tags in the book-u convention, and it has definitely failed in some cases. For example, I have seen at least one example where the surname of an author in the BNB data preceded the forename but without a comma, so that forename and surname will have been reversed. To the extent that I can detect these problems, I will try to fix them.

  • Recent additions. I believe the British Library has issued updates with recent additions (since November 2010); I certainly plan to get that data and import it in a similar fashion, and then to set up a CRON job to do that regularly. In this way, I hope the dataset will be living and always current.

  • Categorizations. The BNB data includes subject categories for the records, which I have not imported thus far. I will do so.

  • Year information. There is information about publication dates in the BNB data, but it is not in a very structured form. If I am able to extract it with a satisfactory degree of reliability, I will get this too. Obviously, different manifestations will have different publication dates, so this will probably be a set-valued tag.
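The check described under Upload Checking can be pictured with ordinary Python sets. This is only a local model of what the query has book/source except has book/r is meant to return, not a call to Fluidinfo: the objects dictionary, its second and third entries, and the helper has are all made up for illustration.

```python
def has(objects, tag):
    """Return the about tags of all objects that carry `tag`."""
    return {about for about, tags in objects.items() if tag in tags}


# Hypothetical miniature of the data: about tag -> set of tag names present.
objects = {
    "book:animal farm (george orwell)": {"book/source", "book/r", "book/title"},
    "book:example with no r (a n author)": {"book/source", "book/title"},
    "Paris": {"njr/rating"},
}

# "has book/source except has book/r" behaves like a set difference.
missing_r = has(objects, "book/source") - has(objects, "book/r")
print(sorted(missing_r))
# ['book:example with no r (a n author)']

# The shortfall quoted above, from the two counts in the text.
expected_books = 2558738
books_with_r = 2468661
print(expected_books - books_with_r)
# 90077
```

If the upload had been complete and the queries were being executed correctly, the set difference would be empty and the shortfall zero.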

Enjoy the data, and let me know if you find problems.

I expect I will write a number of other posts on issues associated with this data.

About Tag Goes HTML5 with Embedded SVG: Browser Requirements

This page uses HTML5 and an embedded Scalable Vector Graphics (SVG) diagram, and acts as a test. From this point on (27th December 2011) my plan is to use HTML5 with embedded SVG as the default format for posts, and therefore if you use a browser (or feed reader) that does not support this, you will miss out.

You should see an elegant diagram below. If you do not, it probably means that you are not using a modern, standards-compliant HTML5 browser.

Fluidinfo object aeaa654c-35b0-4b00-866b-c7deda8959c4

    book/author="Richard P. Feynman, Robert B. Leighton and Matthew L. Sands"
    book/bnb/contributors={"Leighton, Robert B.", "Sands, Matthew L. (Matthew Linzee)"}
    book/bnb/creator="Feynman, Richard P. (Richard Phillips), 1918-1988."
    book/bnb/id={"GBA901036", "GBA645628"}
    book/dewey={"530"}
    book/isbn={"0805390499", "0805390669"}
    book/r=0.893236319082
    book/source={"BNBrdfdc16.xml-201011150#096269", "BNBrdfdc14.xml-201011150#132062"}
    book/title="The Feynman lectures on physics"
    fluiddb/about="book:the feynman lectures on physics (richard p feynman; robert b leighton; m..."

I have tested this page on the following:

  • A Macbook Pro running OS X 10.7.2 (Lion) with Safari 5.1.2, Chrome 16.0.912.63, Firefox 8.0.1, Opera 11.6
  • A Mac Pro running OS X 10.6.8 (Snow Leopard) with the same browsers.
  • An iPad running iOS 5.0.1 (in Safari, naturally)
  • An iPhone running iOS 5.0.1 (again, Safari, of course)
  • A Sony Vaio running Windows 7 with Internet Explorer 9
  • A prehistoric Dell latitude laptop running Chrome 16 and Firefox 9

and everything looks good. It does *not* work with Internet Explorer 8, but then, what does? It also does not work with many older versions of Firefox, Chrome, Safari or Opera. So this is a good time to upgrade. I don't know which Android or Linux browsers it works with, but I would guess it will work with some.

I imagine the diagrams will not show up in most feed readers, and that is unfortunate, but I think this is a time to push forward, so that is what I am doing.

Among the other marvellous benefits of SVG, it is scalable (just like the S says), which means that if you zoom the browser (generally command-plus on macs, and control-plus on Windows) the diagram will scale too, without looking terrible. Wonder of wonders.

19 December 2011

Fish 4.18 Released: Aliases, Sequences, Duck Typing and More

It has been an unconscionably long time since I last pushed a version of Fish to Github. The reason for this is mostly that, while adding various features, I broke a few things and wanted to get them all fixed before inflicting them on people. I believe they are now fixed (but do disabuse me of this notion if you discover otherwise).

There is much that is new, though almost all changes are backwards compatible. Everything is documented in the new green-themed documentation, available (as usual) in Fluidinfo itself at

http://fluiddb.fluidinfo.com/about/fish/fish/index.html

The main changes are as follows:

  • You can omit the -a or -i on the tag, untag, show, get and tags commands. If you do so, Fish will use the first argument instead, assuming that if it looks like a UUID, it is a UUID, and if not, that it isn’t. For these purposes, UUIDs must be expressed with lower-case hex digits and include the dashes in 88888888-4444-4444-4444-cccccccccccc formation. So now the following both work where before they would have generated errors:

    $ fish show Paris rating /about /id
    Object with about="Paris":
      /njr/rating = 10
      /fluiddb/about = "Paris"
      /id = "17ecdfbc-c148-41d3-b898-0b5396ebe6cc"
    
    $ fish show 17ecdfbc-c148-41d3-b898-0b5396ebe6cc rating /about /id
    Object 17ecdfbc-c148-41d3-b898-0b5396ebe6cc:
      /njr/rating = 10
      /fluiddb/about = "Paris"
      /id = "17ecdfbc-c148-41d3-b898-0b5396ebe6cc"

    Needless to say, the longer -a and -i forms work, and are useful if you want to tag multiple objects in one command (since they may be repeated).

  • The tags command now displays the tags in alphabetical order, except for the about tag, which is always listed first.

  • Fish now supports simple aliases, which effectively allow you to add commands to Fish. A simple example is:

    fish alias eiffel 'show -a "Eiffel Tower"'

    which allows commands like eiffel rating to be used in place of show -a "Eiffel Tower" rating.

    Aliases are stored in Fluidinfo, with private tags on objects whose about tag is the name of the alias. For example, given an alias paris defined as show -a "Paris", the object with about tag paris has a tag njr/.fish/alias added to it with its value set to the expansion text for the alias:

    $ fish alias paris
    paris:
      njr/.fish/alias = "show -a "Paris""

    (Obviously, the quoting here is slightly unfortunate; I will fix that some time.)

    Aliases are also cached locally in the file-system; the cache is updated from Fluidinfo using the new sync command or whenever Fish is entered in interactive mode (by typing Fish).

    Because aliases are stored in Fluidinfo, they can be shared between multiple copies of Fish, and also with the online version Shell-Fish.

    The cache can be viewed with showcache.

  • Support for sequences has been added. Sequences provide a convenient way of storing a numbered collection of items that are added to over time. They are described in a previous blog post (Sequences in Fluidinfo).

    Briefly, in the simplest case, a sequence of remarks is defined by saying:

    $ fish mkseq remark

    This creates two new aliases:

    • remark is used to add a new remark, using the alias

      $ fish alias
      remark:
        njr/.fish/alias = "seq /njr/remark"
    • remarks is used to look at (or search) remarks, using the alias

      remarks:
        njr/.fish/alias = "listseq /njr/remark"

    Thus if we say:

    $ fish mkseq remark
    Next remark number: 0
    
    $ fish remark "Isn't this a remarkable first remark"
    0: Isn't this a remarkable first remark
    2011-12-18
    
    $ fish remark "...and this only slightly less remarkable"
    1: ...and this only slightly less remarkable
    2011-12-18

    then we will see:

    $ fish remarks
    0: Isn't this a remarkable first remark
    2011-12-18
    
    1: ...and this only slightly less remarkable
    2011-12-18

    By default, sequences are public, but you can easily make them private by specifying a tag in a private namespace (typically private); you can also specify the plural form. To set up the sequence as private, and use myremarks to list and search remarks, you would instead say:

    $ fish mkseq remark myremarks private/remark

    For more details, see the previous blog post or the documentation.

  • Non-primitive types are now shown more sensibly (by show and tags). Previously, Fish would attempt to print even non-primitive types, with sometimes unfortunate consequences both in terms of the volume of output and its effects on terminals. For non-primitive types, output is now shown as below:

    $ fish show fish /fish/index.html
    Object with about="fish":
      /fish/index.html = <Non-primitive value of type text/html (size 8907)>
  • Display of set-valued tags is also improved, e.g.:

    $ fish tags 'artist:led zeppelin'
    Object with about="artist:led zeppelin":
      /fluiddb/about = "artist:led zeppelin"
      /musicbrainz.org/artist
      /musicbrainz.org/artist/end-date = "1980-09-25"
      /musicbrainz.org/artist/members = {
        "Jimmy Page"
        "John Bonham"
        "John Paul Jones"
        "Robert Plant"
      }
      /musicbrainz.org/artist/name = "Led Zeppelin"
      /musicbrainz.org/artist/sort-name = "Led Zeppelin"
      /musicbrainz.org/artist/start-date = "1968-01-01"
      /musicbrainz.org/artist/type = "group"
      /musicbrainz.org/mbid = "678d88b2-87b0-403b-b63d-5da7465aecc3"
  • The Fish API has been updated to take account of the renaming of FluidDB to Fluidinfo, and various tests have been changed to use more esoteric unicode characters.

  • Some operations are faster (because more use is made of the /values endpoint).

  • Finally, when Fish starts it checks the environment for the presence of the variable FISHUSER. If this is defined, the credentials in the startup file identified by the string specified in FISHUSER will be used, rather than the default ones. (This is mainly helpful if you want to use Fish with different Fluidinfo accounts in different shells concurrently.) Thus, if FISHUSER is set to foo (on UNIX), the credentials from ~/.fluidDBcredentials.foo will be used, rather than those in ~/.fluidDBcredentials.
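Two of the rules in the list above are easy to pin down in code: the test for whether a first argument "looks like a UUID", and the choice of credentials file when FISHUSER is set. The following is a minimal Python sketch of both, written from the descriptions above rather than taken from Fish's source; the names looks_like_uuid and credentials_path, and the home parameter, are hypothetical.

```python
import os
import re

# Lower-case hex digits with dashes, in 8-4-4-4-12 formation.
UUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def looks_like_uuid(argument):
    """True if the argument would be treated as an object ID, not an about tag."""
    return UUID_PATTERN.fullmatch(argument) is not None


def credentials_path(environ=None, home="~"):
    """Pick the credentials file, honouring FISHUSER if it is defined."""
    environ = os.environ if environ is None else environ
    base = os.path.join(os.path.expanduser(home), ".fluidDBcredentials")
    user = environ.get("FISHUSER")
    return base + "." + user if user else base


print(looks_like_uuid("17ecdfbc-c148-41d3-b898-0b5396ebe6cc"))  # True
print(looks_like_uuid("Paris"))                                 # False
print(looks_like_uuid("17ECDFBC-C148-41D3-B898-0B5396EBE6CC"))  # False (upper case)
```

Note that under this rule an about tag that happens to be a lower-case UUID can only be reached with an explicit -a, which is one reason the longer forms remain useful.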

So, obviously, there are quite a lot of changes, and though I’ve been using it for a while, some things might have broken. (I fixed some bugs yesterday; always dangerous!)

15 December 2011

Fragmentation and URL Normalization

I have updated the abouttag.py library to use a new, better convention for normalizing URLs. The two main changes people will notice are:

  1. URLs that represent directories will now include, rather than exclude, a trailing slash:

    http://fluidinfo.com/

    rather than

    http://fluidinfo.com
  2. There is now a dependency on the excellent urlnorm.py, by Jehiah Czebotar.

The Issue: Fragmentation

The twin evils that the abouttag.py library and this blog exist to fight are fragmentation and overloading.

Fragmentation occurs in Fluidinfo when different users store information about the same thing on different objects, while overloading occurs when people store information about different things on the same object. In general, both of these are undesirable. Fragmentation reduces data sharing and makes it harder to extract information from the system, whereas overloading creates ambiguity and confusion.

One of the more common uses for Fluidinfo is for tagging web pages, and it is very natural to use the URL as the about tag, as almost everyone does. There is not much of a problem with overloading in this case (except to the extent that URLs point to web pages that change over time), but there is definitely fragmentation.

I would distinguish between two kinds of fragmentation in the case of URLs.

  1. Different representations of the same URL. Perhaps the most obvious example is the trailing slash on many URLs. Punctilious persons with good knowledge of IETF standards (and in particular RFC3986) prefer the inclusion of a trailing slash on URLs (and more generally, on URIs) where appropriate, and thus prefer

    http://fluidinfo.com/

    to the more colloquial

    http://fluidinfo.com

    Technically, these are different URLs, but web servers so routinely and uniformly redirect the latter to the former that they can be considered for all practical purposes the same. It seems highly desirable for any convention for about tags for URLs to map these two forms, along with other similar representational variants, to a common about tag.

  2. Different URLs that may or may not represent the same web page. The most obvious example of this is the www. that used to be de rigueur and is now commonly (but not reliably) redundant. Although most right-thinking webmasters (webmistresses?) routinely redirect these to the same place, there is no general guarantee that the www. form (http://www.fluidinfo.com/) and the bare form (http://fluidinfo.com/) will produce the same page, nor even that they should both work.

    Standardizing this would therefore seem to be a normalization too far.

The Old and New Behaviour of abouttag.py

Fluidinfo is far from the only system with an interest in developing a canonical or normalized form for URLs. Search engines and social bookmarking sites (such as Pinboard and Delicious) work better if different URLs representing the same resource are collapsed, and as mentioned above, there is even a standard (RFC3986) for how to perform the canonicalization.

The relevant Wikipedia page describes six normalizations that preserve URL semantics. These are:

  • Converting the scheme and host to lower case. (HTTP:// → http:// and FLUIDINFO.COM → fluidinfo.com).
  • Capitalizing letters in escape sequences (%3a → %3A)
  • Decoding percent-encoded octets of unreserved characters (%7E → ~)
  • Adding a trailing slash where appropriate (http://fluidinfo.com → http://fluidinfo.com/)
  • Removing the default port (http://fluidinfo.com:80/ → http://fluidinfo.com/)
  • Removing dot-segments (http://fluidinfo.com/accounts/./new/ → http://fluidinfo.com/accounts/new/)
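To make the six steps concrete, here is a minimal sketch of them in Python 3, using only the standard library. It is an illustration under simplifying assumptions, not the code of urlnorm.py or abouttag.py: the function names are hypothetical, "where appropriate" for the trailing slash is taken to mean only an empty path, and URLs with userinfo or IPv6 literal hosts are not handled.

```python
import re
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)


def fix_escapes(text):
    """Decode %XX for unreserved characters; upper-case all other escapes."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", replace, text)


def remove_dot_segments(path):
    """Resolve '.' and '..' segments, keeping a trailing slash where implied."""
    segments = path.split("/")
    output = []
    for index, segment in enumerate(segments):
        last = index == len(segments) - 1
        if segment == ".":
            if last:
                output.append("")
        elif segment == "..":
            if len(output) > 1:      # never pop the leading empty segment
                output.pop()
            if last:
                output.append("")
        else:
            output.append(segment)
    return "/".join(output)


def normalize_url(url):
    """Apply the six semantics-preserving normalizations listed above."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.hostname or ""    # .hostname is already lower-cased
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = "%s:%d" % (netloc, parts.port)
    path = remove_dot_segments(fix_escapes(parts.path)) or "/"
    query = fix_escapes(parts.query)
    return urlunsplit((scheme, netloc, path, query, parts.fragment))


print(normalize_url("HTTP://FLUIDINFO.com:80"))
# http://fluidinfo.com/
print(normalize_url("http://fluidinfo.com/a/./b/?arg=%7Ealice"))
# http://fluidinfo.com/a/b/?arg=~alice
```

Each step maps variants of one URL onto a single string, which is exactly the property an about tag needs if fragmentation is to be avoided.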

Happily, libraries to perform these normalizations already exist and are freely available for a number of programming languages, including Python. As noted above, Jehiah Czebotar’s urlnorm.py performs the task admirably in Python, so in the version of abouttag.py that I just pushed to Github (version 0.6) I have added a new convention, uri-2, corresponding to this behaviour and have made that the default. So now:

>>> from abouttag.uri import URI

>>> URI(u'http://fluidinfo.com')
u'http://fluidinfo.com/'

>>> URI(u'HTTP://FLUIDINFO.com:80')
u'http://fluidinfo.com/'

>>> URI(u'http://fluidinfo.com/a/./b/?arg=%7Ealice')
u'http://fluidinfo.com/a/b/?arg=~alice'

This is different from the old behaviour, which can be obtained by explicitly adding a convention argument of ‘uri-1’:

>>> URI(u'http://fluidinfo.com', convention=u'uri-1')
u'http://fluidinfo.com'
# note no trailing slash

>>> URI(u'HTTP://FLUIDINFO.com', convention=u'uri-1')
u'http://fluidinfo.com'
# Same downcasing, but again no trailing slash

>>> URI(u'http://fluidinfo.com:80', convention=u'uri-1')
u'http://fluidinfo.com:80'
# uri-1 didn't strip default ports

>>> URI(u'http://fluidinfo.com/a/./b/?arg=%7Ealice', convention='uri-1')
u'http://fluidinfo.com/a/./b/?arg=%7Ealice'
# nor did it undo unnecessary %-encoding or strip . & .. path segments.

Both the new and the old versions perform one additional normalization, which is to add a leading http:// if no scheme is present in the input. This is not because there is not a distinction between a domain and a URL, but rather because by calling the URI function the user is clearly indicating that this is a URI, which requires a scheme, and http:// is clearly the appropriate default scheme:

>>> URI(u'fluidinfo.com')
u'http://fluidinfo.com/'

Why...?

The reader may be wondering why I did not adhere to the RFC previously, and issued forth older versions of the abouttag library with the altogether inferior behaviour of uri-1. Ignorance, pure and simple.

10 December 2011

Siri: The Command Line for Everyone Else

Perhaps the biggest difference between the way in which “real” people use computers and geeks use computers is this:

Real people use Graphical User Interfaces because they find them intuitive and efficient.

Geeks generally prefer the command line, which they find easier, more precise and faster than GUIs. For geeks, GUIs tend to get in the way, limit and interfere.

Here’s a GUI for my files:

Finder

It has all the advantages of clickability, a set of icons for actions, and a few extra things hidden on menus and right-click (“context”) menus. But it is very limited.

Here, in contrast, is a command line, in its spare, minimalist glory:

CommandLine

If you don’t know what to type, the command line is intimidating, unhelpful and limiting. But if you do know, you can do almost anything: far from limiting, the command line is open and alive with virtually unbounded possibilities. Instead of having to navigate menus and finders and buttons and icons, the command line allows you to access almost anything the machine can do, all from one place, just by typing.

There are two major things that stop real people from benefitting from the command line and its liberating possibilities:

  1. People don’t know the commands or the right syntax for them.
  2. People don’t like typing. [1]

Enter Siri

Just like a command line, Siri has the potential to allow me to access anything my phone can do with no navigation: if I want to call Alex, I say “call Alex”. If I want to find out the height of Mount Everest, I just say “How high is Mount Everest?”. If I want to send a Tweet, I say “text Bird” and it will send a Tweet for me. (OK, this last one is a hack: if I say “Tweet this”, Siri knows exactly what I mean, but refuses saying “Sorry, Nicholas, I can’t help you with Twitter.” But I can send a tweet as a text message by saying “text Bird” because I have the Twitter short-code listed under Bird Parker. “Why not under Twitter?”, you ask? Because Siri still refuses if I list it under Twitter! Go figure!)

Of course, unlike the command line, I don’t have to get the syntax right with Siri. I just issue commands in plain English, and a reasonable proportion of the time it “understands” me.

Before Siri, the nearest thing to a syntax-free command line for real people was Google—a little box into which you can type anything in the reasonable hope that the search will return some relevant information. But Google is largely a one-trick pony, and even though it’s a good trick, it’s nothing like as powerful as when the software makes an attempt to understand the command and has the ability to take actions. (By offering a list of the top dozen or so “hits”, Google also hedges its bets, getting the user to pick the best-looking “answer”: quite apart from speech recognition and “comprehension”, Siri goes for broke by putting all its money on a single interpretation of what you said, only asking for clarification occasionally.)

Horace Dediu, in his podcast Getting to Know You, makes the case that the significance of Siri is that it allows Apple to learn much more about its users, allowing a new level of lock-in, power and service. That’s an interesting and important perspective that may prove to be right. But after a day with Siri, I think the more direct and immediate consequence is exactly that Siri could bring all the power of the command line to the masses.

[1] Actually, most geeks don’t really like typing either, and have myriad ways to reduce typing, from globbing (wild-card expansion) to command-line completion; but the basic point stands.

26 November 2011

Microsoft, iPads and the Innovator's Dilemma

This article was motivated by listening to Episode 44 of Hypercritical, the episode of John Siracusa’s weekly podcast that focused on the question “What Ails Microsoft?”; listen to it for context. I agreed with most of Siracusa’s analysis, but thought he missed a few key insights and perspectives.

You should listen to Siracusa’s podcast, but here are some of the key points in his analysis of what ails Microsoft:

  • Microsoft consistently refuses bet-the-company radical changes that will be good for the user and its own long-term business prospects because it is scared of damaging its cash cows (primarily Windows and Office, but also servers, Exchange etc.);
  • Microsoft serves primarily PC vendors, IT departments, backward-looking developers and perhaps Intel rather than its end-users; this leads to poor user experiences;
  • Microsoft follows rather than leads and so is always behind the curve (think Bing, Xbox, Zune, Windows Phone etc.);
  • Microsoft underestimates its own position of strength, which would in fact allow it to upset its customers more (to everyone’s long-term benefit), and so holds back for fear of losing what it has;
  • The demands of its core customers for a roadmap mean Microsoft always overpromises and underdelivers, has low marketing impact and never surprises competitors etc.
  • Apple is the reverse of all this, repeatedly taking bet-the-company risks, always focusing on the end user, repeatedly cannibalising its own products, being secretive and never publishing roadmaps, constantly leading and redefining categories (without necessarily being first mover), all in the manner of what Steve Denning calls Radical Management, which has led to its current position as the world’s most valuable company.

While I agree with most of these points, here is what I think Siracusa missed.

Clayton Christensen and The Innovator’s Dilemma

Clayton Christensen’s The Innovator’s Dilemma is the best business book I’ve ever read. Unusually, it contains a thesis that can’t be reduced to a single sentence. His interest is in how great companies get overthrown by disruptive innovators. His key ideas are as follows:

  • Christensen defines a disruptive technology as one that is worse than the incumbent technology on the key metrics that are usually used to measure quality in that space, but better in some other, traditionally less important metrics.
  • Although he offers several examples, Christensen’s clearest example is disk drives. Here, the two traditional key metrics are speed and capacity. Disk technologies have been replaced in waves, first with 8” disks being replaced with 5.25” disks, then 3.5” disks, then 2.5” disks then 1” disks. (Solid-state drives are now gradually starting to replace rotating disks.)
  • Christensen argues that incumbent leaders almost always succeed with sustaining (non-disruptive) innovations that improve the performance of the technology against the standard metrics, but almost always fail to bring to market new disruptive technologies, even though these are often first developed by the market-leading company. He says this happens primarily because leading companies tend to be “well managed”, and are strongly influenced by their best customers and partners, who are, almost by definition, mostly bought into the existing metrics. So when, for example, Winchester (the leading 8” disk manufacturer) asks its customers “would you be interested in a lower-power, physically smaller disk that has lower speed and less capacity?”, they say “no, that’s a terrible thing, we need speed and capacity”.
  • New entrants, often start-ups, see an opportunity to serve new markets, often consisting of people not using the incumbent technology, for whom the alternative metrics (in this case, size and power consumption) are more important than the traditional ones. For example, 8” disks didn’t work for PCs but 5.25” disks did; 5.25” disks didn’t work for laptops but 3.5” disks did (and then 2.5”); 2.5” disks didn’t work for iPods but 1” disks did. Now solid-state memory, which is fast but expensive/lower capacity, works for phones, cameras, tablets etc. in a way that even 1” disks didn’t.
  • A key point Christensen makes is that the new market, of non-consumption, is often unattractive to the incumbent leader, who typically sees it as small and offering low margins, but is highly attractive for newcomers, who typically hone themselves on lower margins as they serve it.
  • Over time, sustaining improvements to the new technology tend to improve it against the traditional metrics as well as the new ones: current 3.5” disks have much larger capacities and better latencies than did early ones. As they improve, they become more viable in increasing parts of the “old” market, and the old leader tends to be reduced to ever smaller, more niche parts of the market. Eventually, the new technology tends to get good enough for mainstream use and at this point the advantages of the new technology start to be more interesting to old customers. (“So I can get enough speed and capacity, but with a smaller footprint, less power consumption and a lower price: well sure!”) If it survives at all, the previous leader ends up serving only the very high end where the extremes of the old metrics are required.

iPads, PCs and the Innovator’s Dilemma

Apple was not the first to come up with the idea of a Tablet PC. In fact, Alan Kay came up with many of the key ideas in his remarkable 1972 paper on the Dynabook. But in the more recent past, Microsoft (especially Bill Gates) championed tablet computers and brought them to market a decade before Apple built the iPad. Microsoft, however, saw a tablet, through the ever-present and distorting lens of its Windows cash cow, as an enhancement to a traditional Windows PC: you add a touch-screen (and a stylus) to a traditional laptop running (of course) Windows and voilà, a tablet is born.

The iPad received a very lukewarm reception when it was launched, and was widely derided as (merely) a giant iPod Touch. It was criticized for being underpowered, closed, not running even standard Mac applications, let alone Windows software, not supporting “true” multi-tasking or windowing and more besides. Yet it quickly sold in the tens of millions and is clearly now replacing PCs for some people.

With some caveats, this fits Christensen’s model very well. The iPad is a worse general-purpose computer against the traditional metrics. It has slower hardware (though rarely feels slow), few ports, no user accounts, comparatively little storage, no hardware keyboard, limited, vetted software and (cough) no true multitasking, no Flash, no replaceable battery and limited upgrade options.

But look at the alternative new metrics, that show all the ways in which it is better for some people and purposes. It is extremely small and light. It is fantastically easy to use. Thanks to Apple’s control-freakery, installing software is simple and worry-free. It has a touch screen. It has no significant issues with viruses etc. Its battery genuinely lasts over 10 hours even when you use the machine intensively. It has stores for software, books, music and videos built in (and it probably already knows your credit card number). It has numerous sensors (cameras, microphones, accelerometers, gyroscopes and more). Software for it tends to be really cheap and some of it is of fantastic quality. It is supremely relaxing to use.

For people who mostly surf the web, do light email, play games, watch films, read books etc., the iPad is not just a “good enough” alternative to a laptop or even a desktop PC: it may actually be significantly better. The iPad 2 (and iOS 5) followed the pattern of sustaining improvements, both on the new metrics (usability, weight, size, sensors etc.) and the old (speed, capacity, ability to link to an external monitor, multitasking etc.).

Crucially, while Microsoft saw a tablet as a way to extend the PC, and added Touch features to Windows and made its tablet PCs full Windows PCs “with added Touch”, Apple redesigned all the upper layers of the operating system to give the best possible experience for the iPad as a new class of device. It didn’t worry about disrupting sales of its own Mac laptops, still less (naturally) those of Windows PCs: it just made the iPad as good as it could, in its own right.

Risk and Perfect 20-20 Hindsight

The other major point I feel Siracusa failed to make, and many people are missing, is that Steve Jobs’s and Apple’s Radical Management is a genuinely high-risk strategy: it can fail as well as succeed, and frequently does so. I think we need to separate out two ideas that I feel are being conflated. The first is “betting the company” on an uncertain new thing, which isn’t necessarily a good idea for a leading company, but makes more sense for a struggling company. The second is the aggressive development and marketing of new technologies that might cannibalize your existing business; this probably is a good idea, even if the new business is lower margin or lower value, because almost certainly someone will do it, and it’s better for the leader to do it to itself than for a competitor to do so.

[Image: Flippin’ Coins, by Pauli Antero on Flickr, Creative Commons, some rights reserved.]

In terms of the risk side, a comparison I like to make is with finding lucky people. Contrast two situations. If I say to you “Give me a coin and I’ll toss it ten times and get heads each time”, and then I do it, you’ll probably think that is quite impressive and either very lucky or (more likely) manipulated. But if I take a thousand people and get each of them to flip a coin repeatedly, and after each round of flipping I get all the people who got tails to stop, after 10 rounds I might well have a single person who got a sequence of 10 heads. But there would be nothing odd about that, and it certainly doesn’t require the person to have special powers or be “lucky” (in any non-scientific sense).
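To put rough numbers on the comparison, here is a minimal Python sketch, assuming fair coins and independent flips. The function names, the seed and the number of simulated games are illustrative choices, not anything from the original argument. It computes the chance of one named person getting ten heads, the chance that at least one of a thousand people does, and then checks the second figure by playing the knockout game repeatedly.

```python
import random


def chance_of_a_streak(people: int, rounds: int) -> float:
    """Probability that at least one person flips `rounds` heads in a row."""
    p_one_person = 0.5 ** rounds
    return 1.0 - (1.0 - p_one_person) ** people


def survivors_after(people: int, rounds: int, rng: random.Random) -> int:
    """Play one knockout game: anyone who flips tails drops out each round."""
    still_in = people
    for _ in range(rounds):
        # getrandbits(1) is a fair coin: 1 = heads (stay in), 0 = tails (out).
        still_in = sum(rng.getrandbits(1) for _ in range(still_in))
    return still_in


p_single = 0.5 ** 10                      # one named person: 1 chance in 1024
p_crowd = chance_of_a_streak(1000, 10)    # at least one of a thousand people

rng = random.Random(2011)                 # fixed seed so reruns play the same games
games = 500
games_with_a_winner = sum(survivors_after(1000, 10, rng) > 0 for _ in range(games))

print(f"one person:   {p_single:.4%}")
print(f"1000 people:  {p_crowd:.1%} (exact)")
print(f"1000 people:  {games_with_a_winner / games:.1%} (simulated over {games} games)")
```

Under these assumptions the single-person streak is about a 0.1% event, while a thousand-person crowd produces at least one such streak roughly 62% of the time, which is why the lone survivor is unremarkable.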

Apple now has the largest market capitalization of any company in the world, with massive success and profits, after a series of audacious, high-risk moves that worked out. I’m not saying for a moment that this is pure luck, or that Apple is like the one-in-a-thousand kid who got ten heads in a row, but I do think that the world has anointed Apple after the fact when many other companies have made “audacious” moves that didn’t work out and, in some cases, sent them under. (Time-Warner’s merger with AOL was certainly audacious.) I think the modern Apple has made a series of good moves, and combined those with backtracking where necessary (allowing native apps, giving Final Cut Pro a stay of execution, allowing various apps and books into the stores after poorly judged bans etc.), and has won for reasons that combine skill and luck; but even if Steve Jobs had lived, that doesn’t mean he didn’t have more lemons in him, and that these might not have come out and eventually hobbled or even killed the company.

Even today, when Microsoft is routinely left off the list of the key tech companies (now typically Apple, Google, Facebook, Amazon), or is at least seen as the laggard among these, it remains massively profitable and powerful. I think it is in terminal decline, and deserves to be, but it is far from irrelevant yet.

02 November 2011

Everyone Needs a Little Privacy

I’m delighted to announce two new features in Fluidinfo—the existence of a private namespace for everyone, and the creation of a readable Fluidinfo version number. Thanks are due to Jamu (@jkakar) and Manuel (@ceronman) for implementing these changes. They were both my requests/suggestions, so blame me if you hate them.

The private namespace

If you have an existing Fluidinfo account, you will find that you now have a top-level namespace called private. If you had such a namespace anyway, it won’t have been touched, but if you did not, it has been created, and all of its permissions have been set, as the name suggests, so that only you can access it.

Using Fish (or the online version at http://shell-fish.appspot.com), you can see this by using the ls command. So for me (logged in as njr), I can get a Unix-style listing of the permissions with:

$ fish ls -ld private
nrwc------   njr/private/

or a more detailed, Fluidinfo-style listing by saying

$ fish ls -Gd private
njr/private/:

NAMESPACE (/namespaces)
  Read
    list (read):        policy: closed; exceptions = [njr]
  Write
    create (create):    policy: closed; exceptions = [njr]
    update (metadata):  policy: closed; exceptions = [njr]
    delete (delete):    policy: closed; exceptions = [njr]
  Control
    control (control):  policy: closed; exceptions = [njr]
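
One way to read the policy/exceptions pairs in this listing is sketched below in Python. This is a toy model, not Fluidinfo's own code: the Permission class and the second user alice are hypothetical, and the reading of an open policy (everyone allowed except the listed exceptions) is an assumption, since the listing above only shows closed policies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Permission:
    policy: str                          # "open" or "closed"
    exceptions: frozenset = frozenset()  # users the policy does NOT apply to

    def allows(self, user: str) -> bool:
        if self.policy == "closed":
            # Closed: nobody is allowed except the listed exceptions.
            return user in self.exceptions
        # Open (assumed semantics): everybody is allowed except the exceptions.
        return user not in self.exceptions


# The five actions shown in the listing, each closed with njr as the exception.
ACTIONS = ("list", "create", "update", "delete", "control")
private_ns = {action: Permission("closed", frozenset({"njr"})) for action in ACTIONS}

for action, perm in private_ns.items():
    print(f"{action:8} njr={perm.allows('njr')}  alice={perm.allows('alice')}")
```

With every action closed and njr as the only exception, njr is allowed everything and any other user nothing, which is what "private" means here.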

New users registering from now on will also have this namespace created for them, with the permissions set to private.

We have given everyone a private namespace by default for two main reasons:

  • It means that, without having to know anything about how the permissions system works, all Fluidinfo users have an easy way to create both private and non-private information: if you want your data to be public, by default all your tags not under the private namespace are readable by everyone (but writable only by you); where you want data to be private, just use a tag in your private namespace.
  • Just as importantly, we expect that many Fluidinfo applications will now place some data in a user’s private namespace (as appropriate) and some at the top level, or in other namespaces that aren’t private by default. So an application to allow users to share a clipboard between devices, for example, would probably use private tags by default, but a rating application would be more likely to use tags that are public by default.

Just as most people never change any file permissions, we expect that most Fluidinfo users will not want to be messing around with tag permissions, and this change means they should not have to do so.

FAQ

  • Q1: I hate this: I don’t want a private namespace! Can I get rid of it?

    A1: Of course. You can safely delete the private namespace. In Fish, assuming you haven’t added anything to it you can do this simply by saying:

    $ fish rm private

    (The fish part of all these commands can be omitted if you are using the online version of Fish.)

    Or, if you prefer, you can alter the permissions on the namespace to make it non-private. For example, if you say

    $ fish perms default private

    that will reset the permissions on the namespace private to the defaults (making it readable by everyone, but writable only by you). I wouldn’t particularly recommend this course, but it’s up to you.

  • Q2: I already had a namespace called private, but its permissions aren’t set correctly. How can I fix them up?

    A2: You can do this with Fish’s perms command:

    $ fish perms private private

    Note that this only changes the permissions on the private namespace itself. If you have tags or namespaces under it, their permissions will not be affected, but new tags and namespaces that you create under private after this change will be affected. (I will add a -r option to Fish soon, for recursive permissions changes, but it isn’t there at the time of writing.)

  • Q3: I used private as a tag: have you just trashed all my private tags?

    A3: No. Fluidinfo is quite happy for you to have a tag and a namespace with the same name. If you had a private tag, this hasn’t been affected at all by the new private namespace.

  • Q4: Is everything under my private namespace private?

    A4: If we’ve just created the private namespace for you, yes; but if you had one already, it depends:

    A few months ago we made a change to Fluidinfo so that when a new tag or namespace is created, it inherits permissions from its parent namespace. So if you create new tags and namespaces under your private namespace, they will be private too, as any reasonable person would expect.

    It is important to note, however, that the Fluidinfo permissions system is not hierarchical, in the sense that changing the permissions on a parent namespace has no effect on its existing child namespaces and tags.

  • Q5: Will I run into problems if I buck the trend and don’t have a private namespace called private or if I make other tags and namespaces private?

    A5: No. It is obviously possible that Fluidinfo applications you use might make assumptions that your private namespace is private, or that it exists, or even that tags not under your private namespace are public, but such apps would be making unjustified assumptions.

    There should certainly be no problem at all with making any tag or namespace you like private, whether it is under the private namespace or not. Indeed, you can make your top-level namespace private if you like (though again, doing so will only affect new top-level tags and namespaces).

    But if you don’t have a strong philosophical objection, your life will probably be simpler if you go along with having a namespace called private, and probably by making its contents private too.
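
The inheritance behaviour described in A2 and A4 (permissions are copied from the parent when something is created, and are not re-applied when the parent later changes) can be illustrated with a small Python model. This is only a sketch of the rule as stated in this post: the Namespace class, the child names diary and passwords, and the user alice are all hypothetical, and a single read permission stands in for the full set.

```python
class Namespace:
    """Toy namespace: permissions are copied from the parent at creation only."""

    def __init__(self, path: str, policy: str, exceptions: set):
        self.path = path
        self.policy = policy               # "open" or "closed"
        self.exceptions = set(exceptions)  # own copy, never shared with the parent

    def create_child(self, name: str) -> "Namespace":
        # The child takes a snapshot of the parent's permissions as they are now.
        return Namespace(f"{self.path}/{name}", self.policy, self.exceptions)

    def set_perms(self, policy: str, exceptions: set) -> None:
        # Changes this namespace only; existing children are left untouched.
        self.policy = policy
        self.exceptions = set(exceptions)

    def readable_by(self, user: str) -> bool:
        listed = user in self.exceptions
        return listed if self.policy == "closed" else not listed


# A namespace called private that already existed with public-read permissions.
private = Namespace("njr/private", "open", set())
diary = private.create_child("diary")          # created before the fix

private.set_perms("closed", {"njr"})           # lock down the namespace itself
passwords = private.create_child("passwords")  # created after the fix

for item in (diary, passwords):
    print(f"{item.path}: readable by alice = {item.readable_by('alice')}")
```

In this model the older child stays readable by others until its own permissions are changed, while the newer one is private from the start, which is the situation a recursive option for perms would address.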

Fluidinfo API Version and Release Date

The other small but useful change is that Fluidinfo is now publishing information about the versions of the code and API deployed.

Every time a new version of the live Fluidinfo code is deployed, the tag fluiddb/release-date on the object with about tag fluidinfo will be updated. This is a plain text string (MIME type text/plain) in ISO 8601 extended format. Because it is a plain-text string, you can view it in a browser at

http://fluiddb.fluidinfo.com/about/fluidinfo/fluiddb/release-date

or with Fish by saying

$ fish show -a fluidinfo /fluiddb/release-date
Object with about="fluidinfo":
  /fluiddb/release-date = "2011-11-04T17:40:27Z"

The intention and commitment from the team is that every time the live Fluidinfo code changes, this release date will be updated (even if there is no intended change to the API). This obviously allows library and application writers to make statements such as:

Fish version 4.12 has been tested with the Fluidinfo release of 2011-11-04T17:40:27Z and all tests pass.

(which is true).
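
A library could also make this check at run time. The following Python sketch assumes the tag value is served as a bare string in exactly the format shown above (UTC, with a trailing Z). The function names and the TESTED_AGAINST constant are illustrative, and fetch_release_date needs network access to the URL given earlier, so it is defined here but not called.

```python
from datetime import datetime, timezone
from urllib.request import urlopen

RELEASE_DATE_URL = "http://fluiddb.fluidinfo.com/about/fluidinfo/fluiddb/release-date"
TESTED_AGAINST = "2011-11-04T17:40:27Z"   # the release quoted in the statement above


def parse_release_date(text: str) -> datetime:
    """Parse a timestamp such as 2011-11-04T17:40:27Z into an aware UTC datetime."""
    parsed = datetime.strptime(text.strip(), "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def fetch_release_date(url: str = RELEASE_DATE_URL) -> datetime:
    """Read the live release date (requires network access; not called here)."""
    with urlopen(url, timeout=10) as response:
        return parse_release_date(response.read().decode("utf-8"))


def released_since_tested(live: datetime, tested: str = TESTED_AGAINST) -> bool:
    """True if the live code is newer than the release the library was tested on."""
    return live > parse_release_date(tested)


# Offline example using a made-up later release date.
later = parse_release_date("2011-12-01T09:00:00Z")
print(released_since_tested(later))
```

A library might print a warning rather than fail when this returns True, since a newer release does not necessarily mean the API has changed.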

There is also an API version published using the tag fluiddb/api-version on the same object. Again this has MIME type text/plain, and can be viewed at

http://fluiddb.fluidinfo.com/about/fluidinfo/fluiddb/api-version

It can also be displayed using Fish with the command

$ fish show -a fluidinfo /fluiddb/api-version
Object with about="fluidinfo":
  /fluiddb/api-version = "1.13"

This is updated whenever a deliberate change to the API is made. If, at some point in future, multiple APIs are supported, the intention is to extend this to be a space-separated list of APIs supported in the current release.
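
Client code that reads this tag can be written now so that it keeps working if the value does become a space-separated list. The Python sketch below assumes each version is a dot-separated sequence of integers, as in 1.13; the function names and the example value "1.13 2.0" are hypothetical.

```python
from typing import List, Tuple

Version = Tuple[int, ...]


def parse_api_versions(text: str) -> List[Version]:
    """Split a space-separated list of versions into tuples of integers."""
    return [tuple(int(part) for part in item.split(".")) for item in text.split()]


def supports(text: str, wanted: str) -> bool:
    """True if the wanted version appears in the published list."""
    return parse_api_versions(wanted)[0] in parse_api_versions(text)


print(parse_api_versions("1.13"))        # today's single value
print(parse_api_versions("1.13 2.0"))    # a possible future multi-API value
print(supports("1.13 2.0", "1.13"))
```

Comparing tuples of integers rather than floats or strings keeps the ordering right: (1, 13) sorts after (1, 2), whereas the number 1.13 is smaller than 1.2.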

The API change log can be viewed at

http://doc.fluidinfo.com/fluidDB/api/changelog.html

and I’m guessing that from this point forward the change log will include the API version numbers for new features.
