15 December 2011

Fragmentation and URL Normalization

I have updated the abouttag.py library to use a new, better convention for normalizing URLs. The two main changes people will notice are:

  1. URLs that represent directories will now include, rather than exclude, a trailing slash:

    http://fluidinfo.com/

    rather than

    http://fluidinfo.com

  2. There is now a dependency on the excellent urlnorm.py, by Jehiah Czebotar.

The Issue: Fragmentation

The twin evils that the abouttag.py library and this blog exist to fight are fragmentation and overloading.

Fragmentation occurs in Fluidinfo when different users store information about the same thing on different objects, while overloading occurs when people store information about different things on the same object. In general, both of these are undesirable. Fragmentation reduces data sharing and makes it harder to extract information from the system, whereas overloading creates ambiguity and confusion.

One of the more common uses for Fluidinfo is for tagging web pages, and it is very natural to use the URL as the about tag, as almost everyone does. There is not much of a problem with overloading in this case (except to the extent that URLs point to web pages that change over time), but there is definitely fragmentation.

I would distinguish between two kinds of fragmentation in the case of URLs.

  1. Different representations of the same URL. Perhaps the most obvious example is the trailing slash on many URLs. Punctilious persons with good knowledge of web standards (and in particular RFC3986) prefer the inclusion of a trailing slash on URLs (and more generally, on URIs) where appropriate, and thus prefer

    http://fluidinfo.com/

    to the more colloquial

    http://fluidinfo.com

    Technically, these are different URLs, but web servers so routinely and uniformly redirect the latter to the former that they can be considered for all practical purposes the same. It seems highly desirable for any convention for about tags for URLs to map these two forms, along with other similar representational variants, to a common about tag.

  2. Different URLs that may or may not represent the same web page. The most obvious example of this is the www. prefix that used to be de rigueur and is now commonly (but not reliably) redundant. Although most right-thinking webmasters (webmistresses?) routinely redirect these to the same place, there is no general guarantee that the www. form (http://www.fluidinfo.com/) and the bare form (http://fluidinfo.com/) will produce the same page, nor even that they should both work.

    Standardizing this would therefore seem to be a normalization too far.
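
To make the two kinds of fragmentation concrete, here is a toy model in Python 3. It is only a sketch: the dictionary stands in for Fluidinfo objects keyed by about tag, and the names tag_all and add_root_slash, along with the users and tags, are invented for illustration rather than taken from Fluidinfo or abouttag.py.

```python
from collections import defaultdict


def tag_all(taggings, key=lambda about: about):
    """Group (about, tag, value) triples into one dict per about tag.

    `key` maps each raw about tag to the about tag actually used.
    """
    store = defaultdict(dict)
    for about, tag, value in taggings:
        store[key(about)][tag] = value
    return dict(store)


def add_root_slash(about):
    """Hypothetical, deliberately tiny normalizer: add the trailing slash
    to a bare scheme://host URL and leave everything else alone."""
    scheme, sep, rest = about.partition("://")
    if sep and not any(c in rest for c in "/?#"):
        return about + "/"
    return about


# Three users tag what they all believe is the same page.
taggings = [
    ("http://fluidinfo.com", "alice/rating", 5),
    ("http://fluidinfo.com/", "bob/comment", "useful"),
    ("http://www.fluidinfo.com/", "carol/seen", True),
]

raw = tag_all(taggings)                         # no normalization
merged = tag_all(taggings, key=add_root_slash)  # trailing slash unified

print(len(raw))     # 3 objects: fully fragmented
print(len(merged))  # 2 objects: the www. form is still separate
```

The first kind of fragmentation disappears once the representational variants share an about tag. The www. form stays on its own object, which is the deliberate choice described above.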

The Old and New Behaviour of abouttag.py

Fluidinfo is far from the only system with an interest in developing a canonical or normalized form for URLs. Search engines and social bookmarking sites (such as Pinboard and Delicious) work better if different URLs representing the same resource are collapsed, and as mentioned above, there is even a standard (RFC3986) for how to perform the canonicalization.

The relevant Wikipedia page describes six normalizations that preserve URL semantics. These are:

  • Converting the scheme and host to lower case (HTTP:// → http:// and FLUIDINFO.COM → fluidinfo.com)
  • Capitalizing letters in escape sequences (%3a → %3A)
  • Decoding percent-encoded octets of unreserved characters (%7E → ~)
  • Adding a trailing slash where appropriate (http://fluidinfo.com → http://fluidinfo.com/)
  • Removing the default port (http://fluidinfo.com:80/ → http://fluidinfo.com/)
  • Removing dot-segments (http://fluidinfo.com/accounts/./new/ → http://fluidinfo.com/accounts/new/)
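
For readers who want to see the six steps spelled out, the following is a minimal sketch in Python 3 using only the standard library. It is not the code in urlnorm.py or abouttag.py; the function names are my own, and it makes simplifying assumptions: the URL has an absolute or empty path, any user:password@ part is dropped, and IPv6 hosts are not handled.

```python
import re
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def _fix_escape(match):
    """Decode escapes of unreserved characters; upper-case the rest."""
    char = chr(int(match.group(1), 16))
    if char in UNRESERVED:
        return char
    return "%" + match.group(1).upper()


def remove_dot_segments(path):
    """Resolve '.' and '..' segments in an absolute path."""
    segments = path.split("/")
    kept = []
    for segment in segments:
        if segment == ".":
            continue
        if segment == "..":
            if len(kept) > 1:  # never pop the leading "" that marks the root
                kept.pop()
            continue
        kept.append(segment)
    if segments[-1] in (".", ".."):
        kept.append("")  # a trailing dot-segment leaves a trailing slash
    return "/".join(kept)


def normalize(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()            # lower-case the scheme
    host = parts.hostname or ""              # hostname comes back lower-cased
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = "%s:%d" % (host, parts.port)  # keep only non-default ports
    path = ESCAPE.sub(_fix_escape, parts.path)
    path = remove_dot_segments(path) or "/"  # empty path becomes "/"
    query = ESCAPE.sub(_fix_escape, parts.query)
    return urlunsplit((scheme, netloc, path, query, parts.fragment))


print(normalize("HTTP://FLUIDINFO.com:80"))
# -> http://fluidinfo.com/
print(normalize("http://fluidinfo.com/a/./b/?arg=%7Ealice"))
# -> http://fluidinfo.com/a/b/?arg=~alice
```

Escapes are fixed before dot-segments are removed, so that an encoded dot (%2E) is treated like a literal one.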

Happily, libraries to perform these normalizations already exist and are freely available for a number of programming languages, including Python. As noted above, Jehiah Czebotar’s urlnorm.py performs the task admirably in Python, so in the version of abouttag.py that I just pushed to GitHub (version 0.6) I have added a new convention, uri-2, corresponding to this behaviour and have made that the default. So now:

>>> from abouttag.uri import URI

>>> URI(u'http://fluidinfo.com')
u'http://fluidinfo.com/'

>>> URI(u'HTTP://FLUIDINFO.com:80')
u'http://fluidinfo.com/'

>>> URI(u'http://fluidinfo.com/a/./b/?arg=%7Ealice')
u'http://fluidinfo.com/a/b/?arg=~alice'

This is different from the old behaviour, which can be obtained by explicitly adding a convention argument of ‘uri-1’:

>>> URI(u'http://fluidinfo.com', convention=u'uri-1')
u'http://fluidinfo.com'
# note no trailing slash

>>> URI(u'HTTP://FLUIDINFO.com', convention=u'uri-1')
u'http://fluidinfo.com'
# Same downcasing, but again no trailing slash

>>> URI(u'http://fluidinfo.com:80', convention=u'uri-1')
u'http://fluidinfo.com:80'
# uri-1 didn't strip default ports

>>> URI(u'http://fluidinfo.com/a/./b/?arg=%7Ealice', convention='uri-1')
u'http://fluidinfo.com/a/./b/?arg=%7Ealice'
# nor did it undo unnecessary %-encoding or strip . & .. path segments.

Both the new and the old versions perform one additional normalization, which is to add a leading http:// if no scheme is present in the input. This is not to deny the distinction between a domain and a URL; rather, by calling the URI function the user is clearly indicating that the input is a URI, which requires a scheme, and http:// is the obvious default:

>>> URI(u'fluidinfo.com')
u'http://fluidinfo.com/'
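
The scheme-defaulting step on its own might look something like the Python 3 sketch below. The function name with_default_scheme and the test for "://" are assumptions of mine, not the actual logic in abouttag.py, and the sketch would wrongly prefix scheme-only forms such as mailto: addresses.

```python
def with_default_scheme(text, default="http"):
    """Prefix `default` when the input carries no scheme://.

    Hypothetical helper: looks only for "://", so "fluidinfo.com:80"
    is treated as host:port rather than as a scheme named fluidinfo.com.
    """
    if "://" in text:
        return text
    return "%s://%s" % (default, text)


print(with_default_scheme("fluidinfo.com"))
# -> http://fluidinfo.com
```

The result would then go through the usual normalization, which is where the trailing slash in the example above comes from.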

Why...?

The reader may be wondering why I did not adhere to the RFC previously, and instead issued forth older versions of the abouttag library with the altogether inferior behaviour of uri-1. Ignorance, pure and simple.
