24 December 2010

What Would Wikipedia Do? (WWWD)

Readers of this blog will know that some of us have long been interested in the subject of about tags and how to choose them in FluidDB. The question is important, because FluidDB’s information sharing paradigm is predicated on the idea that different users and applications share data through tagging common objects, and the about tag is the primary way to decide which object that should be. This post discusses a new idea Terry (@terrycojones) and I cooked up about a week ago. At this stage, I’m not proposing it as a convention, but simply discussing it as a possibility and soliciting comments.

WWWD / WDWD?

The kernel of the idea is to follow Wikipedia’s taxonomy by setting FluidDB about tags to the relative path of the Wikipedia page for the entity of interest.

For example, the (English-language) page for the Eiffel Tower has the page title “Eiffel Tower” and the URL

http://en.wikipedia.org/wiki/Eiffel_Tower

Under the putative convention under discussion, which I will tentatively call wwwd-1 for the rest of this post, the about tag would therefore be

Eiffel_Tower

i.e., if you wanted to tag the Eiffel Tower in FluidDB, the object you would use is the one whose about tag is Eiffel_Tower.

Arguably, with this example, it’s more a case of “What does Wikipedia do?” than “What would Wikipedia do?”, but we’ll come to that later.

So at its simplest, this convention says that to find the wwwd-1 about tag for an object that already exists in the English language edition of Wikipedia, you set the about tag to everything that follows http://en.wikipedia.org/wiki/ in the relevant page URL (without a trailing slash). [1]
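A minimal sketch of that rule in Python, assuming the input is a full English-language Wikipedia page URL. The function name wwwd1_about is hypothetical and is not part of FluidDB or of any agreed convention; the sketch leaves any %-encoding in the path untouched.

```python
from urllib.parse import urlsplit

WIKI_PREFIX = "/wiki/"


def wwwd1_about(page_url: str) -> str:
    """Return everything after /wiki/ in an English Wikipedia page URL.

    Hypothetical helper: a sketch of the wwwd-1 rule, not an official API.
    """
    parts = urlsplit(page_url)
    if parts.netloc != "en.wikipedia.org" or not parts.path.startswith(WIKI_PREFIX):
        raise ValueError(f"not an English Wikipedia page URL: {page_url}")
    # Drop the /wiki/ prefix and any trailing slash; keep %-encoding as it is.
    return parts.path[len(WIKI_PREFIX):].rstrip("/")


print(wwwd1_about("http://en.wikipedia.org/wiki/Eiffel_Tower"))  # Eiffel_Tower
```

Query strings and fragments are dropped by urlsplit, so a link to a section of a page yields the same about tag as a link to the page itself.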

The Case For wwwd-1

This convention has a number of things going for it, as well as a number of drawbacks. Let’s take the positives first.

1. Pragmatically, it’s easy. Wikipedia exists, is free, and is available to anyone with an internet connection (censorship aside). It is huge, beautiful and widely read, and it represents the work of countless thousands of (mostly) dedicated and intelligent humans, who have already created a rich taxonomy that—imperfect as it must be—has already disambiguated millions of terms. Following its lead is nothing if not pragmatic.

2. The Wikipedia URLs allow obvious things to have obvious, minimal URLs. Generalizing wildly, the most common pattern for URLs in Wikipedia is the name of the entity with spaces replaced by underscores, punctuation %-encoded and articles stripped, in title case. Whether you like underscores or not (and I don’t), this is pretty simple, and is described (loosely) here. Here are some examples of about tags wwwd-1 would include:

I have glossed over some issues here, but this is not a bad set of about tags.

3. They allow us to avoid reinventing wheels. At the time I am drafting this (17:43 GMT on 24th December 2010) this Wikipedia entry tells me that there are 3,511,257 entries in the English-language edition. (As I proof-read the post, on 28th December, rather impressively the total appears to have increased to 3,514,459.) In the scheme of things, that isn’t all that many; if FluidDB has any success at all, we will end up with orders of magnitude more objects with about tags than that. But most of our entries will have natural about tags that will be very straightforward to determine. For example, URLs will (modulo canonicalization) be their own about tags. I think there is an excellent chance that we can do better than Wikipedia for certain well-defined classes of numerous objects, such as books. But the huge value in Wikipedia’s taxonomy is that it deals with a very high proportion of the most noteworthy subjects in the world (almost by definition).

4. Even where entries do not exist in Wikipedia, they give us a kind of template to work from.

Disadvantages and Issues

While many of the benefits of the putative wwwd-1 convention are clear and impressive, there are a number of challenges, disadvantages and issues.

1. Splitting, Changing and Disambiguation. Although the URLs for longer-established Wikipedia pages are fairly stable now, they can and do change. This is not particularly problematical within Wikipedia, because there is only really one article on each page, controlled, in some cases, through reasoned debate and collective refinement, and in other cases, through edit wars. Presumably most people find articles by search most of the time anyway, and while it is clearly unfortunate if URLs change and therefore invalidate (or at least redirect) bookmarks, it is not very serious. The situation is very different in FluidDB: every user has his or her own tags, which can be attached to any object in the system. If the mainstream community collectively decides to move from (say) the object whose about tag is “Mercury” to represent the planet, to the object with the about tag “Mercury_(planet)”, it is no small or simple, automatic or guaranteed matter to get all the data moved. [2]

2. Non-uniformity. I have personally already used my Miró software to upload a structured form of data gathered from various Wikipedia pages to FluidDB, including (funnily enough) every element in the periodic table, and every planet (and dwarf planet) in the solar system. I documented the conventions that I used as planet-1 and element-1 and they were very simple: data for Earth was stored on the object with about tag planet:Earth, data on the planet Mercury was stored on the object with the about tag planet:Mercury, and data on the element Mercury was stored on the object with about tag element:Mercury. All of this was very straightforward, and once the scheme is known, it is easy to write code to generate the about tag for the object, knowing only the name of the planet or element. In contrast, using the wwwd-1 convention, life is fairly easy for individual users wanting to add a tag to a single, particular object, but much harder for anyone wanting to upload data automatically. The non-uniformity means that you need to look up Wikipedia before knowing what the about tag will be. This introduces a level of complexity not to be underestimated, and is, in my view, the single largest problem with the idea of wwwd-1.

3. Language Issues. For English speakers, there is a clear attraction to using unqualified English terms as the about tags in FluidDB. But what of the myriad other languages? One approach is to say all languages will be on the same footing, and they should all simply follow Wikipedia’s relative URLs in their own language. Mostly, different languages will use different objects for storing data about the same thing, but sometimes they will coincide; when they do coincide, the two languages will sometimes refer to the same real-world entity and sometimes to different ones. This is problematical in some situations, and less so, or not at all, in others. Another option is to follow Wikipedia more closely and include a language code in the about tag. So we might have en:Earth, rather than Earth, and fr:Terre in French. If we fail to specify language, then English speakers will be likely to put information concerning cafés (informal restaurants) on the same object as French speakers will put information about coffee (the one with the about tag Cafe). More happily, if we avoid the prefix, information about Johnny Hallyday will end up on the same object for both English and French speakers; indeed, this will be a particularly common pattern for entries about individual people, since their names are normally not translated, at least within languages with common or similar alphabets. (Mao, for example, would map to the about tag Mao_Zedong in both, even though his name originates in a completely different alphabet.)

4. Priority, Cultural Sensitivity and Longevity. One of the very attractive things about wwwd-1 is that the most common/likely meaning of a given term tends to win the battle for the cleanest, simplest name. This is deliberate. The disambiguation guidelines say (at the time of writing):

Although an ambiguous term may refer to more than one topic, it is often the case that one of these topics is highly likely – much more likely than any other, and more likely than all the others combined – to be the subject being sought when a reader enters that ambiguous term in the Search box. If there is such a topic, then it is called the primary topic for that term. If a primary topic exists, the ambiguous term should be the title of, or redirect to, the article on that topic.

This is entirely sane and defensible. But it does mean that human judgement is required, and leads to possible charges of cultural imperialism and so forth. Terry Jones, of Monty Python fame, is surely the most famous Terry Jones and most likely target of any search in Wikipedia today. But if Fluidinfo were to take off and become the next Google, perhaps it would be a different Terry Jones who people would expect to find occupying the object with the about tag Terry Jones.

5. Ugliness and Humans vs. The Machines: wwwd-1T. It’s all very well to say that Wikipedia URLs are quite nice, simple and readable, but they are still designed for machines rather than human beings. After all, why Eiffel_Tower rather than Eiffel Tower? The answer is that best practice dictates that spaces in URLs usually be “%-encoded” as %20, so that, in a URL, the preferred (safe) form would be Eiffel%20Tower, which is clearly worse, not better, than Eiffel_Tower from most perspectives. The same goes for many other punctuation symbols. On the other hand, there is nothing to say that we need to use the URL, which is in any case largely automatically constructed from the page title, which does have actual spaces. So a variation of the proposal, which I actually prefer, is to use the Wikipedia page title, rather than the relative URL, as the about tag. Then we actually do put information on the Eiffel Tower on the FluidDB object with the about tag Eiffel Tower. We permit any Unicode text in about tags anyway, so we might as well take advantage of that and use spaces as required. That way, when we look at the about tag, it is maximally readable, and says exactly what we want it to say; and we don’t have to read through percent encodings, underscores and all that other nasty computer stuff. I will call this alternative putative convention wwwd-1T.

6. Standardization. There remain some questions about the precise about tag to use, even if following either wwwd-1 or wwwd-1T, and this arises somewhat inevitably from the fact that ordinarily URLs are not taken to be case sensitive. [3] So while the fairly strong Wikipedia convention seems to be to use Title Case, in fact this is not always followed, and nothing breaks when it is not. A case in point concerns the San Andreas Fault. While I said, above, that wwwd-1 would give us San_Andreas_Fault, at the time of writing, a search on San Andreas Fault returns a URL ending with San_andreas_fault, though the page title is San Andreas Fault. FluidDB is case-sensitive, so we would need to decide. From a readability perspective, Title Case is clearly preferable; but title case is intrinsically ambiguous other than in the context of a fixed algorithm, and there will inevitably be failures as a result of getting the case wrong. A more pragmatic suggestion might be to use all lower case, as I suggested in the book-1 convention. I will call this third variant wwwd-1L for now.
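To make the variants in the list above easier to compare, here is a rough Python sketch. All three function names are hypothetical. It rests on some assumptions: the wwwd-1T title is only approximated from the relative URL (the real page title can differ, as the San Andreas Fault case shows), and since it is not settled whether wwwd-1L would lower-case the URL form or the title, the sketch simply lower-cases whatever it is given. The prefixed form covers both the planet-1/element-1 style and the language-code option.

```python
from urllib.parse import unquote


def wwwd1t_about(relative_path: str) -> str:
    """Approximate a page title from a wwwd-1 relative path (assumption:
    decoding %-escapes and turning underscores into spaces is close enough)."""
    return unquote(relative_path).replace("_", " ")


def wwwd1l_about(about: str) -> str:
    """Lower-case an about tag so differently cased spellings agree."""
    return about.lower()


def prefixed_about(prefix: str, name: str) -> str:
    """Build a uniform prefix:name about tag, e.g. planet:Earth or en:Earth."""
    return f"{prefix}:{name}"


print(wwwd1t_about("Eiffel_Tower"))    # Eiffel Tower
print(wwwd1t_about("Eiffel%20Tower"))  # Eiffel Tower

# FluidDB is case-sensitive, so these two spellings are different objects
# under wwwd-1 but the same object once lower-cased.
print(wwwd1l_about("San_Andreas_Fault") == wwwd1l_about("San_andreas_fault"))  # True

print(prefixed_about("planet", "Mercury"))   # planet:Mercury
print(prefixed_about("element", "Mercury"))  # element:Mercury
print(prefixed_about("fr", "Terre"))         # fr:Terre
```

The prefixed tags can be generated from the name alone, whereas the two wwwd functions still need the Wikipedia relative path as input, which is exactly the lookup step that makes automatic uploads harder.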

Clearly, this is a complicated issue. To my mind there are significant merits in the idea, particularly in the wwwd-1T and wwwd-1L variants, but there are also significant problems. More than anything else, I think the convention is pretty good for human end users, but pretty difficult for machines (or to put it a different way, for application writers). If it were adopted, I think it would be best adopted as a default convention for the case when there is no other established/better convention, and we would be better to establish alternative conventions like book-1 for classes of objects that are numerous, mostly not included in Wikipedia and capable of being generated automatically from readily available information. In cases where we departed from the relevant wwwd convention, it would be particularly fantastic if we (collectively) added some kind of pointer from the object that would be used under that convention to the object we actually use, in cases where the Wikipedia page exists. For example, we would point from The_Road_to_Wigan_Pier to book: the road to wigan pier (george orwell). (Needless to say, we can look forward to a time when not only do we have a pointer from our FluidDB objects to the corresponding Wikipedia page (or pages), where such exists, but Wikipedia has pointers back to FluidDB. If only there were some way for us to add such pointers in Wikipedia . . .)

I’d be fascinated to gather views, in the comments or elsewhere.

[1] To be clear, the putative wwwd-1 convention is not a convention for tagging Wikipedia pages. To do that, you simply use the full page URL (preferably canonicalized using the url-1 convention). The proposed convention is simply based on the Wikipedia page naming convention (practice).
[2] When I say ‘the community decides to move’ I am not referring to any kind of formal procedure: there are no such procedures with FluidDB. I can put my information wherever I like, and that is unlikely to change. I really mean the conventions in use, whether agreed in some way or simply used by the bulk of FluidDB users.
[3] More particularly, Wikipedia does not interpret URLs in a case-sensitive manner.
