12 August 2011

Sequences in Fluidinfo

I’ve added a powerful new feature to Fish that allows it to manage arbitrary sequences. I’ll introduce the feature first; a couple of later sections then explain the motivation for it, for anyone interested.

A Fish sequence is simply a numbered collection of textual items that is added to over time. For example, I have a sequence called thoughts in Fish. When I have a new thought, I say something like:

$ fish thought "Perhaps savethewords.org would like to add threatened words and their definitions to Fluidinfo"

to which Fish responds:

10: Perhaps savethewords.org would like to add threatened words and their definitions to Fluidinfo
2011-08-11

This provides confirmation that the thought has been recorded (since this data is read back from Fluidinfo after the write), that it is thought #10 and that it was recorded on 11th August.

I can look at my last five thoughts by using this command:

$ fish thoughts 5

6: The Art of Tagging & the Tagging of Art
2011-07-31

7: Integrated fish history across all machines and shell-fish
2011-08-02

8: We need a word for words like suBtle, that are self-descriptive in non-onomatopoeic ways.
2011-08-08

9: A website featuring dodrantal things
2011-08-10

10: Perhaps savethewords.org would like to add threatened words and their definitions to Fluidinfo
2011-08-11

It defaults to the last 20 if I don’t specify. I can also filter by putting search terms on the command line:

$ fish thoughts art
6: The Art of Tagging & the Tagging of Art
2011-07-31

and I can specify a range of thoughts:

$ fish thoughts 8-9
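
The three kinds of listing argument shown above (a count, search terms and a range) could be told apart along the following lines. This is only a sketch: the function name parse_list_args and the returned dictionary are hypothetical, and Fish's real argument handling may well differ.

```python
import re

DEFAULT_COUNT = 20  # from the post: the last 20 items are listed by default


def parse_list_args(args):
    """Classify listing arguments as a count, a range or search terms.

    Hypothetical helper for illustration; not Fish's actual code.
    """
    spec = {"count": DEFAULT_COUNT, "range": None, "terms": []}
    for arg in args:
        range_match = re.fullmatch(r"([0-9]+)-([0-9]+)", arg)
        if range_match:
            # e.g. "8-9": list items 8 to 9 inclusive
            spec["range"] = (int(range_match.group(1)), int(range_match.group(2)))
        elif re.fullmatch(r"[0-9]+", arg):
            # e.g. "5": list the last five items
            spec["count"] = int(arg)
        else:
            # anything else is treated as a search term
            spec["terms"].append(arg.lower())
    return spec
```

With this sketch, parse_list_args(["8-9"]) yields a range of (8, 9), while parse_list_args(["art"]) keeps the default count and records one search term.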

What I’ve built into Fish isn’t, of course, a thought command, but a general capability to create sequences. The base command is mkseq, which makes a new sequence, and whose general form is:

mkseq sequence-name [plural-form [tag]]

The thought and thoughts commands were made using the command:

mkseq thought thoughts private/thought

The first argument is the name of the command to add an item to the sequence. The second argument, which is optional, is the plural form, which is used to list items in the sequence; if it is not specified, an s is added to the singular form. The final argument, also optional, is the tag to use for storing the sequence. In this case, I wanted to keep my thoughts private, so I chose to use the tag njr/private/thought; if not specified, the name of the sequence item is used for the tag, in the user’s top-level namespace, i.e. njr/thought in this case.

[Readers with intimate knowledge of Fluidinfo’s permissions system will spot a flaw with this scheme: since Fluidinfo’s permissions are not inherited, simply using the tag njr/private/thought won’t make my thoughts private, even though my private namespace has its permissions set to deny access to others. However, that will change next week, when Fluidinfo will gain create-time inheritance of permissions; and I’m not going to release this functionality until then.]

The mkseq command creates two aliases, as we can see using the alias command:

$ fish alias
thought:
  njr/.fish/alias = "seq private/thought"

thoughts:
  njr/.fish/alias = "listseq private/thought"

This reveals that the other two commands I’ve added are seq and listseq. The seq command adds to a sequence using the tag specified, and the listseq command lists recent items from the sequence with the tag specified.
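
To make the defaults concrete, here is a minimal sketch of how mkseq's arguments could map onto the two aliases. The function name mkseq_aliases is hypothetical and the sketch only builds the alias strings; it does not talk to Fluidinfo, and the real implementation may be organized differently.

```python
def mkseq_aliases(name, plural=None, tag=None):
    """Return the two aliases that mkseq would create, as a dict.

    Hypothetical helper: it mirrors the defaults described in the post.
    """
    plural = plural or name + "s"  # default plural: add an "s"
    tag = tag or name              # default tag: the sequence name itself
    return {
        name: f"seq {tag}",        # alias that adds an item
        plural: f"listseq {tag}",  # alias that lists items
    }
```

Calling mkseq_aliases("thought", "thoughts", "private/thought") gives the same two alias values as the listing above, and mkseq_aliases("idea") falls back to "ideas" and the tag "idea".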

Because Fish stores aliases in Fluidinfo itself, if I use Fish from more than one place, I’m but a sync away from being able to record thoughts from any of those places.

The organization of sequence data

There are a number of ways that the trinity of sequence commands above could have been implemented, but this is how I’ve done it. I’m going to use thoughts as the exemplar, here, but what follows applies to any Fish sequence.

The first question is: what object or objects should be used to store the sequence? In Fluidinfo, it is very natural to use one object per thought. I could have used anonymous objects (ones with no about tag), and that’s not a ridiculous idea in this case, where I’m not expecting sequences to make significant use of Fluidinfo’s social aspects. In fact, however, what I decided to do was to use the integers. So my first thought is stored on the object whose about tag is 0, the second is on the object with about tag 1 and so forth.

Thus, if I use Fish’s tags command, I should see my thought number 10:

$ fish tags -a 10
Object with about="10":
  /fluiddb/about = "10"
  /musicbrainz.org/related-albums = {"album:10 (asleep at the wheel)", "album:10 (box set)", "album:10 (cali gari)", "album:10 (divididos)", "album:10 (dj db)", "album:10 (enuff znuff)", "album:10 (freestyle)", "album:10 (harri marstio)", "album:10 (i8u)", "album:10 (johan johansson)", "album:10 (john anderson)", "album:10 (k s choice)", "album:10 (kate rusby)", "album:10 (linea 77)", "album:10 (liroy)", "album:10 (ll cool j)", "album:10 (mcguffey lane)", "album:10 (mercyme)", "album:10 (miskolci ütősök)", "album:10 (perplex)", "album:10 (sharon kips)", "album:10 (suat suna)", "album:10 (supersilent)", "album:10 (the dramatics)", "album:10 (the fools)", "album:10 (the stranglers)", "album:10 (user)", "album:10 (various artists)", "album:10 (wet wet wet)", "album:10 (тент)"}
  /musicbrainz.org/related-artists = {"artist:10"}
  /njr/index/about
  /njr/private/thought = "Perhaps savethewords.org would like to add threatened words and their definitions to Fluidinfo"
  /njr/private/thought-date = 20110811.0735
  /njr/private/thought-number = 10
  /wordtools/gcide = "10 \10\ adj.
1. denoting a quantity consisting of one more than nine and
one less than eleven; -- representing the number ten as
Arabic numerals

Syn: ten, x
[WordNet 1.5 +PJC]"

And indeed, in amongst other data, you see three njr/private/thought tags and a couple of others. The thought itself is stored on the tag njr/private/thought, as a string. (Don’t try this at home though: njr/private is private, notwithstanding publication of select extracts on this blog, and will be invisible to users who are not njr.)

Its number is stored, as a numeric value, on the tag njr/private/thought-number: this might seem redundant, given that the about tag for the object also stores that information, but the fact it is numeric is extremely useful, allowing me to query based on the thought number using inequalities.
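
A small illustration of why the numeric copy matters: about tags are strings, and strings compare character by character, so an inequality over them does not behave like one over numbers. The sketch below is plain Python with made-up lists, not a Fluidinfo query.

```python
# About tags are strings; thought numbers are stored as numbers.
about_tags = ["8", "9", "10", "11"]
numbers = [8, 9, 10, 11]

# Comparing as text: "10" and "11" sort before "9", so nothing matches.
after_nine_as_text = [a for a in about_tags if a > "9"]

# Comparing as numbers gives the intended answer.
after_nine_as_numbers = [n for n in numbers if n > 9]

print(after_nine_as_text)     # []
print(after_nine_as_numbers)  # [10, 11]
```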

Finally, the date is stored as a numeric value. Fluidinfo doesn’t support a date type, so the way I decided to store dates was as a numeric value in the form YYYYMMDD.hhmmss. This again makes it easy to do queries based on a date range if I want to. I am just using the local time and not worrying about time zone issues. Using Greenwich Mean Time (GMT; or Coordinated Universal Time, UTC, as I believe modern anti-imperialists are supposed to call it) would have ensured that the time ordering of the datestamps was more consistent with the sequence item numbers, but at the cost of making querying less natural, especially for people not lucky enough to live near longitude 0°.
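
As a sketch of this encoding, the following Python turns a local time into a YYYYMMDD.hhmmss number and tests whether a datestamp falls in a range of days. The function names datestamp and in_date_range are hypothetical, chosen for illustration only. Note that trailing zeros vanish when the number is displayed, which is why 07:35:00 on 11th August appears as 20110811.0735.

```python
from datetime import datetime


def datestamp(moment):
    """Encode a (local) datetime as a float of the form YYYYMMDD.hhmmss."""
    return float(moment.strftime("%Y%m%d.%H%M%S"))


def in_date_range(stamp, first_day, last_day):
    """True if stamp falls on a day from first_day to last_day inclusive.

    first_day and last_day are integers of the form YYYYMMDD.
    """
    return first_day <= stamp < last_day + 1


stamp = datestamp(datetime(2011, 8, 11, 7, 35, 0))
print(stamp)                                     # 20110811.0735
print(in_date_range(stamp, 20110801, 20110831))  # True
```

Because the most significant digits are the year, then the month, then the day, ordinary numeric comparison orders the datestamps chronologically.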

In addition to using these three tags to store the sequence itself, Fish uses one further tag to record the next sequence number to allocate, so as to avoid the need to perform a complex query to know this. The tag it uses is, in this case, njr/private/thought-next and it stores this on the object for the user njr. Taking advantage of command substitution through left-quoting, and Fish’s integration of the abouttag library, we can see this by saying:

$ fish show -a "`fish abouttag fi-user njr`" private/thought-next
Object with about="Object for the user named njr":
/njr/private/thought-next = 11

In case this isn’t clear, the shell first executes the command

$ fish abouttag fi-user njr

which returns the about tag Fluidinfo uses for the user object for njr. The show command then becomes:

$ fish show -a "Object for the user named njr" private/thought-next

which yields the expected result, 11.
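
Putting the pieces together, the bookkeeping for adding an item can be sketched as below. This is an in-memory model only: a plain dictionary stands in for Fluidinfo, the function name add_item is hypothetical, and Fish's actual code (which goes through the Fluidinfo API) is not shown in this post.

```python
def add_item(store, user_about, tag, text, stamp):
    """Add text to a sequence and return the number it was given.

    store maps about tags to {tag: value} dictionaries, standing in for
    Fluidinfo objects. tag is the full tag path, e.g. "njr/private/thought".
    """
    user_obj = store.setdefault(user_about, {})
    # The counter lives on the user's object; the post puts the first item on 0.
    number = user_obj.get(tag + "-next", 0)

    item = store.setdefault(str(number), {})  # object whose about tag is the integer
    item[tag] = text                          # the item itself, as a string
    item[tag + "-number"] = number            # numeric copy, for inequalities
    item[tag + "-date"] = stamp               # YYYYMMDD.hhmmss as a number

    user_obj[tag + "-next"] = number + 1      # next number to allocate
    return number


user = "Object for the user named njr"
store = {user: {"njr/private/thought-next": 10}}
n = add_item(store, user, "njr/private/thought", "A new thought", 20110811.0735)
print(n, store[user]["njr/private/thought-next"])  # 10 11
```

Keeping the counter on a single object means that allocating a number needs only one read and one write, rather than a query for the largest existing number.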

That about tag convention

I imagine some readers might be raising an eyebrow at my use of the objects with about tags corresponding to the integers. Wouldn’t I normally argue that about tags should be specific, encourage sharing and discourage pollution of useful objects with irrelevant information?

All these things are true, but I think that, for the most part, they don’t apply in this case. I think most of these sequences will be private anyway, and in that sense there is no anti-social pollution involved in using the integer about tags. More fundamentally, however, even if they are not private, I don’t think there is any great need for these items to be social. Where data is not social, and especially when private, the choice of object in Fluidinfo is purely for the convenience of the user. There is both logic and convenience in using the integers.

There is a partial down-side to using the same objects for different sequences, and it is that any extra information needs to be stored in sequence-specific tags. This is one of the reasons that I used njr/private/thought-date rather than a more generic njr/datestamp. It also means that if I wanted to add tags to the object, they would also have to be similarly qualified. It may be that it will turn out that these are strong reasons to move away from the integers, but for the moment I feel the benefits outweigh the disadvantages. I may add support later for sequences using either anonymous objects or ones with more specific about tags (like thought-10 or even njr/thought-10). But for now, this is it.

As always, I’ll be interested in any thoughts; I hope to roll this out some time next week, after the API update.

I’ll finish with a couple of sections describing where the motivation for sequences in Fluidinfo and Fish came from, in case anyone is interested.

The Importance of the Palm Pilot

[Image: PalmVxSmall.jpeg]

Before 1999, I had always had a problem organizing information. Wherever and however I recorded it, I would struggle to find it or access it when I needed it. The information came in many shapes and sizes—books I wanted to buy, recipes, phone numbers, quotations, thoughts, plans, ideas, locations of items I had found “a safe place” for and so forth. None of it was very voluminous, but I wanted access to it everywhere and I wanted to be able to search it and organize it.

In 1999, I bought a Palm V, and from that point forward it’s not much of an exaggeration to say that my problems with information disappeared. The Palm became the place to record all (low-volume) information. I carried it almost everywhere when I was out, and I synchronized it to my computer, and this meant I almost always had access to all of it.

The Palm V was brilliant in a number of ways. It had four primary applications: notes, to-do lists, the address book, and the calendar. That doesn’t sound like much, but when combined with a couple of core features of Palm OS, and a little imagination, it became a tremendously powerful system for organizing little pieces of information. For me, the two extra crucial features were search and categorization. In 1999, Palm OS had full-text search across all data in the major applications, and it worked superbly. It was actually better than most so-called full-text search today. (As a trivial example, my iPhone today still won’t search photograph names.)

The second core feature of Palm OS that added to its organizational prowess was categories. Everything could be categorized, and though they were a little mean in allowing (I think) only 12 categories, that was just enough. I could organize my little bits of information into categories like books, food, travel, work and so forth. Even more importantly, I could keep separate to-do lists in a dozen or so categories.

Perhaps the most interesting application to me was the Address Book, which I subverted as a database for anything to do with people. I kept a few categories of contacts in there (friends, family, business etc.), but I also had categories like books, music and quotations. For example, I stored quotations under their originator in the quotations category. I stored information about books under their author in the books category and so forth. It was mad, but worked remarkably well.

By being almost always with me, both at my desk and on the go, and providing categorization and full-text search, the data in the Palm could genuinely function as my central repository for little bits of information. A problem I had always had went away.

I moved from a Palm V to a Vx, then to a Sony Clié (also running Palm OS) and eventually to a Palm T|X, all of which were fine machines, and the data came with me. But four years ago, the iPhone was born, and there was not a moment’s doubt that this was the device that would finally allow me to move from carrying a phone and a PDA to a single device. (The Handsprings were never tempting, and nothing before or since, I would argue, comes close to the iPhone in overall power and utility.) But marvellous as the iPhone is, despite its ever-improving synchronization methods, the iPhone + Mac combination is not as powerful for organizing little bits of information as the Palm was.

The Power of Logging

[Image: miro.png]

I am not particularly happy about organizations collecting ever-increasing quantities of data about me, though like most people I put up with it in some cases because of the benefits that come with, for example, carrying a device that geolocates me day and night.

I am, however, much happier about the idea of recording information myself, for my benefit, under my control.

For the last four years or so, I have been working on the Artists Suite, a set of data analysis tools whose lead component is Miró. From very early on, I was clear that I wanted the software to log almost everything it did. Miró is a command-driven system, and among other things, it records every command issued to it (on a per-session basis) and all the results it generates. Miró never deletes any of its history (though users can, obviously) and as a result I now have nearly four years of logs detailing how I’ve used Miró myself: logs of well over 10,000 analysis sessions. This has turned out to be even more useful than I expected. It helps, I suppose, that as an analysis tool, Miró can read, search and analyse its own logs. But the fundamental power lies simply in recording everything, with no effort required on my part. Miró annotates the data, of course, with datestamps and sequence numbers, and records how the session was invoked and so forth, and I regularly go back and use the logs to reconstruct what I did previously in a way I simply would be unable to do otherwise. This has made a powerful impression on me.
