12 August 2011

Sequences in Fluidinfo

I’ve added a powerful new feature to Fish to allow it to manage arbitrary sequences. I’ll introduce the feature first; a couple of later sections describe the motivation for it, for anyone interested.

Sequences in Fluidinfo

A Fish sequence is simply a numbered collection of textual items that is added to over time. For example, I have a sequence called thoughts in Fish. When I have a new thought, I say something like:

$ fish thought "Perhaps savethewords.org would like to add threatened words and their definitions to Fluidinfo"

to which Fish responds:

10: Perhaps savethewords.org would like to add threatened words and their definitions to Fluidinfo
2011-08-11

This provides confirmation that the thought has been recorded (since this data is read back from Fluidinfo after the write), that it is thought #10 and that it was recorded on 11th August.

I can look at my last five thoughts by using this command:

$ fish thoughts 5

6: The Art of Tagging & the Tagging of Art
2011-07-31

7: Integrated fish history across all machines and shell-fish
2011-08-02

8: We need a word for words like suBtle, that are self-descriptive in non-onomatopoeic ways.
2011-08-08

9: A website featuring dodrantal things
2011-08-10

10: Perhaps savethewords.org would like to add threatened words and their definitions to Fluidinfo
2011-08-11

It defaults to the last 20 if I don’t specify. I can also filter by putting search terms on the command line:

$ fish thoughts art
6: The Art of Tagging & the Tagging of Art
2011-07-31

and I can specify a range of thoughts:

$ fish thoughts 8-9
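
The three ways of selecting items shown above (a count, search terms, or a range) could be told apart along the following lines. This is only a sketch in Python: the function name parse_listseq_args and the dictionaries it returns are invented for illustration, and Fish's own argument handling may well differ.

```python
import re

DEFAULT_COUNT = 20  # the last 20 items are listed when nothing is specified


def parse_listseq_args(args):
    """Classify listing arguments (hypothetical helper, not Fish's code).

    Returns one of:
      {"last": n}          for a bare count, e.g. ["5"]
      {"range": (lo, hi)}  for a range, e.g. ["8-9"]
      {"terms": [...]}     for search terms, e.g. ["art"]
    """
    if not args:
        return {"last": DEFAULT_COUNT}
    if len(args) == 1:
        if re.fullmatch(r"\d+", args[0]):
            return {"last": int(args[0])}
        match = re.fullmatch(r"(\d+)-(\d+)", args[0])
        if match:
            return {"range": (int(match.group(1)), int(match.group(2)))}
    # Anything else is treated as a list of search terms.
    return {"terms": list(args)}
```

One consequence of this particular choice is that a purely numeric search term cannot be expressed; whether Fish has the same limitation is not something this sketch can tell you.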

What I’ve built into Fish isn’t, of course, a thought command, but a general capability to create sequences. The base command is mkseq, which makes a new sequence, and whose general form is:

mkseq sequence-name [plural-form [tag]]

The thought and thoughts commands were made using the command:

mkseq thought thoughts private/thought

The first argument is the name of the command to add an item to the sequence. The second argument, which is optional, is the plural form, which is used to list items in the sequence; if it is not specified, an s is added to the singular form. The final argument, also optional, is the tag to use for storing the sequence. In this case, I wanted to keep my thoughts private, so I chose to use the tag njr/private/thought; if not specified, the name of the sequence item is used for the tag, in the user’s top-level namespace, i.e. njr/thought in this case.

[Readers with intimate knowledge of Fluidinfo’s permissions system will spot a flaw with this scheme: since Fluidinfo’s permissions are not inherited, simply using the tag njr/private/thought won’t make my thoughts private, even though my private namespace has its permissions set to deny access to others. However, that will change next week, when Fluidinfo will gain create-time inheritance of permissions; and I’m not going to release this functionality until then.]

The mkseq command creates two aliases, as we can see using the alias command:

$ fish alias
thought:
  njr/.fish/alias = "seq private/thought"

thoughts:
  njr/.fish/alias = "listseq private/thought"

This reveals that the other two commands I’ve added are seq and listseq. The seq command adds to a sequence using the tag specified, and the listseq command lists recent items from the sequence with the tag specified.
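
The defaulting rules for mkseq, and the two aliases it produces, can be summarized in a few lines of Python. This is a sketch of the behaviour described above, not Fish's source; the function name mkseq_aliases is an assumption made for the example.

```python
def mkseq_aliases(name, plural=None, tag=None):
    """Return the two alias definitions a mkseq-style command would create.

    name   -- command that adds an item to the sequence
    plural -- command that lists items; defaults to name + "s"
    tag    -- tag used to store the sequence, relative to the user's
              top-level namespace; defaults to name
    """
    plural = plural or name + "s"
    tag = tag or name
    return {name: "seq " + tag, plural: "listseq " + tag}
```

Calling mkseq_aliases("thought", "thoughts", "private/thought") yields the same two expansions that the alias listing above shows.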

Because Fish stores aliases in Fluidinfo itself, if I use Fish from more than one place, I’m but a sync away from being able to record thoughts from any of those places.

The organization of sequence data

There are a number of ways that the trinity of sequence commands above could have been implemented, but this is how I’ve done it. I’m going to use thoughts as the exemplar, here, but what follows applies to any Fish sequence.

The first question is: what object or objects should be used to store the sequence? In Fluidinfo, it is very natural to use one object per thought. I could have used anonymous objects (ones with no about tag), and that’s not a ridiculous idea in this case, where I’m not expecting sequences to make significant use of Fluidinfo’s social aspects. In fact, however, what I decided to do was to use the integers. So my first thought is stored on the object whose about tag is 0, the second is on the object with about tag 1 and so forth.

Thus, if I use Fish’s tags command, I should see my thought number 10:

$ fish tags -a 10
Object with about="10":
  /fluiddb/about = "10"
  /musicbrainz.org/related-albums = {"album:10 (asleep at the wheel)", "album:10 (box set)", "album:10 (cali gari)", "album:10 (divididos)", "album:10 (dj db)", "album:10 (enuff znuff)", "album:10 (freestyle)", "album:10 (harri marstio)", "album:10 (i8u)", "album:10 (johan johansson)", "album:10 (john anderson)", "album:10 (k s choice)", "album:10 (kate rusby)", "album:10 (linea 77)", "album:10 (liroy)", "album:10 (ll cool j)", "album:10 (mcguffey lane)", "album:10 (mercyme)", "album:10 (miskolci ütősök)", "album:10 (perplex)", "album:10 (sharon kips)", "album:10 (suat suna)", "album:10 (supersilent)", "album:10 (the dramatics)", "album:10 (the fools)", "album:10 (the stranglers)", "album:10 (user)", "album:10 (various artists)", "album:10 (wet wet wet)", "album:10 (тент)"}
  /musicbrainz.org/related-artists = {"artist:10"}
  /njr/index/about
  /njr/private/thought = "Perhaps savethewords.org would like to add threatened words and their definitions to Fluidinfo"
  /njr/private/thought-date = 20110811.0735
  /njr/private/thought-number = 10
  /wordtools/gcide = "10 \10\ adj.
1. denoting a quantity consisting of one more than nine and
one less than eleven; -- representing the number ten as
Arabic numerals

Syn: ten, x
[WordNet 1.5 +PJC]"

And indeed, in amongst other data, you see three of njr’s thought tags and a couple of others. The thought itself is stored on the tag njr/private/thought, as a string. (Don’t try this at home though: njr/private is private, notwithstanding publication of select extracts on this blog, and will be invisible to users who are not njr.)

Its number is stored, as a numeric value, on the tag njr/private/thought-number: this might seem redundant, given that the about tag for the object also stores that information, but the fact it is numeric is extremely useful, allowing me to query based on the thought number using inequalities.

Finally, the date is stored as a numeric value. Fluidinfo doesn’t support a date type, so the way I decided to store dates was as a numeric value in the form YYYYMMDD.hhmmss. This again makes it easy to do queries based on a date range if I want to. I am just using the local time and not worrying about time zone issues. Using Greenwich Mean Time (GMT; or Coordinated Universal Time, UTC, as I believe modern anti-imperialists are supposed to call it) would have ensured that the time ordering of the datestamps was more consistent with the sequence item numbers, but at the cost of making querying less natural, especially for people not lucky enough to live near longitude 0°.
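
To make the numeric date format concrete, here is a small Python sketch of encoding a local time as YYYYMMDD.hhmmss and testing it against a range of days. The function names datestamp and in_date_range are illustrative assumptions; nothing here is taken from Fish itself.

```python
from datetime import datetime


def datestamp(when):
    """Encode a (local) datetime as a number of the form YYYYMMDD.hhmmss."""
    # Formatting as text first avoids any arithmetic on the fractional part.
    return float(when.strftime("%Y%m%d.%H%M%S"))


def in_date_range(stamp, first_day, last_day):
    """True if stamp falls on or between two days given as YYYYMMDD integers."""
    # Every stamp on a given day lies at or above that day's integer and
    # below the next integer, so simple inequalities are enough.
    return first_day <= stamp < last_day + 1
```

Because the day is the integer part and the time of day the fractional part, an ordinary numeric comparison is all a date-range query needs.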

In addition to using these three tags to store the sequence itself, Fish uses one further tag to record the next sequence number to allocate, so as to avoid the need to perform a complex query to know this. The tag it uses is, in this case, njr/private/thought-next and it stores this on the object for the user njr. Taking advantage of command substitution through left-quoting, and Fish’s integration of the abouttag library, we can see this by saying:

$ fish show -a "`fish abouttag fi-user njr`" private/thought-next
Object with about="Object for the user named njr":
  /njr/private/thought-next = 11

In case this isn’t clear, Fish first executes the command

$ fish abouttag fi-user njr

which returns the about tag Fluidinfo uses for the user object for njr. The show command then becomes:

$ fish show -a "Object for the user named njr" private/thought-next

which yields the expected result, 11.
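
Putting the pieces together, the layout described in this section (one object per integer, three sequence-specific tags, and a counter on the user object) can be modelled with a plain dictionary. This is a sketch under stated assumptions: the store is an in-memory stand-in for Fluidinfo, and the function names add_item and items_in_range are invented for the example.

```python
def add_item(store, user, tag, text, stamp):
    """Add text to the sequence user/tag and return its number.

    store maps about tags to {tag path: value} dictionaries, loosely
    imitating Fluidinfo objects; it is not the real API.
    """
    base = "%s/%s" % (user, tag)
    user_about = "Object for the user named %s" % user
    counter = store.setdefault(user_about, {})
    number = counter.get(base + "-next", 0)   # first item goes on object "0"
    obj = store.setdefault(str(number), {})   # one object per integer
    obj[base] = text
    obj[base + "-date"] = stamp
    obj[base + "-number"] = number
    counter[base + "-next"] = number + 1      # next number to allocate
    return number


def items_in_range(store, user, tag, lo, hi):
    """Select (number, text) pairs by inequality on the numeric -number tag."""
    base = "%s/%s" % (user, tag)
    key = base + "-number"
    found = [(obj[key], obj[base]) for obj in store.values()
             if key in obj and lo <= obj[key] <= hi]
    return sorted(found)
```

Keeping the counter on a single well-known object means that adding an item needs only a read and a write of one tag, rather than a query for the current maximum.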

That about tag convention

I imagine some readers might be raising an eyebrow at my use of the objects with about tags corresponding to the integers. Wouldn’t I normally argue that about tags should be specific, encourage sharing and discourage pollution of useful objects with irrelevant information?

All these things are true, but I think that, for the most part, they don’t apply in this case. I think most of these sequences will be private anyway, and in that sense there is no anti-social pollution involved in using the integer about tags. More fundamentally, however, even if they are not private, I don’t think there is any great need for these items to be social. Where data is not social, and especially when private, the choice of object in Fluidinfo is purely for the convenience of the user. There is both logic and convenience in using the integers.

There is a partial down-side to using the same objects for different sequences, and it is that any extra information needs to be stored in sequence-specific tags. This is one of the reasons that I used njr/private/thought-date rather than a more generic njr/datestamp. It also means that if I wanted to add tags to the object, they would also have to be similarly qualified. It may be that it will turn out that these are strong reasons to move away from the integers, but for the moment I feel the benefits outweigh the disadvantages. I may add support later for sequences using either anonymous objects or ones with more specific about tags (like thought-10 or even njr/thought-10). But for now, this is it.

As always, I’ll be interested in any thoughts; I hope to roll this out some time next week, after the API update.

I’ll finish with a couple of sections describing where the motivation for sequences in Fluidinfo and Fish came from, in case anyone is interested.

The Importance of the Palm Pilot

Before 1999, I had always had a problem organizing information. Wherever and however I recorded it, I would struggle to find it or access it when I needed it. The information came in many shapes and sizes—books I wanted to buy, recipes, phone numbers, quotations, thoughts, plans, ideas, locations of items I had found “a safe place” for and so forth. None of it was very voluminous, but I wanted access to it everywhere and I wanted to be able to search it and organize it.

In 1999, I bought a Palm V, and from that point forward it’s not much of an exaggeration to say that my problems with information disappeared. The Palm became the place to record all (low-volume) information. I carried it almost everywhere when I was out, and I synchronized it to my computer, and this meant I almost always had access to all of it.

The Palm V was brilliant in a number of ways. It had four primary applications: notes, to-do lists, the address book, and the calendar. That doesn’t sound like much, but when combined with a couple of core features of Palm OS, and a little imagination, it became a tremendously powerful system for organizing little pieces of information. For me, the two extra crucial features were search and categorization. In 1999, Palm OS had full-text search across all data in the major applications, and it worked superbly. It was actually better than most so-called full-text search today. (As a trivial example, my iPhone today still won’t search photograph names.)

The second core feature of Palm OS that added to its organizational prowess was categories. Everything could be categorized, and though they were a little mean in allowing (I think) only 12 categories, that was just enough. I could organize my little bits of information into categories like books, food, travel, work and so forth. Even more importantly, I could keep separate to-do lists in a dozen or so categories.

Perhaps the most interesting application to me was the Address Book, which I subverted as a database for anything to do with people. I kept a few categories of contacts in there (friends, family, business etc.), but I also had categories like books, music and quotations. For example, I stored quotations under their originator in the quotations category. I stored information about books under their author in the books category and so forth. It was mad, but worked remarkably well.

By being almost always with me, both at my desk and on the go, and providing categorization and full-text search, the data in the Palm could genuinely function as my central repository for little bits of information. A problem I had always had went away.

I moved from a Palm V to a Vx, then to a Sony Clié (also running Palm OS) and eventually to a Palm T|X, all of which were fine machines, and the data came with me. But four years ago, the iPhone was born, and there was not a moment’s doubt that this was the device that would finally allow me to move from carrying a phone and a PDA to a single device. (The Handsprings were never tempting, and nothing before or since, I would argue, comes close to the iPhone in overall power and utility.) But marvellous as the iPhone is, despite its ever-improving synchronization methods, the iPhone + Mac combination is not as powerful for organizing little bits of information as the Palm was.

The Power of Logging

I am not particularly happy about organizations collecting ever-increasing quantities of data about me, though like most people I put up with it in some cases because of the benefits that come with, for example, carrying a device that geolocates me day and night.

I am, however, much happier about the idea of recording information myself, for my benefit, under my control.

For the last four years or so, I have been working on the Artists Suite, a set of data analysis tools whose lead component is Miró. From very early on, I was clear that I wanted the software to log almost everything it did. Miró is a command-driven system, and among other things, it records every command issued to it (on a per-session basis) and all the results it generates. Miró never deletes any of its history (though users can, obviously) and as a result I now have nearly four years of logs detailing how I’ve used Miró myself: logs of well over 10,000 analysis sessions. This has turned out to be even more useful than I expected. It helps, I suppose, that as an analysis tool, Miró can read, search and analyse its own logs. But the fundamental power lies simply in recording everything, with no effort required on my part. Miró annotates the data, of course, with datestamps and sequence numbers, and records how the session was invoked and so forth, and I regularly go back and use the logs to reconstruct what I did previously in a way I simply would be unable to do otherwise. This has made a powerful impression on me.

11 August 2011

Gigs in Fluidinfo

I’ve written previously on about-tag conventions for music in Fluidinfo, and if you search for almost any artist or album in Fluidinfo you’ll see the great progress that Eric Seidel (@gridaphobe) has already made in this area, with more to come.

So far, the main conventions I’ve discussed concern artists, albums and tracks, which are exemplified by

What about gigs?

My first thoughts are the following:

  • Most gigs are primarily defined by an artist (or sometimes artists) and a date.
  • Very occasionally an artist may play two gigs in one day; in that case, adding a venue or a time ought to be sufficient to disambiguate it.
  • As usual, these things are easy to know if you attend the gig, have a ticket or a listing, and are pretty unambiguous.

I therefore think there are two prime options for defining gigs. The first would be to use only the artist and date, in the main case, and to qualify when necessary with a disambiguating time (this being less ambiguous than place). The second option is to use artist, date and start time in every case. (I suppose time zone considerations could mess up the non-ambiguity of full datestamps, but I don’t care.)

The former has the considerable advantage of simplicity, and I would imagine will be unique well over 99.9% of the time; the latter is more precise and uniform, but has the significant disadvantage that start times are harder to know and less clearly defined.

On balance, therefore, I am tempted to suggest that gigs be exemplified by the following:

gig:dean friedman (2011-08-10)

and, where an artist performs more than once on the same day, and when it is important to distinguish:

gig:dean friedman (2011-08-10:21:00)

(I think sub-minute accuracy will probably not be required in this case.)
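
As a sketch of the proposed convention, the following Python function builds both forms of the gig about tag. The name gig_abouttag is hypothetical, and the normalization used here (lower-casing and collapsing whitespace) is an assumption made for the example; any real implementation would presumably share whatever normalization is used for artists elsewhere.

```python
def gig_abouttag(artist, date, time=None):
    """Build a gig about tag such as 'gig:dean friedman (2011-08-10)'.

    date is 'YYYY-MM-DD'; time, if given, is 'hh:mm' and is only needed
    when an artist plays more than once on the same day.
    """
    name = " ".join(artist.lower().split())  # assumed normalization only
    when = date if time is None else "%s:%s" % (date, time)
    return "gig:%s (%s)" % (name, when)
```

Making the time an optional argument mirrors the suggestion above: the simple form is the default, and the qualified form is available when it is needed.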

As an example, Dean Friedman, surely one of the finest songwriters and performers playing today, performed last night, as part of the Edinburgh Fringe, at the Music Box, at Stevenson College. Here is the object for that gig

and here is the address of my njr/review tag for it:

http://fluiddb.fluidinfo.com/about/gig:dean friedman (2011-08-10)/njr/review

Its value is:

$ fish show -a 'gig:dean friedman (2011-08-10)' review
Object with about="gig:dean friedman (2011-08-10)":
  /njr/review = "http://fluiddb.fluidinfo.com/about/gig:dean%20friedman%20%282011-08-10%29/njr/review.html"

That URL points to the review itself, which is stored in Fluidinfo, in the tag njr/review.html, and can be viewed at http://fluiddb.fluidinfo.com/about/gig:dean friedman (2011-08-10)/njr/review.html
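
The readable form of the address and the percent-encoded form stored in njr/review are related by ordinary URL quoting. A minimal Python sketch, using the standard library's urllib.parse.quote (the helper name about_url is an assumption for this example):

```python
from urllib.parse import quote

BASE = "http://fluiddb.fluidinfo.com/about/"


def about_url(about, tag_path):
    """Return the percent-encoded URL for tag_path on the object named about."""
    # Keep ':' readable; spaces and parentheses are percent-encoded.
    return BASE + quote(about, safe=":") + "/" + tag_path
```

Here about_url("gig:dean friedman (2011-08-10)", "njr/review.html") reproduces the encoded value shown in the fish show output above.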

A reasonable alternative to gig would be concert, which I suspect will be used in preference for classical music, but the ubiquity of the term gig for non-classical music (and even, colloquially, for classical concerts) suggests that gig is the better starting point.

09 August 2011

Alias Fish and Sync

I’ve been making a number of changes to the Fluidinfo shell, Fish. I haven’t yet pushed them to GitHub or to the online version, Shell-Fish, for various reasons, but I can start to document them.

Aliases

The provision of aliases is a fairly basic part of a shell that Fish has lacked until now. I have added a simple form of aliasing, illustrated by the following example:

fish alias plp 'show -q "has njr/lastpage" /about'

This creates an alias called plp that expands to the text given so that plp will show the about tags for any objects tagged with the njr/lastpage tag. If I run this command now I get:

$ fish plp
1 object matched
Object f79d5ea3-50c1-4c9e-b98e-7bbe46b69ee1:
  /fluiddb/about = "http://www.guardian.co.uk/"

because the page tagged with njr/lastpage is currently the Guardian’s website.

A couple of syntactic details to note about these aliases:

  1. The alias only applies to the first (non-flag) word in the command.

  2. Any arguments that follow the aliased term are added to the substituted command. So, for example, the alias

    alias parisrating 'show -a "Paris" rating'

    if invoked as

    fish parisrating /alice/rating

    will show both my rating of Paris (as specified in the alias), and Alice’s, as specified in the command, viz:

    $ fish parisrating /alice/rating
    Object with about="Paris":
      /njr/rating = 9
      /alice/rating = "smelly"

  3. There is no provision for using positional arguments such as $1 yet. This will change, as surely as night follows day.
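
The first two rules can be sketched in Python as follows. This is an illustration of the behaviour described, not Fish's implementation: the function name expand_alias is invented, and using shlex to split the text into words is an assumption about quoting.

```python
import shlex


def expand_alias(command, aliases):
    """Expand an alias on the first non-flag word of command.

    aliases maps alias names to expansion text.  Returns the resulting
    list of words; a command that uses no alias is returned unchanged.
    """
    words = shlex.split(command)
    for i, word in enumerate(words):
        if word.startswith("-"):
            continue  # skip leading flags
        if word in aliases:
            # Arguments after the alias are appended to the expansion.
            return words[:i] + shlex.split(aliases[word]) + words[i + 1:]
        break  # only the first non-flag word is considered
    return words
```

With the parisrating alias defined as above, expanding "parisrating /alice/rating" gives the words show, -a, Paris, rating and /alice/rating, which is why both ratings appear in the output.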

So far, so boring. But here’s something slightly more interesting: where should Fish store its aliases?

Storing Aliases in Fluidinfo

The question need only be posed for the answer to present itself: obviously, Fish should store aliases in Fluidinfo; as it does. This turns out to be quite interesting.

Fish stores aliases on objects whose about tag is the alias; in other words, the alias parisrating is stored on the object whose about tag is parisrating:

$ fish tags -a parisrating
Object with about="parisrating":
  /fluiddb/about = "parisrating"
  /njr/.fish/alias = "show -a "Paris" rating"

As you can see, the alias is stored in a tag called njr/.fish/alias, and the value of that tag is simply the expansion text. (I know, I know: the output is confusing, embedding, as it does, double quotes in a double-quoted string, without any escaping. There is an excuse, but not one that’s worth the electrons.)

Both the .fish namespace and the .fish/alias tag are private, by default, as you can see:

$ fish ls -ld .fish .fish/alias
nrwc------   njr/.fish/
trwc------   njr/.fish/alias

or if you prefer your permissions Fluidinfo-style:

$ fish ls -Ld .fish .fish/alias

njr/.fish/:
     read: policy: closed; exceptions = [njr]
    write: policy: closed; exceptions = [njr]
  control: policy: closed; exceptions = [njr]


njr/.fish/alias:
     read: policy: closed; exceptions = [njr]
    write: policy: closed; exceptions = [njr]
  control: policy: closed; exceptions = [njr]

Of course, the user can change this.

Storing the alias this way involves a small leakage of potentially private information, in the sense that people can see that an object with the about tag parisrating exists, though not that anyone is using it as an alias, or who is doing so, or what the alias expands to. I could have avoided this by using anonymous objects and another .fish tag instead of the about tag, but I think the approach I’ve adopted is better overall. If you share my philosophical perspective that objects for every possible about tag already exist, but are lazily instantiated, there is no leakage, but obviously in the non-platonic Amazon server farm where the data is hosted, Plato’s writ does not run. (This is probably a good thing; imagine what Amazon would charge to host Fluidinfo’s data if we admitted that every possible string is stored.)

The principal advantage of storing the alias in Fluidinfo is that it is available from anywhere. So if I have Fish on several machines (and I do) I can create the alias once and use it from everywhere. In principle, I can also use it from the online version of Fish (Shell-Fish), but that isn’t implemented yet, which is one of the reasons I haven’t committed this to GitHub yet.

There is also a downside to storing aliases in Fluidinfo, which is that retrieving the definitions requires a Fluidinfo query. Given that alias expansion precedes command matching (so that built-in Fish commands can be replaced), this lookup is required before each command is evaluated; for Fish’s single-shot mode, where a single Fish command is entered from a Unix or Windows prompt, that introduces a delay I am not keen to accept.

For this reason, the Fish cache has been born.

Caching Fluidinfo for Fish

The Fish cache is simply a dump of Fish’s internal representation of certain Fluidinfo objects to local storage. Specifically, on Unix, Fish writes a file (a pickle file) to ~/.fishcache.username where username is the authenticated user’s username. When Fish is invoked with arguments, it reads the cache, which includes all the objects used for aliases.

When an alias is created (or deleted), Fish first makes the change in Fluidinfo, then updates the cache.

There is then a new sync command, which updates the cache from Fluidinfo. It is important to be clear that this synchronization does not involve any kind of mediation: the cache is cleared and updated from Fluidinfo; Fluidinfo is the source of truth.

This means that in order to get any aliases (or alias deletions) performed by a different copy of Fish, you need to perform a sync operation.

When Fish is invoked without arguments, Fish performs a sync before accepting any input.
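
The cache behaviour described above (a pickle file per user, replaced wholesale on sync) might look roughly like this in Python. It is a sketch only: the function names are invented, the cached data is assumed to be a plain dictionary, and fetch_from_fluidinfo is a hypothetical callable standing in for whatever queries Fish really issues.

```python
import os
import pickle


def cache_path(username, home=None):
    """Path of the cache file, ~/.fishcache.username."""
    home = home or os.path.expanduser("~")
    return os.path.join(home, ".fishcache." + username)


def write_cache(path, objects):
    """Dump the cached objects to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(objects, f)


def read_cache(path):
    """Return the cached objects, or an empty dict if there is no cache yet."""
    if not os.path.exists(path):
        return {}
    with open(path, "rb") as f:
        return pickle.load(f)


def sync(path, fetch_from_fluidinfo):
    """Replace the cache with whatever Fluidinfo currently holds."""
    objects = fetch_from_fluidinfo()  # hypothetical callable returning a dict
    write_cache(path, objects)        # no merging: Fluidinfo is the source of truth
    return objects
```

Because sync overwrites rather than merges, an alias deleted elsewhere simply vanishes from the local cache the next time sync runs.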

The plan for the online version of Fish is very similar to the interactive version except that instead of using local files, the online version will store its cache in the “local” database (which, in the case of Google App Engine, means the Data Store).

I think this approach holds great promise, not only for aliases, but for other important Fish data, probably including configuration parameters.

Summary of the alias, unalias, sync and showcache commands

alias

The alias command is summarized as follows:

alias [name [expansion-text]]

With no parameters, alias lists all aliases and their expansions.

With a single parameter, alias lists the expansion for the alias specified (if it exists).

With two or more parameters, alias defines (or redefines) an alias. It is best to quote the expansion text as a single parameter to stop Fish from interpreting it, though in simple cases this is strictly unnecessary. Here are some examples:

$ fish alias book 'abouttag book'

$ fish book 'Fugitive Pieces' 'Anne Michaels'
book:fugitive pieces (anne michaels)

$ fish alias book
book:
  njr/.fish/alias = "abouttag book"

$ fish alias
book:
  njr/.fish/alias = "abouttag book"

parisrating:
  njr/.fish/alias = "show -a "Paris" rating"

plp:
  njr/.fish/alias = "show -q "has njr/lastpage" /about"

unalias

There is also an unalias command. You can probably guess how it works. To remove the three aliases above you would say:

fish unalias book parisrating plp

showcache and sync

Finally, the showcache command can be used to show the contents of the cache, and the sync command can be used to update it from Fluidinfo.

In this particular case, I deleted the parisrating alias on another machine with this result:

$ sync

$ showcache
Cache:

  fluiddb/about="plp":
      njr/.fish/alias = "show -q "has njr/lastpage" /about"

  fluiddb/about="book":
      njr/.fish/alias = "abouttag book"

Today, the cache stores only aliases, so the output from showcache tends to look rather similar to that from alias, but as other types of data start to be cached, a sharper distinction will be drawn.

I have a good feeling about this.

Post Script

If you are using Fluidinfo directly, and are not taking advantage of the /values API, you really should check it out: it is dramatically faster. I have been a bit slow to upgrade Fish to use it, but that is now happening incrementally, and is yielding very significant and welcome performance improvements everywhere. Think of the difference between drinking beer with a straw and glugging it straight from the glass; there is simply no comparison.

[Thanks to @joannescrub for taking the time to point out a number of typos in this post.]
