Purple Stuff not Included


Thanks to Aaron Swartz for getting this 'g set up in doubleplusfeckingquick time, and for hosting on the magnificent Vorpal everything written here. I snicker-snack to you, Vorpal.

Back through the mists of time, there existed a 'g called "Bring It On Home". BIOH, as I came to call it, was an experiment in pumping out mundane bits of whimsy into a Webpage; miscoranda is not going to be like that, I hope.

What you can expect from miscoranda is stuff. Lots and lots of stuff. From time to time I write bits of prose that I don't get to post anywhere, and now I have miscoranda as an outlet for them. I also hack up bits of code (mainly Python), and little Semantic Web experiments, which may draw in some of the rubes.

So that's about it for this introduction, except for the traditional quote.

Phoebe: I can't say croissant. [realizes] Oh my God!

Thanks for reading, and enjoy miscoranda.

Short Script


I've written a small Python script that compiles a page with "<&" and "&>" delimited bits of Python embedded in it. I originally wrote it as part of a server framework that I was developing, but found that it was useful by itself.

It's only a short bit of code, and it may have been done (better) elsewhere, but here it is anyway. Share and enjoy: GPL 2.



PServer is a little Python script that I just wrote, which can proxy HTTP via HTTP GET. It runs a local server that enables people to access the Web through your network. Obviously, I don't endorse any misuse (or even use) of this program, and the usual caveat usor applies: make sure that you know what the code does before you run it. Released under GPL 2: share and enjoy!
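For the curious, the general shape of such a thing can be sketched with the standard library alone (all the names and the query parameter here are my own, not PServer's):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse
from urllib.request import urlopen

def target_of(path):
    # e.g. /?uri=http://example.org/ -> http://example.org/
    return parse_qs(urlparse(path).query).get('uri', [''])[0]

class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Fetch the URI named in the query string, relay the bytes.
        body = urlopen(target_of(self.path)).read()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(body)

def serve():
    # Know what this does before you run it, as the post says.
    HTTPServer(('localhost', 8000), ProxyHandler).serve_forever()
```

A real proxy would also relay status codes and headers; this only relays the body.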

The Semantic Web... in Haiku


An Introduction / to the hard Semantic Web / in simple Haiku is a follow-on article from my own popular Introduction to the Semantic Web. It provides details of the RDF model, the serializations of RDF (including an XML RDF tutorial), and various Semantic Web vocabularies.

You'll need a cursory understanding of XML to be able to work through the XML RDF section, but otherwise it should be quite accommodating to the relative newcomer. It's all in good fun, though, so don't take it too seriously.

If you like this, and you're getting into Web accessibility, you might want to try WCAG in Haiku too.



For a while now, I've been wanting an application that could quickly convert \uHHHH or U+HHHH into UTF-8 encoded Unicode, and vice versa. I had been using some convoluted method that I won't go into here, but then I found Wesley Phoa's clipboard.py module for interacting with the Windows clipboard. (N.B. It requires Sam Rushing's dynwin module.)

After that, all I needed to write was this simple script to convert the clipboard's contents to UTF-8:-

import re, clipboard

# The pattern is a reconstruction: match \uHHHH or U+HHHH escapes.
clipboard.setText(re.sub(r'(\\u|U\+)([0-9A-Fa-f]{4})',
     lambda m: unichr(int(m.group(2), 16)),
     unicode(clipboard.getText())).encode('utf-8'))

and this one for converting from UTF-8 back into \uHHHH:-

# Escape each non-ASCII character as \uHHHH (uppercase hex).
clipboard.setText(str(re.sub(u'([\u0080-\uffff])',
     lambda m: '\\u%04X' % ord(m.group(1)),
     unicode(clipboard.getText(), 'utf-8'))))

Simple. A couple of "uniclip" shortcuts and a few tests later, and I'm very satisfied with the results.

I Thôt that I Ôt


Jonathan Swift's little Parody of the Tendency amongst some Speakers of the English Language to abbreviate to an Absurdity is wonderful, and not just for its rampant Noun Capitalization (which is, notably, a Feature of German). It was first published in the Tatler of September 28th 1710, and I was glad to find it on the Web.

Whilst we're discussing the comparative Vulgarity of Elisions, I may as well note that miscoranda is a portmanteau word—a contraction of "miscellaneous memoranda", if you hadn't already worked that out. I decided to use the Latin plural for "memorandum", because "memorandums" is just plain silly.

Oh, and the "c" in miscoranda is silent. Sshhh!

What a Character


I like some Unicode characters more than others, and, well, who doesn't? We all have our little favourites; little companions that make us smile when we browse through the thousands of codepoints. With that in mind, I thought I'd trace my own top ten list of those titillating typographical tokens.

U+00DE Thorn (Þ)
Michael Everson's Sorting Thorn alone would be enough to warrant the inclusion of this character in the top ten. The thorn was originally a runic character that found its way into the Old English alphabet.
U+01C3 Retroflex Click (ǃ)
Looks like an exclamation mark, but it actually represents a certain clicking sound found in some African languages, e.g. Khoisan.
U+03F5 Lunate Epsilon (ϵ)
"Lunate" because it's shaped like the crescent moon.
U+2027 Hyphenation Point (‧)
This is used to separate syllables where the pronunciation needs to be given, e.g. in dictionaries.
U+203D Interrobang (‽)
The interrobang is "?!" melded together into a single character. According to Interrobang-Mks, it was invented in 1962 by Martin K. Speckter.
U+2042 Asterism (⁂)
In astronomical terms, an asterism is a group of stars which is not a constellation (from Greek, asterismos). In typography it is a character that uses three stars as its glyph, and is used to draw attention to a piece of text; to mark it as important.
U+2302 House (⌂)
Useful character: if browsers would display the glyph properly, it could be used to point to the homepage of the site, or to link to a geo address page.
U+2603 Snowman (☃)
Yep, it's a snowman; the first great character of the miscellanea. The characters in this block are intended for decoration, or to facilitate the printing of manuals etc. (or so the Unicode Specification would have us believe).
U+2604 Comet (☄)
Together with the asterism, and U+2606 (White Star), this character can be used to form a representation of the night sky. O.K., maybe not. Like snowman before it, this is one of the miscellaneous block.
U+262F Yin Yang (☯)
Yin and Yang, in a single character. Good idea to include this in the character set, although it's one of those characters that won't be used all that often. There are also trigrams (U+2630–2637), and chess pieces (U+2654–265F), which were too numerous to include here.



Scientists have found an 800-mile-diameter Kuiper belt object that may be big enough to qualify as the tenth planet, although that seems unlikely since it's roughly half the size of Pluto. Dubbed "Quaoar", it was discovered by Michael Brown and Chadwick Trujillo back in June of this year.

This will open up the old debate about what constitutes a planet, and may even lead to Pluto being stripped of its planetary status. Meanwhile, there will even be debates over the new object's name; "Quaoar" is an unofficial designation, and is likely to change.

More coverage: the world of giant Quaoar is the frozen limit, The Independent; Quaoar, the newest planet... or is it?, The Age; Large world found beyond Pluto, BBC News.

Warning: Semantic RegExps Ahead


After working on the Semantic Web for a while, a kind of wonderful insanity creeps over you, and you start to wonder how existing systems would be if all the substantial bits were replaced with URIs (or their syntactic abbreviations, QNames). So, being the Semantic sucker that I am, when I read the section about scanf() simulation with RegExp in the Python documentation, I wondered if those formatting tokens could be replaced with QNames. You might get something like '{re:str} - {re:digit} errors, {re:digit} warnings'.


There are some things that you really don't need decentralized and distributed solutions for, and RegExp groups just happen to be one of them.

The only benefit to Semantic RegExps is that you would be able to attach further information to the groups identified with QNames. In some RegExps you can uniquely label groups anyway (the syntax for that in Python is (?P<name>...)), so we're halfway there.
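The (?P<name>...) labelling is already enough for the errors-and-warnings example, with no QNames in sight:

```python
import re

# Named groups give each match a label you can query by name.
pattern = re.compile(r'(?P<errors>\d+) errors, (?P<warnings>\d+) warnings')
m = pattern.match('3 errors, 2 warnings')
print(m.group('errors'), m.group('warnings'))
```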

Odds and Ends


Today I realized that "Anglo Saxon" is much better as an adjective than "Old English". I thought a bit about expressing BNF in RDF and decided that it wouldn't be worth the bother unless you adopted something like Sandro's pet project. I found that Project Gutenberg is irritatingly slow these days, but at least it carries a good selection (update: Aaron informs me about the archive.org mirror of PG; nice interface, too).

I coined the word "unpriggish", and then found that Google had already picked up nearly 30 instances of its use. I learned that nobody really knows when "i18n", the standard abbreviation for "internationalization", originated (although the earliest Usenet message that Google has containing it is from November 1989).

I also found it ironic that a speed camera warning sign was posted at a point of roadblock gridlock.

"You know what I'm sayin' and you know what I mean."—Odds and Ends, Bob Dylan, 1967.

Mr. Swartz went to Washington


I wanted to make miscoranda a place for original thought, with little regard for trends of the community that can be picked up at numerous other excellent sites; since, as Edsger W. Dijkstra put it so succinctly, you should "only do what only you can do." But it seems that Aaron Swartz is on a personal crusade to do the sort of things that strike me with so much awe that I can't help but report on them here.

And so it is with his latest article, Mr. Swartz Goes to Washington. The thing that immediately struck me about the piece was the veracity with which the writing was conducted, the rhetoric (marvellous considering that the piece was clearly written whilst he was fatigued), and the shift between impersonal journalism, unfolding first-hand account, and splashes of humour. Hey, he probably gets it from Seth "perfect diction and sentence structure" Schoen.

But the most welcome thing about the article is the underlying humanity that is Aaron Swartz. If you haven't followed the enormous battles over the government's attempts to "steal books from children" by extending copyright, then this might not be the best introduction, but even then it's easy to pick up as you go along. You're not left out in the dark at any time; it's going to take a saturnine world indeed to crush his child-like sense of wonder of everything, reminiscent of the Ruler of the Universe guy in Hitchhiker's Guide (the one with the pencil), and the empiricists' view of the tabula rasa.

And yet, like myself and a lot of people, he's drowning in a sea of luxury; a good internet connection, stability, and a hot dinner every night has been enough to deprive almost everyone that I know of that sense of awareness that comes from being on the edge. Aaron Swartz is not on the edge, and he's got a lot of potential and opportunities. But the 130 word note of thanks in "... Goes to Washington" is a good admonition to any thoughts of arrogance at his lifestyle.

As for eating your own dog food, Aaron, being less of a hypocrite than most, has dedicated the article itself to the public domain. It's easy, see? Just give it away to the greater humanity.

And I do believe that he is as close to selflessness as one can be in such a situation. I remember reading a chilling letter on www-rdf-interest from Joshua Allen, whom I respect to a great degree, that contained the words "I, personally, fear anyone who claims to be motivated by altruism rather than selfishness." I don't believe what he says in the slightest: I believe that one can be aware of a common good, and speak openly about it. Aaron is one of the first people that I could safely point to to back up my theory.

Now, if that doesn't put the blushometer up to 30, I don't know what will.

Where is King Alfred the Great?


King Alfred the Great is considered by many to be the first monarch of a unified England, commanding as he did all of Mercia, gaining many allies from the west, and setting up the Danelaw. He's the only monarch of England ever to have been given the title of "the Great", and Sir Winston Churchill considered him to be the greatest Englishman ever.

So, you'd expect him to have a grave befitting his renown, right? Well, perhaps at first, but not anymore; no one actually knows where he's buried. The problem is, King Alfred was king of Anglo Saxon England between 871 and 899 AD, and an awful lot has happened since then to the Old Minster (and subsequently Hyde Abbey) at Winchester, which is where he was buried.

There was a considerable excavation of the site in 1999, but only an empty tomb was found, and according to the summary of the excavations, the research team believe that the remains are lost forever. As an outside chance, though, there is one possibility:-

In 1863, John Mellor excavated the site of Hyde Abbey and claimed to have found King Alfred's bones. Few records were kept and considerable doubt exists over the find. The bones were reburied in St Bartholomew's churchyard, marked by a stone slab with a simple incised cross.
Tours: Alfred the Great

There's a picture of that stone slab on the Winchester Cathedral Environs page.

Digital Style


As language changes, characters change too. The last major change in the English alphabet was the derivation of "j" from "i". The two characters were first distinguished in the middle of the 16th century by the philosopher Pierre de la Ramée. In fact, a long consonantal "i" was used by the Romans, and an "i" with a long tail was often used by writers in the middle ages to serve the same purpose. In the twentieth century, characters like the interrobang (1962) sprang up, and typography continued to progress.

Representing these characters digitally, whilst bringing many benefits, is problematic because of its lack of flexibility. Before the days of iso-8859-1 and Unicode, solidi were used after vowels in ASCII to represent the grave and acute accents. As for the charm of Shakespeare's Signature and Dickinson's Dashes, as soon as you reproduce them—even as a graphic—you lose something valuable. Hockney started to produce digital art partly because it took away that step of reproduction; each printout was an original.

But as technology evolves, we may get some of this flexibility back. A quick search for "digital paper" on Google yielded an article called Are we ready for digital paper? Betting on the "six months distant" digital paper and handwriting recognition technologies is a big gamble, but it does feel as if it's a logical progression, and one which could bring charm and style back to writing.

Make, Store, Break, and Find a Page


archive.org's old press release page seems to have been moved:-

HEAD http://www.archive.org/wayback/press_kit/

404 Not Found
Server: Apache/1.3.26 (Unix) PHP/4.2.2 mod_perl/1.27
X-Powered-By: PHP/4.2.2
Connection: close
Content-Type: text/html
Date: Sun, 13 Oct 2002 14:57:07 GMT

But, happily enough, it's still in the archive itself, dating back to 2001-10, so my trusty bookmarklet came up trumps again. Still, although we can quickly forgive them, what does it say when the largest Internet library in existence can't maintain a few bytes of data on their main site to stop links breaking? Even a redirect would have sufficed. cf. Cool URIs Don't Change.

I Wonder...


...how many insects are in the average pound of insects? When did PICS become the Semantic Web? Why didn't Vauxhall compress the name of their tyre valve cap tool down to TVCT? If the Web Archive and Google can host X terabytes of data, isn't memory cheap enough that everyone in the world could have practically unlimited free Web space? Where is that extra electron on a buckyball—does it hang off the end, get delocalized, alternate between two and one, or what? Isn't it ironic that the "inventor" of spam, Jakob Nielsen, has now patented a method to filter spam (source: ASw), or am I just cynical?

Answers on a postcard to sbp@miscoranda.com.



"The eschewal of the excretal is not eventful, but the espousal of the ethereal is eventual."—i18n -> ixviiin, by Dean A. Snyder on the Unicode mailing list

Cracking Contraptions


If you didn't already know by now, Nick Park and co. are creating a series of ten short Wallace & Gromit films, the first one of which, Soccamatic, is available on the Web.

The BBC's homepage for the series contains a little typo:-

The triple Oscar©-winning creator of Wallace and Gromit[...]

I think they mean "Oscar®". This got me thinking: how long would a word have to be before it's copyrightable? (And if it was an encoded version of DeCSS, would I still be able to internationally ship it if I printed it on the front of a t-shirt?)

Anyway, Soccamatic is brilliant, and put together, the shorts should make some good watching. There's a trailer of them available from AtomFilms.

Middle English, Modern Times


In my recent browsings, I've come across sites detailing a couple of interesting dialects which have relations with Middle English. The first dialect is Yola:-

Yola was a Middle English dialect that survived until c. 1820 in County Wexford in Ireland
English Dialects

In fact, Yola borrowed a lot from Irish. It's a shame that it petered out when it did. As John Cowan put it: "[i]t really cheeses me off that English's closest relative should have died out *just* before the invention of historical linguistics, so our records of it are, well, somewhat lacking."

The second dialect I found is that of the Black Country:-

The dialect of the area remains perhaps one of the last examples of early English still spoken today. [...] Other pronunciations are 'winder' for window, 'fer' for far, and 'loff' for laugh - exactly as Chaucer's English was spoken.
Ow we spake

It's funny how some regions can be so isolated as to preserve parts of a language extinct everywhere else, and I suppose that with globalization, this will be occurring less and less. If anyone else knows of any dialects derived from Middle English that flourished on into modern times, feel free to let me know.

Digging up Transcripts


The Supreme Court question Mr. Waxman in the Reno vs. ACLU case of 1997:-

QUESTION: I take it then that you would also defend the constitutionality of a statute which, tracking the words we have here, prohibited indecent conversations on a public street with minors present --
MR. WAXMAN: I think that --
QUESTION: -- or between minors.
MR. WAXMAN: Well, I think that a municipality certainly could.

Mr. Waxman, it turns out, used to be the Deputy Solicitor General of the Department of Justice, and is counsel to the MPAA in the Eldred case.

Turing Completeness in the Oddest Places


XSLT is Turing-complete, as is Befunge, and, scarily enough, Conway's Life. In fact, someone even built a Turing Machine in CL, which made me chuckle for most of the day after finding it.

Of the three systems mentioned above, Befunge is probably the easiest to implement in a high-level language such as Python—but just imagine what it would be like to implement in CL. A while ago, I proposed that a near-optimal test of programming insanity would be to implement XSLT in Befunge in CL.

Of course, all of this was before I found out about Wang tiles.

XMLNS Prefix Cataloguing


When mixing XML languages, you often end up with a mess of namespace declarations. Schemata can take some of this mess away, by letting you either fix a namespace declaration so that the namespace prefixes will be recognized post-validation, or set entities which you can use in place of the URIs.

That's useful to people who hand edit XML, but schemata also have a legitimate utility to browsers since they define which attributes have XML ID datatyped values. There was a proposal for an xml:id attribute, but the idea never really got off of the ground.

Anyway, it's a nuisance requiring people to implement fully validating parsers just in order to be able to use the two functions outlined above, so why not create a very simple format which just does those things? For example:-

<xpc xmlns="http://example.org/2002/xpc">
 <bind prefix="" ns="http://example.org/myLang"/>
 <bind prefix="html" ns="http://www[...]/xhtml"/>
 <bind prefix="m" ns="http://www.w3[...]/MathML"/>
 <XMLIDattr ns="http://example.org/ml" name="id"/>
</xpc>
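Reading the bindings out of a profile like the one above needs nothing more than a non-validating parser. A rough sketch with Python's standard library (the xpc namespace is the made-up one from the example, and the full namespace URIs are filled in by me):

```python
import xml.etree.ElementTree as ET

XPC = '{http://example.org/2002/xpc}'
profile = ET.fromstring("""<xpc xmlns="http://example.org/2002/xpc">
 <bind prefix="html" ns="http://www.w3.org/1999/xhtml"/>
 <bind prefix="m" ns="http://www.w3.org/1998/Math/MathML"/>
</xpc>""")

# Map each declared prefix to its namespace URI.
bindings = {b.get('prefix'): b.get('ns') for b in profile.iter(XPC + 'bind')}
print(bindings)
```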

Providing a link from an instance to the XPC profile could be just as straightforward, although you'd have to declare at least one namespace to be able to provide the semantics for that link.

<doc xmlns:xpc="http://example.org/2002/xpc"
   xpc:profile="myProfile.xml">
 <html:p><hd id="blargh">Blargh</hd></html:p>
</doc>

Or perhaps a default XPC profile could be retrieved from the namespace via an RDDL catalogue.

Overall, this is a bit of a throwaway idea, but since syntactic validation is most useful as an authoring tool, I wonder why a user oriented solution for the XML ID problem hasn't been deployed yet.

100 Greatest Britons According to Google


The BBC conducted a survey of the 100 greatest Britons, and will be giving the results on their "Great Britons" show. They've already published the list of nominees, so I ran each of the names through Google and sorted them by how many results each person got. The results are on their own 100 Greatest Britons According to Google page.

It will be interesting to see how this compares to the BBC's list.

Revelling in RegExps


I enjoy coming up with interesting, but pointless, RegExps such as the following: r"([01])(?:&\1|(?:(?<=0)&1)|(?:(?<=1&0)(?:[^&01]|\Z))|(?:(?<=0)\|0)|(?:(?<=0\|1)(?:[^|01]|\Z))|(?:(?<=1)\|0)|(?:(?<=1)\|1))". But even that one still doesn't take the prize for coolest RegExp away from r"^(?!1?$|^(11+?)\1+$)".
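The prize-winner deserves a demonstration: matched against a string of n ones, it succeeds exactly when n is prime.

```python
import re

def is_prime(n):
    # 1?$ rejects lengths 0 and 1; (11+?)\1+$ rejects any run of
    # ones that is some shorter run (of two or more) repeated,
    # i.e. any composite length.
    return bool(re.match(r"^(?!1?$|^(11+?)\1+$)", "1" * n))

print([n for n in range(20) if is_prime(n)])
```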

BProxy: An HTTP Proxy


BProxy logs the time, URI, and title of HTML pages that you browse. I've wanted this for a while, since Aaron released his archiverProxy (which archives the content of every page that you go to).

I may extend BProxy to do other useful things, such as filtering out annoying scripts, getting rid of the referer header when I don't want it, and so on.

It's called "BProxy" since it came just after "AProxy", which was a quick test proxy that didn't actually work.



Using five sources (all, BritishText, wordsall.txt, dictionary, web2) and a bit of bash and Python scriptery, I've come up with a list of over 380000 words. It does not include abbreviations, names, or hyphenated words. The OED 2nd edition has well over 600000 words, but at least this is a start.
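The scriptery itself isn't shown, but the merging step might look something like this (the filtering rule is my guess at "no abbreviations, names, or hyphenated words"):

```python
import re

def merge(*wordlists):
    # Keep only lowercase, purely alphabetic words: this drops
    # capitalized names, dotted abbreviations, and hyphenations.
    words = set()
    for text in wordlists:
        words.update(w for w in text.split()
                     if re.fullmatch(r'[a-z]+', w))
    return sorted(words)

print(merge("aardvark Zebra co-op", "zebra aardvark etc."))
```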

I might expand it to include the words used in some of the works of Chaucer, and Shakespeare etc.

BProxy Revisited


I updated BProxy so that it can now filter certain sites (e.g. advertisements). There are several files on the Web that list advertising servers, but I'm just adding various servers as I go along.

I've also cleaned up the code quite a bit, and fixed a few bugs, such as the fact that carriage returns were being left off of some headers. I've been testing it quite rigorously, and it's standing up well for general use.

cf. the original announcement.

Alphabet Soup


Article, Indefinite (PDF; sorry), bee, C Major, Tweedle-Dee-Dee, e-mail ("Mr. Tomlinson, come here; I want you" may have made it), f's for fricative (voiceless labio-dental fricative, indeed), Gee-Gee, 'aitch, i (root -1) in Euler's Formula, Homer... [pulls back bush] Jay Simpson!, 'k, oll korrect, ell, em ('n' en), en(voy), the one dressed as O, Proof that P, Q, R Project, S is for GNU, according to Google (2002-10), Tee, History Of, u'Enhanced unicode constructor', Ved, Double U (wonderful), X, Generation, whI: unknown origin, zee and zed, Þ is a letter too.

HTML Wordcount


wchtml is a short Python script for getting the wordcount from HTML files. It strips tags and comments, and forks off to wc if it's available, or counts a simple split on whitespace otherwise.

It provides a count without quoted (<blockquote> or <q>) sections too, if it finds any.
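The stripping step reduces to a couple of substitutions; roughly this (a sketch of the idea, not wchtml itself):

```python
import re

def wordcount(html):
    # Comments first, since they may contain stray angle brackets,
    # then tags; count whatever whitespace-separated tokens remain.
    text = re.sub(r'<!--.*?-->', ' ', html, flags=re.DOTALL)
    text = re.sub(r'<[^>]*>', ' ', text)
    return len(text.split())

print(wordcount('<p>Hello <!-- not a word --> world</p>'))
```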

Inflatable Military Devices


Well you learn something every day, so the cliché says. Today I learned that the military use huge inflatable tanks in their exercises. And not only that: one of them has gone missing, blown away in the U.K.'s recent gales. The BBC has the story:-

On Monday a military spokesman said they were anxious to hear from anyone who may have woken during the morning to find a tank in their garden.

It made me wonder why they don't use these cheap substitutes in battle to scare the enemy; perhaps they're too easy to blow up.

I Into J, Analogue into Digital


The letter "j" was first distinguished from "i" in the middle of the 16th century by the philosopher Pierre de la Ramée. In fact, a long "i" was used by the Romans, and an "i" with a long tail was often used by writers in the middle ages.

I was just thinking that it's a shame that character-evolution is probably all but over now that everything's being digitised. Not only are we losing the joy of handwriting—from Shakespeare's Signature to Dickinson's Dashes—but also those things that take centuries to evolve, such as the ampersand.

Oh well. Perhaps digital paper will become fashionable.



"If I write a weblog and no one reads it, do I exist?"—Can anybody hear me?, Shelley Powers

Well, I guess I heard her. Shelley's post was written in response to Anil Dash's courageous piece about his struggles with mental illness. He urges everyone to open up, and discusses the value of Weblogging as therapy.

For some, the connection between health problems and writing may be there, and I realise that those for whom clinical depression is a daily occurrence can't just say "hey I'll lighten up today": it's a very serious thing. But for me, personally, there's not much of a connection, and, moreover, I feel that there's an inherent danger which I'll get to in a couple of paragraphs' time.

I should quickly note that I've suffered from agoraphobia and anxiety for some years now, but I have not been shy to tell people about it (which is very unusual, since most people with the same condition—and it affects roughly 1% of the population—suffer in silence). It's nothing I'm ashamed of, and it's nothing that I have to hide; in some ways, I have more pronounced skills because of it, although it's a long term battle.

Whilst agoraphobia leads to depression in the majority of cases, however, I've not found it to be a problem. So, from my experience in coping, and noting most sincerely that I do not mean to belittle those that really do have a problem (such as Anil), I think that too much introspective nonsense from Webloggers is in poor taste to say the least. With hot meals, cable, and a fairly good Internet connection, surely wallowing in self-pity is insulting to the truly poor and desperate? My problems, and the problems of most people that I know, pale in comparison with third world hunger, sufferers of mental and physical abuse, and so forth. I think that anyone mired in decadent first-world luxury needs to put things into perspective.

As for getting an audience for therapy through mental health writing, once again I'm glad that people with serious problems have an outlet, but it's frustrating that the signal to noise ratio is so poor. When I write here, I give my words to Google at least, and so I try to approach specific things: linguistics, programming, and anything else that might benefit equally those people coming across it now, or finding it through Google or some link in the future. This post, of course, is a major exception; I hope that Aaron won't mind the wasted bytes on his server.

Therefore, I support Anil's idea, but I can't endorse needless cries for attention, or the vacuous and insipid narcissism that people might take it as an excuse for. The line between the two is hardly thin, but it would have been easy even for me to stray across it.

Anyway, that's my summary of feelings at the moment. I sincerely hope I haven't offended anyone—I haven't set out to do so—and I must stress again that this is all just my perspective at the moment, and that I'm subject to errors and to a change in opinion. I should note also that I'm not directing this against any specific person (except perhaps myself), especially Anil, whose post took a lot of courage, as I've already noted.

Feedback is welcome; sbp@miscoranda.com.

Chauvinism vs. Neurosis: the Professor Wins


A Literary Exercise for you Intellectual Types, from Society for the Preservation of Clue [sic]. You know it's going to be good—I've linked to it even though it has nothing to do with Python or the Semantic Web.

Thanks to Kel and Melissa for pointing this out.

OE, AS? Anyway, put it Online


I came across The Complete Corpus of Anglo-Saxon Poetry a while ago, but don't think I noted it down here. It's a bit scary that the entire poetic works of a language could now be collected on my hard drive, but I'm sure that they must have missed some texts out.

It'd be great if the world's academic institutions could organize an Old English Corpus—all of the extant works—perhaps with scans of documents...

Oh, and if you don't have a CSS implementing browser, or happen to be Googlebot, you can check out something else. Just a little experiment (view the sources).

You Know My Position (Look up the Momentum)


Uncertainty About the Uncertainty Principle. A good article on Heisenberg, WWII, and how the uncertainty principle is faring these days. (via Seth, via Ian).

Process of Elimination, Joyce Style


You tollerday donsk? N. You tolkatiff scowegian? Nn. You spigotty anglease? Nnn. You phonio saxo? Nnnn. Clear all so! 'Tis a Jute...
—Finnegans Wake 16.5

Via a .sig of John Cowan's.

The Mouse May Already Be Free


The Free The Mouse campaign may receive a (not so) serious boost from the discovery of a 700 year old painting in an Austrian church:-

Siggi Neuschitzer, manager of the Malta Tourism Association, said: "The similarity of the painting to Mickey Mouse is so astounding that the Disney concern could even lose its world-wide copyright licence. [...]"

Shortest Python Quine?


This morning I scribbled down, pen on paper, a Python Quine. I hadn't actively looked into Quine techniques before; it was nice to start from scratch. Here's what I got, with a little reduction when I typed it up (my scribbled version had used a function instead of lambda):-

print (lambda s:s+`s`+')')("print (lambda s:s+`s`+')')(")

At this point, I went searching for the shortest Python Quine that anyone had come up with. I found Seth Schoen's challenge again, though it doesn't actually contain the code for the Quine that he came up with. Searching elsewhere, I found a couple of sites (The Quines List, The Quine Page) that had basically the same Quine (one credited to Manus Hand, the other to Frank Stajano):-

_='_=%s;print _%%`_`';print _%`_`

It was shorter than mine (since it assigns first and then prints), but I noticed that it could easily be made shorter still. So here I present what might just be the shortest possible non-empty Python Quine, at thirty characters:-

_='_=%r;print _%%_';print _%_

It seems odd that the original authors didn't use %r in the first place (perhaps they were going by a version of Python that didn't have %r; when was it added?). For total authenticity, you should include a line break after it. Putting a comma after each %_ would stop the line break, but make it two characters longer, and therefore longer overall by one character.

As for my print-first approach, I managed to shorten it to the following:-

print (lambda s='print (lambda s=%r:s%%s)()':s%s)()

Which is fairly respectable, at fifty-two characters.

I wonder what the shortest possible multi-quine is? (Check out David Madore's investigation into Quines for a definition.) Also, I wonder if it's possible to prove that a Quine in a given language is the shortest possible for that language, other than by using the evidence that no-one has come up with a shorter one yet? Feedback most welcome.

Python IRC Logger


I just wrote a quick IRC logger in Python. Requires Dan Connolly's ircAsync module to work. The log format is configurable: just mess about with the strings in the top of the file. Should be easy enough to work out. Released under GPL 2 or later: share and enjoy.
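The "strings in the top of the file" approach is simple enough to sketch (these particular format strings are illustrative, not the logger's own):

```python
# Configurable log formats: edit these strings to taste.
MESSAGE = '%(time)s <%(nick)s> %(text)s'
JOIN = '%(time)s *** %(nick)s has joined %(channel)s'

def logline(template, **fields):
    # %-formatting against a dict keeps the templates readable
    # and lets each one pick only the fields it needs.
    return template % fields

print(logline(MESSAGE, time='21:03', nick='sbp', text='hello'))
```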

How to Write Like a Fucking Idiot


A new article of mine, How to Write Like a Fucking Idiot, has been published in issue 7:16 of CoN. Have fun ROFFLEing.

HTML Considered Harmful—Again


Hixie's Sending XHTML as text/html Considered Harmful seems to conflict with the (non-normative, but HTML WG consensus) XHTML Media Types note, which states that XHTML 1.1 SHOULD NOT (not MUST NOT) be sent as text/html, and that ("HTML Compatible") XHTML 1.0 may be sent as text/html.

I tend to use XHTML, and I like to play around with XSLT screen scraping and such, too. I don't validate every page that I create (tsk, tsk), but I want a decent HyperText editor that'll do it for me. And yes, I know about Amaya.

So Hixie and Mark Pilgrim and others are reverting to HTML 4.01, eh? Well, that's no magic solution either. After all, this re-design of the Google homepage that I hacked up is a valid HTML 4.01 document (try it!):-

<title/GS/<body onLoad="document.f.q.focus()"<form
action="http://google.com/search" name=f/<p/<input
name=q< type=submit>//

Does anybody know of a single browser that will render the document correctly? Such features aren't permitted in XML (since it said NO to SHORTTAG). Go figure.

Fragment Redirects


Just a brief sketch, in JavaScript:-

<script type="text/javascript">
<!--
var fragmap = {};
fragmap['#code'] = './code/';
fragmap['#faq'] = '/faq/';

if (location.hash) {
   if (location.hash in fragmap) {
      location = fragmap[location.hash];
   }
}
// -->
</script>

Obvious drawback: it only works in JavaScript-enabled browsers. In fact, even then you have to refresh the page (in Phoenix 0.5, at least) to get it to work.



{}: surprisingly, it's about gender issues, and not what would happen if Sigur Rós had a dislike for large mathematical sets.

More Marvellous Miscellaney


As you may have noticed, I'm starting to update miscoranda again, and this time it has a nice new clean style, a custom built backend (no more Movable Type!), and will be covering mainly home-spun projects instead of snippets of boring Web news.

With this dry period now officially over, I can update this thing more efficiently. I'd written a little Python script, publish.py, that posted to Movable Type, but it wasn't all that compatible, and I was running into problems. So I spent a couple of days hacking up a small Python script that could do the same job. Now all I do to publish to miscoranda is save a file using a standard link, and then run publish.py.

I've been working on a number of small but hopefully useful projects, mainly to do with the Semantic Web, but not all of them. For example, I've written:-

And there's probably some other stuff lying about the place, too.

So, in general, I hope to keep track of the little bits that I publish to various mailing lists and so forth, the projects that I wouldn't otherwise have a forum to mention, and anything else that takes my fancy at the time. Enjoy!

WyPy Python Wiki


Written in Python, wypy may be the world's smallest wiki at just 23 lines of code. I wrote it months ago, but have only just released it. Also: the smallest wiki contest, and the wypy source code. Released under GPL 2: share and enjoy!

And yes, I know that months ago I said I'd update miscoranda a lot more often. But I've been a) busy writing the world's smallest wiki, and b) busy procrastinating from writing the necessary miscoranda interface code. Busy times. But hey, I posted this, so that's a start.

u scripts


I glue my computer together with a lot of utility scripts. A lot of Python utility scripts, mainly. And I have an odd convention for naming them: instead of just throwing them all in a directory somewhere and then adding that to my path, I created a script that proxies access to the scripts. It's just two lines of sh, and it's saved as "u" in my /bin:-

ONE=$1; shift 1; /home/util/${ONE}.* $*

You may have realised by now that this is insane. It means that instead of using the command, say, "cathtml" for my HTML concat program, I have to type "u cathtml". Well, yes, but consider this scenario: I create a program called "splunge" and use it a lot, referring to it in bash scripts and so forth. Then along comes the next Larry Wall or Guido van Rossum and creates a very popular "splunge" language. Oh dear.

Now, you'll have spotted that this is very unlikely, and that there are already clashes in the *nixy world, and they're dealt with. Well perhaps, but firstly, I happen to believe that adding "u " isn't too much typing tax; and secondly, I happen to believe that not accounting for an improbable event solely on the grounds that it's improbable automagically raises the probability of the event tenfold.

Let me guess; if you had to use my computer, you'd soon be saying "fuck u"?

FOAFQ and Related FOAFiness


If you've been following the FOAF lists, you'll know about my FOAFQ query service CGI thing already. It lets you do little-language searches on Libby Miller's newly-crawled FOAF corpora, such as weblog of sbp. Simple, and it breaks a lot, but it shows the potential of the system given a good interface.

Since then, I've had a few very nice comments from some very nice people, some of whom I knew, and some of whom I know now. Example:-

"Sean B. Palmer's Foaf Query Service kicks mighty ass. Well worth a play with."—Ben Hammersley

I also got one fairly indifferent comment, though it's in French, so at least it's pretty.

"Pas de possibilités énormes pour l'instant"—Jean-Julien Claudon

(There are no enormous possibilities for the moment.) Actually, it's right on target: the current form is my impression of how one should be able to query the FOAF Web. Jay Fienberg's attempt at a different interface is the sort of valiant forward motion I'm hoping for. It's not that his way of querying (drop-down form) is better than mine (little-language) or vice versa, it's that we need to try out all sorts of approaches for the maximum benefit.

Extensibility Framework


I've been producing an Extensibility Framework for the new Atom (Pie/!Echo) syndication format, with Ken MacLeod, Sjoerd Visscher, and Sam Ruby. We've got Python/SAX and XSLT implementations already, and there's a RELAX NG Grammar for it, too.

But what does it do? It's a reusable XML structure that has minimal impact on serialization, and is directly transformable to RDF/XML, and parseable to triples. It's also an alternate, non-striped, XML serialization of RDF. We've mainly been developing it on #echo, but the little feedback that we've had so far has been very positive. E.g., Asbjorn Ulsberg: "I just have to say; I love it". Since it's reusable, it's possible to express things like your FOAF files in the Extensibility Framework.

We've already applied some obvious design choices, realising, for example, that rdf:resource in RDF/XML is redundant. It's been interesting, but it still needs a lot more work.

Python RDF Parser (& Improvements)


rdfxml.py is an RDF/XML parser written mostly in one night and in under 10KB of code. It's been released under GPL 2 and the W3C's software license. If you've been following www-rdf-interest or #rdfig or anywhere like that, you'll know all this already—sorry.

Back in the day, Aaron Swartz set up an RDF/XML <=> RDF/N3 conversion form under the aegis of the Semantic Web Agreement Group. Since then, that form has been thoroughly broken, so I thought it'd be a nice idea to come up with a replacement. I hadn't realised that it'd already been done by the mindswap group, so I ploughed on regardless. In fact, I'd like a form that does automatic type checking anyway, so it mightn't be a bad idea.

First, I tried to work out CWM's API. To my great surprise, I managed this. But it wouldn't work on my server due to some arcane problem. So then I tried a command line interface, using popen. No dice; this time there was a problem with the XML parser. SAX couldn't make a parser. Great. So I updated rdfxml.py a little (on my hard drive—I've got some other things planned) so that it can optionally use Python's old but interesting XML module, xmllib, whilst implementing some of Dan Connolly's suggestions.

It turns out that I was able to reuse some of the parser for the guts of the AtomEF parser that I wrote yesterday, and it highlighted a lot of the differences between the two RDF serializations. Then again, my much-maligned and forgotten XENT parser is probably the simplest bit of RDF-in-XML serialization parsing code. The catch, of course, is that not all the information is as strictly XMLized or XMLised as it could be...

As for rdfxml.py, there's still a lot more cleaning that could be done, and I'm not really sure if anyone's using it enough to need those changes. So it's not a huge priority.

Google Your Computer


It amuses me that I can search around three billion documents on the Web in roughly a second using Google, but it takes five minutes to search my hard drive. I know that a lot of my data is binary, and I know that my search architecture is for something completely different, and I know that with proper indexing a Google-like search would be possible. I know all that. But still it takes me ages to search my hard drive, and still I can search the entire engoogled Web in seconds.

In 2003-02, I wrote a script called findword.py. It indexes pages, and then lets you search them in a rather Googlish manner. Ignore what it says on the findword.py page—it's actually very fast when you ignore the time it takes to load the index. If you have the index constantly in memory, then you can get freakishly good searching, but otherwise, I'm not sure how to approach it.
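For flavour, here's a toy inverted index along the lines findword.py suggests (a hypothetical reconstruction, not the script's actual code or file format):

```python
import re
from collections import defaultdict

def build_index(pages):
    # Map each word to the set of pages that contain it
    index = defaultdict(set)
    for name, text in pages.items():
        for word in re.findall(r'[a-z]+', text.lower()):
            index[word].add(name)
    return index

def search(index, query):
    # AND semantics: return pages containing every query word
    sets = [index.get(w, set()) for w in query.lower().split()]
    return sorted(set.intersection(*sets)) if sets else []

pages = {'swtips': 'SW hints and tips',
         'quine': 'the shortest Python Quine'}
idx = build_index(pages)
search(idx, 'hints tips')  # ['swtips']
```

Keeping the index pickled on disk, or constantly in memory as suggested above, is what makes the search itself fast; building it is the slow part.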

Of course, I could always export my hard drive to Google...

21:05:21 <sbp> .google SW Hints and Tips
21:05:22 <xena> SW Hints and Tips: http://infomesh.net/2001/08/swtips
21:05:36 <sbp> heh, I can use Google to search my local hard drive!
21:05:41 <AaronSw> heh!


Reforming RDF/XML Part I: ID vs. HREF


And I don't mean reformation as in a bunch of '80s nobodies getting back together to make some money from a couple of half-sold pub-sized concert venues.

<mortenf> sbp, you wrote in http://miscoranda.com/47 that " rdf:resource in RDF/XML is redundant " - how so?
<sbp> mortenf: all instances of <prop rdf:resource="URI"/> can be replaced with <prop><rdf:Description rdf:about="URI"/></prop>
<mortenf> yeah, ok
<mortenf> so rdf:ID could go as well?

It could indeed.

I hadn't myself realised that rdf:resource was redundant until Sjoerd pointed it out to me, but now that I think about it, I'd already read a beautifully arcane TimBL proselet on the related rdf:ID vs. rdf:about issue, the most understandable-to-regular-folk part of which I'll quote here:-

Ah. Now consider what is the difference between reference and definition? I conclude there is none, as both are the assertion that the resource in question is identified by a URI.
Thought process behind implicit definition, Notation3 DesignIssues

When we published the Extensibility Framework, we added an @ref attribute that does the job of both a referencer and a definer, as TimBL put it. We immediately got accused of a) reinventing xlink:href, and b) reinventing rdf:about. Which is true, if you include rdf:resource and possibly xml:id and xior:xoid too. Because, and repeat this like a mantra if you like, there's no difference between reference and definition.

The point that I'm trying my best to avoid having to summarize is that whilst there may be some conflation somewhere in the logic, @ref is a valid concept that dispels the now nearly decade-old myth that saying "there" is any less authoritative than saying "here".

People and Their Fragment Redirect Code


It's time to check out what some other people are doing. Let's start with Morbus. It's best to start with him, because he's working on a new secret project. Which means he's not doing much else—which means we're not writing an epic horror story together. Do you know what happens when you take a Python zealot and a Perl zealot, each with a predilection for perverse prose, and make them write a story on a Weblog? Well hopefully you'll find out soon, and, if you don't, at least I've put our bizarre idea on record now.

Now for qmacro, aka D.J. Adams. No weblog updates for a month. Are you okay out there, D.J.?

Aaron Swartz. Heh. Aaron's busy working on Atom, and numerous other things that I've probably got no idea about. Aaron has a thing for djb. Well, not a thing for him as such, but an admiration. And I've got a thing (not thing) for Aaron. I'm really hoping that djb has a thing for me, but I have a feeling it's not going to happen.

Aaron gets two paragraphs because there's two of him. The other him goes by the name of Ash, and hangs around on IRC abusing me. He's not reading this, so there's no point in me abusing him back—I ought to do so on IRC in a moment.

Antonio Cavedoni: sorry for not posting your JavaScript update thing yet. Hey, actually, I should do that now.

<script type="text/javascript">
<!--
var fragmap = {};
fragmap['#code'] = './code/';
fragmap['#faq'] = '/faq/';

if (location.hash) {
   if (location.hash in fragmap) {
      location = fragmap[location.hash];
   }
}
// -->
</script>

There you go; it's an update to my Fragment Redirects post that now works in Pho^H^H^HFirebird. Thanks, Antonio!

Terje Bless is currently taking submissions for a new domain name that he'll probably be running a weblog or a new version of the W3C's validator service on, or some combination thereof, more likely. Of course, since he hasn't got a Website yet, you'll not know how to contact him to submit a name, and so he'll probably never end up with a Website. I wonder how people with Websites get around that?

Ken, Sjoerd, Sam: thanks for your patience and ongoing efforts with Atom and RDF.

Anyone else I've forgotten to mention: sorry!

Don't Obfuscate; Obfusk


Sometimes, perhaps only hypothetically, you want to obfuscate a password on a system. Perhaps it's a trusted system but you want an extra level of security. You could base64 encode it, but that's pretty obvious to anyone stumbling across it. So why not obfusk it?

def obfusk(s): return ''.join([chr(ord(t[0]) ^ ord(t[1])) for t in
   zip(s, (str(len(s)) * ((len(s)//len(str(len(s)))+1)))[:len(s)])])

Yes ladies and gentlemen, the code above is a self-reversing obfuscational device intended to make your passwords unreadable unless you have the function above. Without the function, or knowledge of the function, it'd be pretty hard to retrieve the original string. Here's an example of its use:-

>>> obfusk('something')
'JVT\\MQPW^'
>>> obfusk('JVT\\MQPW^')
'something'

Now, you're thinking that since I've bunged this online, its value has decreased somewhat. True, but there are plenty of variations that could be thrown in: think added constants, reversing, and sha1 mixing. It's just a bit of quick hacking, anyway.
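For the curious, the one-liner unrolls into this readable equivalent (a sketch; floor division is used so that it also runs under Python 3):

```python
def obfusk(s):
    # Key: the string's own length, repeated often enough to
    # cover the whole string
    key = str(len(s)) * (len(s) // len(str(len(s))) + 1)
    # XOR each character with the matching key character; XOR is
    # its own inverse, so applying obfusk twice round-trips
    return ''.join(chr(ord(a) ^ ord(b)) for a, b in zip(s, key))

obfusk('something')          # 'JVT\\MQPW^'
obfusk(obfusk('something'))  # 'something'
```

Since the output has the same length as the input, the derived key is identical both ways, which is what makes it self-reversing.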

Incidentally, I'd feel silly releasing a two liner under GPL 2 (even though what with wypy and my RSS 3.0 parser all my code will probably be only a couple of lines long in future). So the only condition of its use, in the laughable event that anyone should want to use it, is that the function or method name be "obfusk" so that I can meme-track it using Google. Thanks.

RDF Path, and RDF API Rumours


I've published some Ponderings on RDF Path that I've been scribbling down. RDF Path languages have tended, to date, to be incompletely specified and murky. This one is still under-specified, but I think it solves a lot of the murkiness that has arisen from people conflating (or ignoring) the node and arc difference.

I also sent a message to www-rdf-comments regarding an N-Triples inconsistency, and got a reply from Dave Beckett under two hours later. How's that for service?

Both of the above were spin-offs of an RDF API that I'm working on, in Python, of course. I've already rumoured it a few times, notably on #rdfig to bijan, who seems to be thinking about using it. It's the next step on from the interesting Eep3, built with the same goal of usability. It's got some interesting stuff in it, but I'm wondering how best to document it and deliver it. It needs a catchy name, too.

Whilst True, Reconnect


Ever find yourself on a Windows box with sporadically disconnecting dialup? Ever needed to just nip out and do something, having to balance it with the fact that wget -c will be sure to break as soon as you leave the room?

Fear no more. You can just use the following VBScript; rename "YourConnection" to the name of your dialup connection, save it to whatever.vbs, and run the file.

set WshShell = WScript.CreateObject("WScript.Shell")
const Key = "HKLM\System\CurrentControlSet\Services\RemoteAccess\"
const SubKey = "Remote Connection"

While True
   ' If not connected to the internet, try to connect
   If WshShell.RegRead(Key & SubKey)(0) = 0 Then
      WshShell.Run "rundll32.exe rnaui.dll,RnaDial YourConnection"
      WScript.Sleep 500 ' Allow the DUN interface to load
      WshShell.SendKeys "{ENTER}"
      WScript.Sleep 30000 ' Allow thirty seconds to connect
   End If
   ' Wait ten seconds before checking again
   WScript.Sleep 10000
Wend

It checks the Windows registry every ten seconds to find out whether the box is connected or not, and if not, it'll automatically dial the selected connection and wait 30 seconds for it to connect.

A simple bit of glue, but rather useful. Enjoy.

How to Thatch a Panther


My latest article, the fanciful Feline in the Fields, is out on CoN.

I quite enjoyed writing this one. I changed styles quite a few times whilst writing it, and so the result is that it's a little patchy, but hopefully that adds to the humour (it's why I didn't smooth it out). There are some little references just under the surface that most people won't get, I'm sure, but it's readable as a short subject-oriented story. Enjoy.

The One Where I Try to Name an RDF Toolkit


I've not been working on my RDF toolkit for about a week, and though I've got plenty of things going on to make up an excuse for having dropped it, I think that a big part of the reason is that I've been unable to choose a name for it.

Here's my shortlist:

Without a name, there can be no module names, no URI for the toolkit, and no alpha-page. The name is very important—it's the first thing that people are going to be introduced to when they come across it, and it's the label which they'll use to refer to it. It has to be pretty good, and I'm getting Atom-esque jitters about the nomenclatural nuances.

At the moment, I'm quite a fan of PyRDF—but is it too generic? What happens if people want to port pieces of the toolkit over to other languages? I'm not sure that I really want "Py" in there, since it smacks of the name vs. address problem.

I was thinking about setting up a little HTTP GET voting system, where you the reader could just click on a link to let me know which one you'd prefer, and then I could simply grep my server logs to get a count... but I really want to hear rationale rather than a stack of votes. One of the most irritating things about naming Atom (which may well be called Nota if the current vote is anything to go by) is that voting seems to be a bandwagon process rather than anything clearly thought out—people see votes from a famous weblogger, and they add their name just because.

I have a few requirements: it must have a low googlecount, be short, and be sensible. The googlecount is for trackability; otherwise it's hard to follow comments and feedback. Experience says a short name is handy: locally, I'm just calling it "rdf", and that's worked well. It has to be sensible for obvious PR/marketing type reasons.

If you have suggestions, feel free to send them in, but I think I'm just venting here. Perhaps I'll try to expand on the shortlist, and then if nothing really stands out, I'll use PyRDF.

Text vs. HyperText


One characteristic of the HTML that I use for the majority of my Web pages is that it's very simple. Here on miscoranda, for example, in the body of posts I only use p, a, em, strong, code, pre, dl, dt, dd, ul, li, and blockquote. Since HTML's verbosity makes it unwieldy to edit by hand, using a poor man's hypertext language (PMH for short) is an attractive alternative. If you've ever used a Wiki, or Perl's POD format, or atx, you'll know what I'm talking about.

But with respect to browsing, HTML source and PMHes are generally unconsumable formats. To use links, it's best to load HTML in a browser. Formatting in a browser is also more user-friendly: I tend to prefer rendered HTML's "blargh" to PMH's "*blargh*" or "''blargh''", etc.

So there's a problem with modes. When editing HTML, I'm bound to use a text editor. I'm writing this post as raw HTML source in a text editor. But when consuming HTML, I do so in a browser. TimBL, in his Editing User Interface, has expounded upon why having both edit and browse modes is harmful, but it's obvious really: it's an annoyance.

PMHes can actually be seen as a mini-solution to the mode problem. A PMH is friendlier to read than HTML source, and it's easier to write, too. But it's not rendered HTML, and so it's not optimal.

For an optimal solution, people have tried two approaches: GUIs for editing HTML, e.g. Amaya, and Through The Web Editors, or TTWEs. GUIs for editing HTML don't solve the mode problem if they can't be used as a primary browsing method, so we can discard Amaya as a potential solution. This leaves TTWEs.

TTWEs are, often, JavaScript applications that one can use in one's favourite browser that make the pages editable. With a bit of hacking, it is possible to make all the pages on one's hard drive, and all the pages on the Web in user-editable space actually editable through the browser.

This is powerful, but it makes me slightly squeamish. The principle of Code Shui says that you shall make your source code readable. Leaving source code up to a TTWE means that the TTWE's formatter had better be pretty good, and most aren't.

So the conclusion to this mini-rant is that text editing is not going to go away for a while. Which means that it'd be nice if text editors supported PMH munging and saving via FTP, and HTTP POST/PUT...

More Exispeciferous Words


I invent a lot of words. And usually, I lose track of these words, because I create them for use in conversation or articles and then move to the next project so fast that I don't note down what words I've created.

Usually, such words are just derivations of existing words, so that their meanings can be easily construed. It's the best way to gather acceptance for a new word if you want it to be used on a wider scale.

Most of the following fifteen words were created by me very recently, and some of them are just fun & whimsy. Some have regular synonyms, and are therefore useless. But perhaps one or two have value. Numbers in parentheses are the words' googlecounts.

activitometer, n. (2)
A device which measures the level of activity within a system. Original context: "an activitometer would be so great".
albii, pl. n. (1,270)
The plural of the word "album". Original context: "Albii are complete units".
amibientic, adj. (0)
The adjectival form of "ambience". Original context: "an amibientic dimension".
arbitraritarianalism, n. (0)
Nonce word created as the nominalized form of "arbitrary". Original context: "oh, the arbitraritarianalism".
enmirthulate, v. (0)
To make something mirthful. Original context: "I enmirthulate myself".
fudgeulated, adj. (0)
Pertaining to that which has been kludged, hacked up, invented as a stopgap, or broken. Original context: "fudgeulated character string".
gemheap, n. (1)
A collection of valuable objects; originally: smorgheap.
idyllatry, n. (4)
The nominal form of "idyllic".
insultation, n. (3,090)
A stream of insults. Original context: "fed up with this constant insultation".
oftenlyish, adv. (0)
Nonce variant of "oftenish". Original context: "I do that oftenlyish too".
oodlefull, n. (0)
A lot; a great quantity of a thing. Variant spelling: oodleful. Original context: "I can provide that by the oodlefull".
prenth, adj. (34)
Historically rural. Example: "The place is renowned for being prenth, but progress has been quite kind".
quietage, n. (124)
Variant of "quietness". Original context: "sorry for the email quietage".
smörktacular, adj. (0)
Nonce synonym of great, spectacular.
verbular, adj. (47)
Inclined to use verbs. Original context: "it makes people less verbular, so I guess so".

Are You Taking Notes?


I take a lot of notes. I jot them down anywhere that's handy, creating a plethoric field of URIs, dates, todos, recipes, chords, quotes, diarizations, ideas, WikiNames, birthdays, phone numbers, authors, email addresses, ISBN numbers, man pages, bash commands, Python code snippets, incomprehensible scribbles, phenomicizations, and shopping lists.

Sometimes there are patterns between these things. They mainly get lost. It's all just content management, and there are thousands heaped on thousands of approaches to dealing with it. There's the Web, for one.

But I feel that I should be able to do better than what's out there at the moment, and so for months I've been working on approaches to dealing with such information; from input to processing to exposition. Ironically, of course, whilst I've been doing it I've actually not used any more complex a note taking program than a date-stamping echo, or a basic run-of-the-mill text editor. But the aim is to provide something which is as usable as possible.

It's a complex area which has baffled me at times, and delighted me at others, and I think I ought to start sharing some of the fruits of what I've been doing.

The first two notes programs that I created were called b and n. Not particularly thrilling names, but there you go. First came b, which was a very simple notes structure with some date handling features and so forth, and next came n which was similar but based on RDF, and way too complicated. Note (heh) that both of these were in the proto-stage of my research and development in this area, and so they're quite primitive.

The b documentation and b source code, and the n documentation and n source code are available on the Web for your perusal. They have such short names since they were designed to be used from the command line. At the moment, I favor an HTML forms interface, since it's something which will work on practically any computer, and can be used over a network.

A Pient of RDF


Silence descended upon www-rdf-interest. Its inhabitants stood watching, waiting, wondering. The wind whistled wearily through the #rdfig environs. All throughout the community, there was focus upon the voices. Just the voices. And the voices spoke aloud, and they were resolute...

"We don't like RDF/XML's syntax!" "The extra features in N3 scare us!" "What if there were a subset of N3 equivalent to RDF/XML?"

The silence quickly returned. Coming forth to envelop the community like a sea mist about a haunted town.

Unfortunately, it was the silence of non-action, so I've taken it upon myself to produce a kind of N-Triples/N3 hybrid, and named it Pient. There's nothing like the ESW wiki for drafting up a grammar, and since it's a wiki you can go and edit it and help and update and provide comments or whatever. Enjoy!

Of course, there are no test cases or implementations or anything like that yet, and no one really seems to have noticed it, so it'll probably not be taken up any further.

Opus 61: More Metanotes


Karl Dubost, on #rdfig: "My information is lost everywhere."

Having shown the world the primitive n.py, and having been introduced to the advanced GNOME storage and Dashboard, I thought it might be a good idea to outline my current state-of-the-art when it comes to making information findable using notes programs.

The most interesting piece of code I developed since n.py was a mini-server that would take my simple datestamped lines, and make them available via HTTP. Its main feature was that it'd hyperlink common words to notes-wide searches for those words. It could also let you go back and forth through notes, provide summaries, and so forth. Its power was in the fact that it was an interpretative thing—it didn't modify the notes or require that one added any metadata. And that's a big lesson; people don't like adding metadata, especially when it's derivable from the content.
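That interpretative hyperlinking might look like this (the /search?q= URL pattern and the common-word set are my assumptions, not the mini-server's actual interface):

```python
import re

def hyperlink(entry, common, search_url='/search?q=%s'):
    # Link each sufficiently common word to a notes-wide search,
    # without modifying the note itself or requiring any metadata
    def link(m):
        w = m.group(0)
        if w.lower() in common:
            return '<a href="%s">%s</a>' % (search_url % w.lower(), w)
        return w
    return re.sub(r'[A-Za-z]+', link, entry)

hyperlink('buy milk and bread', {'milk', 'bread'})
# 'buy <a href="/search?q=milk">milk</a> and <a href="/search?q=bread">bread</a>'
```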

I'd like to have had a scripted SVG interface to it.

My current plan is to have special syntax for metadata, but I don't want it to get in the way. I want each note entry to have a bag of special tokens. Special tokens could be dates, URIs, etc. relating to the entry—anything regexpable that has a special syntax. It's possible that these could just be derived from the content, but at the moment I'm using "@:" as a separator. Using "{}" to interpolate is another possible choice.

There's always a tension between regexping and wanting to do Natural Language Parsing. Learning a little-language or something like lojban would be possible, but rather an immense strain on the user! And that's taking it too far, but identifying and preserving the patterns is certainly still the goal.

The special tokens would be attached to the note via a property for each type of token. So, for example, if you were providing a todo item, you might input: "write a letter to John Peterson @: todo next-tuesday". That could be stored as {(0, "write a letter to John Peterson"): {'type': ['todo'], 'date': ['YYYY-MM-DD']}} in Python, where YYYY-MM-DD is the normalized date. You might also want to preserve the date as it was originally entered.
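A sketch of that input convention (the token classification is an assumption, and a real version would normalize "next-tuesday" to YYYY-MM-DD rather than keeping it raw):

```python
def parse_note(entry, index=0):
    # Split the note text from its "@:" bag of special tokens
    text, sep, bag = entry.partition(' @: ')
    meta = {}
    for token in (bag.split() if sep else []):
        # Assumed classification: known keywords are types, and
        # anything else is taken to be an (unnormalized) date
        kind = 'type' if token in ('todo', 'project') else 'date'
        meta.setdefault(kind, []).append(token)
    return {(index, text.strip()): meta}

parse_note('write a letter to John Peterson @: todo next-tuesday')
# {(0, 'write a letter to John Peterson'): {'type': ['todo'], 'date': ['next-tuesday']}}
```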

Searches would be conducted by setting up constraints, so you could search for all notes within a particular date range, that contain certain words, and so forth, ANDing and ORing constraints together.

It'd be nice to make old posts editable with revision storage, and have searches linkable so that they act like folders do in filesystems, but that's getting towards being like a wiki. Combining a wiki and notes program is a design I've considered, but then I've also considered a LISP/transclusion style design too, so you can't take it too seriously.

Note that b.py and n.py are slightly confused about where to store config data; it's like they naturally want to store it along with the data, but it would probably be better to separate it out into a configuration file. Karl pointed me to a little ConfigParser to dict hack he wrote in Python that'll come in handy. I've also been wondering, though, about using an RSS 3.0 parser. It's frustrating how large interesting applications like this are often made of small tedious components. Not that I mind config parsing (it's actually one of the nicer things to work on), but date-munging is horrendous.

This all reminds me a lot of TimBL's tangle... "How much wood would a woodchuck chuck if a woodchuck chuck wood chuck chuck chuck wood wood chuck chuck chuck"—Weaving the Web.

Dictionary Expansions


I mentioned Tangle in a previous post. Tangle was a program by TimBL that would encode links between character substrings, as far as I can tell, in bits of prose. It'd've been nice to use it on a dictionary—their whole purpose is to define words in terms of other words. I wonder just how much of an understanding of a language's vocabulary one needs for the ability to bootstrap further terms of that language using a dictionary?

Definitions are often, though not always, restatements of the meaning of a word in simpler terms. So it must be possible to expand some single complicated words into a sequence of less complicated words. There'll be a loss of meaning and connotation, but the principle of simplified English has been widely researched and is really just too close to the NLP problem.

For example, "John defenestrated Bob" could become "John threw out the window Bob". The grammar's off, and this is mainly hypothetical, but the principle's clear.

The quick thought that I'm just trying to scribble down here is that of a system for measuring the threshold for definitions' complexities. You could count how many times a word is used in other definitions, and give it a commonness index based upon that. Then, in a certain piece of prose, you could expand words that are below a certain index, and keep expanding their definitions until hopefully you were only left with words below that index. If a word already expanded appears in some level of expansion of its definition, then you've got a loop and you could break there.
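That count-and-expand scheme might be sketched like this (toy dictionary; treating "commonness below threshold" as "expand this word" is my reading of the idea):

```python
def commonness(dictionary):
    # Score each word by how often it appears in definitions
    counts = {}
    for definition in dictionary.values():
        for w in definition.split():
            counts[w] = counts.get(w, 0) + 1
    return counts

def expand(text, dictionary, counts, threshold, seen=None):
    # Expand any word rarer than the threshold into its definition,
    # recursing, and breaking if a word reappears (a loop)
    seen = seen or set()
    out = []
    for w in text.split():
        if w in dictionary and counts.get(w, 0) < threshold and w not in seen:
            out.append(expand(dictionary[w], dictionary, counts,
                              threshold, seen | {w}))
        else:
            out.append(w)
    return ' '.join(out)

d = {'defenestrated': 'threw out the window'}
expand('John defenestrated Bob', d, commonness(d), threshold=2)
# 'John threw out the window Bob'
```

As the post says, the grammar comes out wrong, but the principle survives.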

Many bonus points to anyone crazy enough to implement or have already implemented this.
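For the crazy, a rough sketch of the scheme in Python, using a made-up toy dictionary; the threshold value and the expand-each-word-once loop guard are just one interpretation of the idea above:

```python
# Sketch of the commonness-index idea: count how often each word
# appears across all definitions, then repeatedly expand any word
# in a text whose index falls below a threshold. The dictionary
# here is an invented toy example.

def commonness(dictionary):
    counts = {}
    for definition in dictionary.values():
        for word in definition.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def expand(text, dictionary, threshold):
    counts = commonness(dictionary)
    seen = set()  # words already expanded; guards against loops
    words = text.split()
    changed = True
    while changed:
        changed = False
        result = []
        for word in words:
            if (word in dictionary and word not in seen
                    and counts.get(word, 0) < threshold):
                seen.add(word)
                result.extend(dictionary[word].split())
                changed = True
            else:
                result.append(word)
        words = result
    return ' '.join(words)

dictionary = {
    'defenestrated': 'threw out the window',
    'window': 'opening in a wall',
}
print(expand('John defenestrated Bob', dictionary, threshold=5))
```

The grammar comes out mangled, as predicted, but the expansion-to-fixpoint behaviour is there.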

Schema-Aware RDF Editing


Posted as an email to rdfweb-dev: Schema-Aware FOAF Editor. The general tenet is that it makes it a lot easier to edit RDF files when you take into account information from the schema, e.g. cardinality restrictions, domain and range constraints, and so forth. You don't want to have to enter a quoted "literal" when you know that the range is rdfs:Literal. So I basically decided to write a program that can edit from the command line to test out the approach.

Also: blogged on #rdfig, RDF Miniserializations (the lead-up to the editor).

Note that I've also implemented cardinality checking and rdfs:label-usage since I posted the email to rdfweb-dev.

The code that I've written is in Python, and it uses the RDF API that I've been working on, which is now named pyrple. The editor code will be released after pyrple... but you can write to me to ask for a demo, if you really want to.

Flag Day Problem


In cramming date-stamped notes into a file, it's better to operate on those notes by entry number rather than date. So you might have a notes file that has three entries:

2003-10-15T15:17:19	@@ do the washing
2003-10-15T15:17:25	project: do something cool
2003-10-15T15:17:37	@@ use "todo" instead of @@?

And when displaying, you'd like them to display as:

0) @@ do the washing
1) project: do something cool
2) @@ use "todo" instead of @@?

So that now you can specify an operation such as $(update 2 s/^@@/done:/) to mark a todo item as done. Great.
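The display-and-update scheme could be sketched in Python like so; file handling is elided, and the update interface here is just an interpretation, not the actual tool:

```python
# Minimal sketch of the entry-number view: notes are tab-separated
# "timestamp<TAB>text" lines; list them by index, and apply a
# regex edit to one entry by number.

import re

notes = [
    '2003-10-15T15:17:19\t@@ do the washing',
    '2003-10-15T15:17:25\tproject: do something cool',
    '2003-10-15T15:17:37\t@@ use "todo" instead of @@?',
]

def display(notes):
    # strip the timestamp and show each note prefixed by its index
    return ['%s) %s' % (i, line.split('\t', 1)[1])
            for i, line in enumerate(notes)]

def update(notes, n, pattern, replacement):
    # keep the timestamp; rewrite only the note text
    stamp, text = notes[n].split('\t', 1)
    notes[n] = stamp + '\t' + re.sub(pattern, replacement, text)

update(notes, 2, r'^@@', 'done:')
for line in display(notes):
    print(line)
```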

Now the flag day problem is simply that I think that once you've entered 10000 notes, using an index number is annoying again. So you might want to have a flag day every year where you start from 0 again. You can reference old notes by doing year:index. But that means you get a flag day problem of having to do year:largeindex for notes made at the end of the last year—and that's annoying because at the beginning of the new year, those notes will be very recent.

One could specify an overlap period, i.e. have a flag day that changes posts from 6 months ago backwards, but that basically means that indexes will have to change, which may be irritating.
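For illustration, a resolver for the year:index referencing scheme might look like this; the data structures and names are hypothetical:

```python
# Hypothetical resolver for the year:index scheme: a plain index
# refers to the current year's notes, while "year:index" reaches
# back into an archived year.

archives = {
    2003: ['old note one', 'old note two'],
}
current = ['new note']

def resolve(ref):
    if ':' in ref:
        year, index = ref.split(':', 1)
        return archives[int(year)][int(index)]
    return current[int(ref)]

print(resolve('0'))
print(resolve('2003:1'))
```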

Just noting the issue, really. Comments welcome.



The RDF API that I've been working on, pyrple, has been released. Google doesn't seem to have picked it up, but it did appear on the Daily Python URL so a handful of people have come across it.

It allows all sorts of low-level RDF munging: it's able to parse the usual suspects (RDF/XML, N-Triples, N3), provide a query interface in the API (I've experimented in bolting SquishQL onto it), and have all sorts of RDF tools built on top of it. I've already made a schema-aware RDF editor, RDFe, an OWL syntax checker, an RDF diff tool, and so on.

At the moment, queries don't allow constraints, but I have written the constraints code, and it's just a matter of hooking it up. I've written the code from the point of view of trying to make everything natural. For example, to load a file with a URI into a graph class no matter what its serialization, you just do Graph(uri="http://example.org/rdf"). Likewise, graph isomorphism can be tested using G == F.

pyrple also aims to be minimally interdependent, so that parts of it can be reused more easily in other projects, and so that the source itself is easier to follow.

There are some interesting parts to the code: the singleton type in namespaces.py, the use of __new__ in node.py, and the general craziness of graph.py. I've not had much feedback, which isn't surprising since I've not really been promoting it due to the beta nature of the code (it's solid, but the API may change), but I'd appreciate any comments.

Weblog? Oh, That Weblog...


Sure, I forget miscoranda once in a while. But with good reason: I'm a busy person, you know.

Nonetheless, I'm taking time out now to tell you what'll be coming up if you watch this space. For a start, I'm abandoning the miscoranda-is-the-center-of-the-universe idea; Weblogs really aren't for me, as you may have gathered by now from my neglect. Instead, I've been working on a wiki.

I managed to write wypy, the world's smallest wiki not so long ago, and all was fine until some guy beat it with a 19 line wiki in Ruby. Ruby? What on earth is going on? Mine was in Python, and the one that beats it is in ruby. Hasn't anyone ever heard of perl? But I digress. Python's fine enough, and with a little help from a friend (ahemaaronswahem) I managed to get wypy down to 18 lines. Expect a release as soon as Morbus's post on DNN about his LibDB project dissipates into the background. I like to commandeer his weblog when he least suspects it, and when I've got a half-good idea or thing to promote. It's nice having ties into a popular blog.

Speaking of half-good ideas and popular blogs, I've an article coming up which I think may be good enough to merit a link from--wait for it--Aaron's blog. Oh yes. It's that, in my mind, good. But we'll just have to wait and see about that. I'm sure Aaron won't be reading this, so I can happily spring it on him as a surprise: "Aaron? Coverage? Please?" I'm such a moocher sometimes.

Now, the other interesting point that I have to make in amidst this rambling nonsense is that this article is hosted on a site that uses yet another wiki that I've made. The wiki's name is pwyky, and it's a huge single-file Python wiki that might turn a few heads when I release it. So there are three things coming up in total:

Again, I'm just waiting for DNN to clear...

Oh, and happy new year to anyone that I forgot to say happy new year to!



Not one but two new wikis have been released today. The first is the 18-line wypy, an update beating the previous 23-line version of the same wiki, and bringing it back to the top of the Shortest Wiki Contest.

The second, more exciting one if you're actually looking to use a wiki, is the fabled pwyky. Crafted carefully over the previous couple of months, I've taken care to ensure that this single-file Python CGI is as easy to install as possible. All you need is Apache and Python, and the file practically installs itself. There's even a pwyky installation online for you to try before you install.

I'm using pwyky to power a site of mine already, and I'm finding it a pleasure to work with. If you install either of these wikis, or have any comments, please let me know!

Who Needs RCS?


I do. But rcs, as we all know, is not an enjoyable program to use. The solution to the lack of enjoyment amongst version control software is to reinvent the wheel and roll your own inferior piece of code—and in as few lines as possible, to make it a challenge!

I decided that the lowest common denominator for version control is that you have a timestamped copy of your work at points which you, the author, choose. A program that copies a file into the same directory with an appended timestamp fits that requirement well enough, whilst being the simplest thing that could possibly work.

Here's the first program that I came up with; a mixture of sh and python:

cp $1 $(python -c "import sys; n=sys.argv[1]; \
        i=((n.rfind('.')+1 or (len(n)+1)))-1; \
        print n[:i]+'$(timenow)'+n[i:]" $1)

Note that timenow on my setup invokes date -u +%Y%m%d-%H%M%S, since I use it fairly regularly.

I'd challenged myself at the time to do it in straight bash, though I couldn't think of a particularly easy way without leaning on sed, etc. But then, thanks to some spurring on from Jill, I got it:

EXT=.${1#*.}; if ((${#EXT} == ${#1}+1)); then EXT=''; fi
cp $1 ${1%%.*}-$(date -u +%Y%m%d-%H%M%S)$EXT

And it's shorter, too! Your challenge, if you accept it: make it even smaller. You can use perl if you like. If you want to add features, you can make the program backup to a dotfile hierarchy, and possibly gzip the contents. Answers on an email, as ever.

Pyrple Update


Pyrple, the RDF API that I've been hacking on in Python, is now available in 2004-01-26 flavour. The main difference is that it now supports builtins (like CWM) via a builtins module which is quite a nice piece of work.

MySQL support is carefully being hacked in by Deelan, who may well eventually release a database-enabled branch of pyrple.

Pyrple was started as an API that I could use in various applications that require a datastore with query functionality, but it turns out that I've only developed a handful of such applications, so at the moment it's lacking impetus that I can use to develop it further. I still, however, consider it an active project. If you've put it to any use, it may be quite a boost for me to see it in action elsewhere, so do let me know.

Sean B. Palmer's 115th Semantic Dream


I dreamed about a chemical wedding in which XML and Unicode gave birth to SGML, and Jon Bosak was the best man. I dreamed about a fox and a pigeon, and a stack of triples that reached into the heart of Helen of Troy, and I sighed when I was told that she was not for me.

I dreamed about seven RDF tools, and each RDF tool supported seven formats, and each format supported seven models, and each model supported seven forms of reification. I dreamed that they were all coming back from St. Ives. I dreamed that a man called Clay had eaten all 5 exabytes of human knowledge and did not want to find a restroom.

I dreamed that a friend of a friend bought me a rose which died before she even bought it. I dreamed a dream of the world telling Berners-Lee that he'd have to wait his turn, and Tim handed Aaron a book and Aaron sighed. I dreamed that we went insane and no one cared, and that he wasn't insane enough and no one cared, and that I wasn't insane enough and no one cared.

I dreamed that in time past I found paradise but that its gates were locked and I was on the inside. And the man at the gate said that I was not to help anyone on the outside, and he made me angry and fearful and desperate. And I shouted through the gates and was threatened that I'd be thrown out. I saw all those things, and will never forget them.

I dreamed that I went to Dublin, Ohio, and that someone held a mirror to my head and demanded that I describe everything in the room, but I could not see the room and I could only barely make out myself in the mirror. I dreamed that the mirror spoke to me and said that I'd save the world, but the mirror was a liar, and it was only me in the mirror anyway, the mirror wasn't real.

Then I dreamed that I walked into a unicode bar, and all the 'phones had U+8728 written on them. I tried to put in a U+00BC but the 'phone still did not work. And only a guy named U+0394 could explain it.

I dreamed that ten thousand were drowned who never were born, and that each of them had a weblog. I dreamed that I'd learned Chinese, but Ash and Achewood told me that it's impossible, and anyway, Jill could interpret for me. And I dreamed that maybe she could tell me about Al and Len, and then I dreamed that even they couldn't.

I dreamed that a unicorn had exploded in Taiwan, and no one cared. I dreamed that April is the cruellest month. I dreamed that it was the 21st of April, and that someone special was gone—and I woke up, and it was true.

And everybody had gone. But there was still RDF.

And Now For Something Completely Different


The thing in the middle is a fountain.


It looks as if it's been crayoned by a primate, but actually it was produced using GIMP, which is no reflection on the greatness of that tool.

Tomorrow: RDF newsflash! We reveal the top 10 ways in which RDF is being portrayed in the media. Whatever you do, stay tuned.

RDF Project Cornucopia


To all my RDF tinkering friends:
Whilst my procrastination won't abate
And with this verse I can but make amends
I'm sure to this you can at least relate

The coming up with projects, as you know
Is sometimes harder than to implement;
Upon the fabled stack we're still quite low
And one must be so novel to invent

Therefore I give now here a concise list
That you might work upon the tasks therein
Or if there's anything you feel I've missed
Feel free to write it up and send it in

Maybe one day I'll follow all these through
In meantime, though, I've ceded them to you

A path engine in Python would be great
(For pyrple best, though I may overrate)
Or if it's challenge that you want from me
Then port afon to JavaScript and C!
A validator unlike Rosco's kind
Is something else that briefly comes to mind
And BNF-in-RDF a-show
Whyn't make a script converting to and fro'?
Or since Notation3's so hard to write
A syntax checking script would make work light.
For web browsing in RDF to be
A proxy we would surely have to see
Or transform may be just as good as well
Making from RDF HTML.
And so already we this list conclude
Hoping that there is naught else to include

I hope of value something here you found
Though doubtless there was little too profound!

Stenographic Haiku


Being the cartoon of the day:

Stenographic Haiku!

It's based on the fact that I made a stenographic haiku the other day; the rest you can work out from the cartoon itself. I am also not sure what "implemention" is.



"HELLO ALL OPEN SOURCE SOFTWARE DEVELOPERS: flexibility and plug-in architectures are a PAIN IN THE ASS! Yes, I know they're cool, but DON'T lead your project description that way."—Dan Connolly on #rdfig.

In response to yesterday's piece about RDF projects, Libby Miller raises the issue of schema generation. She witters on about it a little in an ESW Weblog entry.

In other news, some things are easy to explain:

<verbosus> I've been reading your latest stuff on miscoranda and on Morbus' blog
<sbp> hehheh. poor you
<sbp> though Morbus' cat food thing was funny
<verbosus> At first I thought you were gone mad.
<verbosus> But then I realized one can't get more mad if he already is.
<sbp> yeah. what particularly set off the alarm bells, though?
<verbosus> Nothing in particular.
<sbp> phew. well the painting was a bit of a low point, I thought
<sbp> but there was a nefarious purpose to that too: I wanted to see how many hits I was getting from planetrdf.com users
<verbosus> So, how many?
<sbp> and I think that with the sonnet and unheroic heroic verse, I've regained some sort of not-as-low territory
<sbp> don't know yet. my logs get cycled daily; they're on Aaron's box and I can't read them until just gone midday tomorrow...

The correct answer is: negligible.



I'm constantly asking for comments via email, and though I get some, I figure that I'll get more with a comment form for each post. Since miscoranda's backend is a customized set of Python CGIs, I had to hack it on all by myself; but it only took about an hour. The nice thing that I found was that I store entries with their fields separated with newlines, and the most important information towards the top... it saved time over having to parse it as RFC 822 message headers or, worse, RDF.
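A minimal reader for that sort of newline-delimited entry format might look like the following; the field names and their order are guesses, not miscoranda's actual layout:

```python
# Guessed entry layout: first line is the title, second the date,
# and everything after is the body. With the important fields at
# the top, you never need a full header parser.

def parse_entry(text):
    lines = text.split('\n')
    return {'title': lines[0],
            'date': lines[1],
            'body': '\n'.join(lines[2:])}

entry = parse_entry('Short Script\n2003-10-01\nI wrote a script.\nIt works.')
print(entry['title'], entry['date'])
```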

Aaron really needs to update to a newer version of Python on this server, though.

Wy.Py - Eleven Lines of Wikiness


I've just uploaded an eleven line version of the wypy python wiki, which I think I can quite safely say is now at the moment the world's smallest wiki. In fact, it appears to be the world's smallest wiki by some margin: WyRiki, 19 lines of ruby, is derived from wypy, and the next two contenders (TinyWiki, 28 lines of perl, and Qiki, 32 of python) don't count because they go well over the 80-character-per-line rule.

The funny thing about working on the world's smallest wiki is that the more effort you put in, the less work it looks like you've done. Also, since I apply version numbers locally according to the number of lines, it's one of the few projects where the version numbers actually go down. I'm going to have to start using character counts instead.

To make this version so small, I had to define the minimal wiki feature set. These are covered on the c2.com wiki, but there isn't much detail; what, for example, constitutes minimal wiki formatting? I've taken it to mean paragraphs, WikiName links, lists, and HTTP URI links. I dispensed with allowing the module to have importable elements, though I did enjoy the "(__name__=='__main__') and main()" hack in the previous version, and trimmed the presentation right down. The pages should, however, validate if you add an HTML 4.01 doctype to them; I could've saved half a line or so, too, if I'd've taken out the rows and cols attributes of textarea.
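That minimal feature set can be approximated in a few lines of Python. This is an illustrative reimplementation, not the actual wypy source, and the WikiName pattern is just one plausible definition:

```python
# Minimal wiki formatting: paragraphs, WikiName links, lists,
# and HTTP URI links. An illustration of the feature set, not wypy.

import re

def fmt(text):
    # WikiName: two or more capitalized runs, linked to a page query
    text = re.sub(r'\b([A-Z][a-z]+(?:[A-Z][a-z]+)+)\b',
                  r'<a href="?\1">\1</a>', text)
    # bare HTTP URIs become links
    text = re.sub(r'(http://\S+)', r'<a href="\1">\1</a>', text)
    html = []
    for block in text.split('\n\n'):
        if block.startswith('* '):
            items = ''.join('<li>%s</li>' % line[2:]
                            for line in block.split('\n'))
            html.append('<ul>%s</ul>' % items)
        else:
            html.append('<p>%s</p>' % block)
    return '\n'.join(html)

print(fmt('Hello WikiWord\n\n* one\n* two'))
```

Note that applying the WikiName substitution before the URI one is a simplification; a real implementation would have to keep the two from trampling each other.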

Having worked on wypy for so long, it's starting to affect my other coding, and I look towards compact solutions everywhere. I've found that instead of making my code worse, however, it actually makes me think more about the code itself and the possible combinations that will work. The Python motto, "there's only one obvious way to do it", may not be true anymore. I also make sure that I only ever do the simplest thing that could possibly work, but no simpler—there's no point introducing a bug to save a line.

Aaron suggested that I simply write a gzip program and then compress the source to the wiki, inflating on the fly. But I could just submit wypy, or something similar, to be a part of the Python standard library, and so "import wypy; wypy.main()" would be the world's smallest wiki. In fact, I already noted on the contest page that a bash script downloading wypy and piping it into python would be a much shorter wiki... it's getting to the point where the contest is meaningless without lots of arbitrary rules. The situation reminds me of how chess is now that computers can thrash us all, and theory has been so thoroughly explored.

One thing I've learned, though, is that people come to rely on things in code that they simply don't need. Some of the previous attempts to build a "smallest" wiki were around 100 lines, and though there were some interesting bits of compressed coding in there, it's clear that once the authors started to build their wikis, they were proud of their efforts, started to use them regularly, and hence wanted more features. My approach was to make the most brutally small wiki—no more—and yet even I had many ideas about required features that I've reconsidered in the course of the year in which I've been developing it.

Still, making minuscule code isn't really about nihilism. The current release uses some rather interesting tricks to make it so small, including long strings of "and"s and "or"s to cast everything as statements. The best feature is probably the reuse of the "s" variable to mean both submitted content, and to start a search. Hmm... how does that work?

I'd probably have to take up Perl to make a sub-ten-line version.

Autoplurals Rock


Being today's cartoon:

Haiku is easy / Levitating sheep isn't / You are now wiser

Nodes and arcs everywhere! I wonder if anyone has done any work on representing X-bar structures in RDF?

All the Operation of the Orbs


Tycho Brahé, with the 1588 publication of De Mundi Ætherei Recentioribus Phænomenis amongst other works, started to dispel the notion of celestial spheres. He was, however, no heliocentrist. And yet Kepler founded his Third Law, helping further to dispel the Aristotelian view of the world, on the findings of Brahé. Descartes' Principia Philosophiæ of 1644, with its vortices, is at odds with Newton et al., but was accepted for a long time as standard.

I'm enjoying that Trithemius' Steganographia (1606) is heralded by cryptographers as a cryptographic tome that is really about angel magic, and by scholars of the occult as a treatise on angel magic that is really about cryptography. It'd be nice if someone translated it to English and put it online so that it would be easier for the casual observer to make up their own mind.

The title of this post is, incidentally, from King Lear I.I. Oh, and also, a stone is doubtfully said to have fallen into a boat in Copinsay, Orkney in 1676, but whether it was a meteorite or the imagination is hard now to tell.

Now Comes Music


Being the cartoon of the day:

Everything is of duck. This you must know. Now comes music. [F#]

Cody believes that this cartoon expresses "the philosophy that everything has the nature of a duck at its core, and some music", whilst Ash commented that "I think it's just nothing, mostly; it is like 'None' in python".

Now I have to learn to draw straight. I've heard that there may be a thing called a ruler which could help me, but I'm wary of new fangled devices, especially when it comes to art.

Utility And More


I've written many utilities which I keep in a folder on my $PATH. There are around 60-70 in total, and they range from those that are useless to those that I use daily. As an example, here's one of the more useless but interesting ones:

Sometimes, you want to provide a few words of input to a program that will only take filenames. So you have to open up an editor, think of a filename, and then use it and remember to delete the file afterwards. Very annoying. Hence "tofile":

FILENAME=$(date '+%N')
echo -e "$@" > /tmp/$FILENAME
echo /tmp/$FILENAME

Thanks to Jill for %N. Usage example: $ cat $(tofile "hi\nhow are you?"). I think my original use case was testing "comm".

This post then is somewhat about the fact that I have utilities floating about on my computer that I don't tend to release because I'm not sure where to release them to. This is a problem which affects basically everyone that I know, but it's rather annoying for me all the same and I think that packages would come and forestall any pony and trap that you could deliver.

I shouldn't listen to music as I write—I can't hear my words.

Lots of people have utilities, but I'm not aware of many sharing them. When I go through others' utilities, I usually only find a couple that I think useful anyway. Perhaps there are large sharing communities out there that I don't know of. I should at the very least bung a .tar.gz somewhere, though I'm concerned in case one or two utilities have a stray password or somesuch in them.

"[off]" is a pretty annoying comment string to use in a logger. I've discussed this in some detail with friends, but we didn't reach much consensus on the issue. "#" was standard for a while, but "!" or ";" seem like better contenders. ";" doesn't require one to shift, which is handy.

Nine Lines' Wiki


Learned perl. Wiki at nine lines. Called "pewi", thanks to Cody.

Now that I've ported it to perl, people can and are looking over it and telling me where I can make further shortcuts. The HTML and the regular expressions do, however, impose a physical limit to a great extent, so I'd be surprised if it were possible to get it down to 5-6 lines, even with lots of Perl magic.

Just Like A Woman


Being actually nearly smile-inducing if you get the reference:

The fog was so bad she could find neither her amphetamine nor her pearls.

I still haven't found a ruler yet.

The Man in the Restaurant


A man in a restaurant laughed. Then ensued the following communication between Sean B. Palmer and Aaron Mathews, on this, the 8th day of February in the Year of Our Lord 2004. Whereas, Sean B. Palmer's esteemed friend, Aaron Mathews said "no cartoons", Sean B. Palmer, startled from his reverie, thus replied: "no. it's about cartoons. metacartoons; and not even cartoon metacartoons". Aaron Mathews, his friend, likewise proclaimed in the following most singular of manners: "no. lies". It is this to which Sean B. Palmer almost-wittily retorted as follows: "metalies". Aaron Mathews was as sharp and as keen as a rabbit on this particular occasion, however, and was able to parry the thrust with a cunningly timed: "lalalala". There ended that most peculiar of sections of the communication between Sean B. Palmer and Aaron Mathews.

Today's cartoon is titled Man in the Restaurant, and is only accessible via the link which inhabits the earlier part of this current sentence.

Further cartoons will be announced here, but will only be accessible from the cartoons page on infomesh.net. No more clutter in your aggregator. Yay.

"stupid man in the restaurant"—Aaron Mathews

Perl Wikis, More Toons


The Shortest Wiki Contest has really been progressing rapidly since I started porting to perl, with Nick Cleaton's 6-line, 357-character FleaWi currently leading the 7-line PeeWee and my own 7-line PeWi.

I've also published a couple o' more cartoon sketch things: Sharpen is a quick scribble about how the cartoons are made, and It's... Another Mouse! is an anatomical diagrammation of a mouse, for your perusal.

The first rule of miscoranda is that you do not post cross-subjects. For example, if I want to refer someone to a post about perl wikis, I want to refer them to a post that's only about perl wikis. Cartoons? Only about cartoons. Never mix topics. Also, never follow silly rules.

xd && atomToRss && ppr


Three interesting new tools have been released in the last couple of days. The first is xd, a hex viewer by Daniel Biddle that was written in just one hour after I challenged him to reproduce xxd in that time period. Source: xd.c, Makefile; $ make && make test.

The second is Antonio Cavedoni's Atom 0.3 to RSS 1.0 conversion webservice, which runs off of an XSLT stylesheet that he wrote.

The third is my own ppr, an experimental sed-like tool that lets you process text input line by line asserting length, stripping, quoting, and so on. It's a nice framework in which to add future commands.

The cartoon of the day is a maze: Centripetality.

Classificatiana I: 200X


I've started using a /200X/ folder on infomesh.net to hold miscellaneous data, eschewing the convention of having a new directory every year. That is to say that instead of having a /2004/ directory this year, and a /2005/ one the next etc., I'll be using /200X/ for the rest of the decade.

This was discussed on #rdfig extensively by myself, Dan Connolly, Aaron Swartz, and Daniel Biddle. The rationale for my decision is:

Antonio Cavedoni has argued against this, stating that "200X" is more difficult to explain to someone than a date, especially in Italian. There are no other characters that particularly fit, though /200-/ and /200/ are ones to consider. The old point that URIs should never be seen was also raised, but is bogus since URIs have to be transcribed into emails, spoken over the 'phone, and advertised on the sides of buses.

Other schemes, for example http://inamidst.com/stuff/gmail/, have been proposed, but they similarly suck. Using /200X/ is aiming for the least bad solution. As Ted Nelson said, "hierarchies are evil".

This entry is the first part in a series on classification. What is classification? How does it affect our daily lives? What would we do without classification? Look around you. Just look around you. All of these questions and more will not be answered in this series. Do enjoy.

Porphyria's Lover


Scott McCloud was mentioned to me today, which made me browse his website and find his cartoon rendition of the poem Porphyria's Lover, originally written by Robert Browning in 1836. The poem interested me and sparked a number of questions: What were Browning's sources? How was his bright iambic tetrameter received at the time? Why is Porphyria named thus?

I googled, but instead of answers, I got an echo of idiocy that made me shiver. "Write the police report after the crime." Sure, that's going to excite students.

I did, however, eventually find another article which wasn't so bad, and it answered a couple of my questions; but it didn't explain who Alice Perrers is, so I looked on Wikipedia. And that led me almost directly to Corona, which makes this post: a linkfest.

I have a theory that weblogs fall into one of the following seven categories, no matter what you try to do to escape it: linkfest, retort, diary, bullshit, technology, bad humour, short pithy tract on Tycho Brahé. I was going to link these to example posts, e.g. retort to Aaron's recent Lessig report, but I chickened out at the last moment and instead leave it as an exercise for the reader.

No cartoon today.



Jim Ley is an SVG and JavaScript maven, and constantly comes up with surprising solutions to problems you never even thought you had. If you've been following the Shortest Wiki Contest, you'll know that the code has been getting towards five lines these days, and Jim set out to beat it. This is what he came up with:

<b><textarea onchange="x=new XMLHttpRequest();x.open('PUT',location);
'<a href=\u0022$1\u0022>$1</a>')+this.parentNode.innerHTML);

It's an HTTP PUT wiki using JavaScript. Scary, no? It works, for some value of works, and you can test it out at the JimPUT site I've set up. The source code for the PUT handler in Python is available too. Note the duplication bug caused by parentNode not behaving quite as expected; there may be a fix.

But just imagine setting the HTML as your 404 handler, and then having a directory wide PUT handler... Backlinks don't work, but perhaps the handler could be made to do that—though then the code would be distributed, so either way it doesn't really count for the Shortest Wiki Contest. It may, however, deserve an honourable mention!

I must also just note that QuickPut, an old Python HTTP PUT command line tool that I used to debug the new PUT handler, rocks.



At three lines, Wikke is probably the edge of the Shortest Wiki Contest. It's a modified version of Jim Ley's JimPUT idea, mentioned previously here on miscoranda, made to conform more to the contest rules using PHP. From the announcement:

Wikke is a 3 line 206 character wiki clone in PHP, JavaScript, and Bash. It was written for the Shortest Wiki Contest by Jim Ley, Sean B. Palmer, and Adam Wendt, with help from Charles Goodier, Arcon, and Daniel Biddle.

The source is small enough now to fit into a sig block. Many thanks to Adam Wendt, aka thelsdj, for the use of his server in developing the code.

Weblog Design Shame


Today I thoroughly criticised weblog design on Swhack, writing up summaries of all of the blogs on Aaron Swartz's blogroll, and discussing the features that make good (and bad) blog designs.

The exercise was meant to be a personal thing for my own instruction, but a few people have become interested in the frankness of my criticisms already—and I've also been corrected in some notable areas. For example, Tantek pointed me to his post on why his style is like it is, which is a very interesting idea (he's redesigning popular styles in CSS).

I decided that oblomovka, daringfireball, and Mark Bernstein are my favourite designs. They all share a sense of balance between the colours and the typography and the layout that most neglect; they each have a distinctive feel to them that sets them apart from the others, and makes them memorable and easy to recognize. The other criteria that I decided are of most utility to designers are consistency and tone: if you use garish colours, or ones that don't match, then they'll be distracting. What baffles me is that this is obvious advice, and yet barely anyone seems to follow it.

I also observed the general trend that the bloggers that were more well known to me also tended to have better designs: though there were some surprising exceptions to that rule.

Overall, I started the study because I'm interested in equilibrious, non-intrusive and yet interesting, designs, and found much less than I thought I would. But design is subjective, and personal: for the projects that I'm doing, I'm looking for something as distinctive as possible, and that's one of those things that is very hard to learn.

Constitutional Loophole Uncovered


Aaron: The Constitution only says the state can't support the establishment of a religion.
Sean: By establishment does it mean inception, or the committee/rulers of a religion? Is it a verb or a noun?
Aaron: Anything promoting the religion
Sean: So a noun?
Aaron: E.g. putting a manger in front of city hall is not ok.
Sean: Only if city hall did it.
Aaron: Er, right. Yeah, noun, I guess.
Sean: But what if city hall decided that economics is so complicated that they needed to create a new language to capture its nuances? And what if a part of that language, one of the fundamental aspects of its grammar, was the act of putting a manger outside of the city hall building? Then it'd be protected under free speech, no?
Aaron: I think they'd be told to get a new language.

From #swhack. And also, Aaron: "I don't think the Government has a right to free speech; that's an interesting cross-constitutional issue".



I posted a couple of notes to the CWM lists, one explaining N3's @keywords, and the other detailing what I thought was an @keywords bug in CWM. TimBL gave a rather great reply. The interesting thing I missed out was that when it says keywords, it really means keywords: all keywords can be prefixed with an @ now if not explicitly declared with @keywords. The upcoming addition of "@prefix : <#> ." as a default is also most welcomed.



Adam Wendt's words of the month are: valetudinarianism, parturition, allelomorph, and coenesthesis. Your bonus word is coenobite. The invented word of the day is "forbidant", a noun meaning "that which is forbade".

The chord of the day is C#m, and the not-so-random link is blargh.

Strange In, Strange Out


Take ninety one (91) miscoranda posts, and feed them into one (1) markov-chain script, and guess what results?

I've just uploaded an XSLT stylesheet that he wasn't insane enough and no alpha-page. The name is very important—it's the first program that will only be accessible from the namespace prefixes will be covering mainly home-spun projects instead of making my code worse, however, it actually represents a certain index, and keep expanding their definitions until hopefully you were only left with words below that index.

I came across The Complete Corpus of Anglo-Saxon Poetry a while ago, but don't think the Government has a nice framework in which RDF is being portrayed in the future.

Definitions are often, though not always, restatements of the ground.

Now the flag day problem is simply that I was going to have found an 800 mile diameter Kuiper belt object that may be big enough to need those changes. So it's not going to happen.

You may have woken during the morning to find a restroom.

A while ago, but don't think I ought to start sharing some of the meaning of a unified England, commanding as he did all of Mercia, gaining many allies from the namespace prefixes will be sure to this mini-rant is that this is obvious advice, and yet interesting, designs, and found much less than I thought you were providing a todo item, you might want to add future commands.

I'm thinking about creating all of my future posts out of these markov-chained old posts...
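The script itself isn't reproduced here, but a minimal word-level Markov chainer of the kind described works something like this (the function names and the order-2 window are my own choices, not necessarily those of the original script):

```python
import random

# Build a table mapping each pair of consecutive words to the list of
# words that followed that pair somewhere in the source text.
def build_chain(text, order=2):
    words = text.split()
    chain = {}
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain.setdefault(key, []).append(words[i + order])
    return chain

# Walk the table from a random starting pair, picking a random
# successor at each step, until we run out or hit the length limit.
def generate(chain, length=30):
    key = random.choice(list(chain.keys()))
    out = list(key)
    for _ in range(length):
        choices = chain.get(tuple(out[-len(key):]))
        if not choices:
            break
        out.append(random.choice(choices))
    return " ".join(out)
```

Feed it ninety-one posts instead of one sentence and you get exactly the sort of locally-plausible, globally-unhinged prose above.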

Joyful Lasciviousity


I would first like to apologise again to Adam Wendt for calling him Adam Keys. With that aside, it's limerick, sayings, and poetry time. Enjoy!

A wanton young lady from Wimley
When reproached for not acting quite primly
Replied, "Heavens above!"
"I know sex isn't love!"
"But it's such an enchanting facsimile!"

via Jill Lundquist

She notes that the "facsimly" rhyme is the best part. But Londoners do it best:

Here did I lay my Celia down
I got the pox and she got half a Crown.

anon., Chancery Lane, 1719

And now a silly one to trail out with...

Mary had a little lamb
She tied it to a pylon
Ten thousand volts went up its ass
And now its wool is nylon

There. That'll do nicely.

Converting iso-8859-1 Into utf-8


Tricky, right? Not really.

import sys
for c in sys.stdin.read():
   if ord(c) < 0x80: sys.stdout.write(c)
   elif ord(c) < 0xC0: sys.stdout.write('\xC2' + c)
   else: sys.stdout.write('\xC3' + chr(ord(c) - 64))

To use: pipe iso-8859-1 in, get utf-8 out. Or here's a slightly nicer script which works using regular expressions, and can take filename arguments too:

"""Convert iso-8859-1 to utf-8. Sean B. Palmer."""

import sys, re; r_iso = re.compile('([\x80-\xFF])')

def iso2utf(s):
   def conv(m):
      c = m.group(0)
      return ('\xC2'+c, '\xC3'+chr(ord(c) - 64))[ord(c) > 0xBF]
   return r_iso.sub(conv, s)

def main(argv=None):
   if argv is None: argv = sys.argv[1:]
   for fn in argv:
      s = iso2utf(open(fn).read())
      open(fn, 'w').write(s)
   if not argv: sys.stdout.write(iso2utf(sys.stdin.read()))

if __name__=="__main__":
   main()

The main() idiom that I used is quite useful, incidentally. Let's hope that Karl finds this script easier than using iconv!
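Incidentally, the reason the byte arithmetic above works is that iso-8859-1's 256 code points coincide exactly with the first 256 Unicode code points, so every character encodes in utf-8 as either itself or a 0xC2/0xC3 lead byte plus a trailing byte. That also means the whole conversion can be done with the built-in codecs; in Python 3, operating on bytes, it's just:

```python
# Equivalent conversion via the built-in codecs (Python 3: bytes in,
# bytes out). iso-8859-1 bytes map one-to-one onto code points
# U+0000..U+00FF, which is why the hand-rolled version above only
# ever needs to emit a 0xC2 or 0xC3 lead byte.
def iso2utf(data):
    return data.decode('iso-8859-1').encode('utf-8')
```

For example, iso2utf(b'caf\xe9') gives b'caf\xc3\xa9'.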

Favicons and Anticryptography


I've started to find favicons useful because they orient me as to which site I'm on and which sites are where, and because they're occasionally interesting pieces of design. So I set about the task of making one for infomesh.net. Here are some of the contenders:

i ii iii iv v

I'm currently using the second symbol, as you may have noticed. We've been coming up with lots of different ideas for what could go in the little 16x16 pixel grid, but one of the best suggestions was anticryptographic in nature. Brian McConnell's The Next Frontier in Computer Science is the first article on Google for the word, but the basic explanation is that it's the study of shaping data such that it can be interpreted with as little context as possible. It's normally mentioned in the same sentence as "when sending data to potential alien planets".

Cody (this is d8uv I'm talking about here) came up with what I deem the best anti-cryptographic method for encoding text into a favicon. First, we decided to use a character set containing only the twenty-six letters used in English, and the space: [A-Z ]. Then he took each pixel to represent one character, and set R, G, and B channels according to the following algorithm: multiply (255 / 27) by the index of the character to be encoded. Simple no? I failed to decode the example he gave me, which was about 10 characters in length, but I did manage to correctly guesstimate that FF was a space, given that it never appeared more than once in a sequence, did not appear at the beginning or the end, and was by far the most popular byte used.
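For concreteness, here's how the scheme might look in Python. The 1-based indexing is my own assumption, chosen so that the space, as the 27th character, comes out as 0xFF and matches the observation above; Cody's actual script may differ:

```python
# Sketch of the anticryptographic favicon encoding described above.
# Assumption: indices run 1..27, so the space maps to 255 (0xFF).
CHARSET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "  # [A-Z] plus space

def encode(text):
    """Map each character to one grey pixel: R = G = B = the byte."""
    pixels = []
    for ch in text.upper():
        index = CHARSET.index(ch) + 1         # 1..27
        value = int((255 / 27.0) * index)     # space (27) -> 255
        pixels.append((value, value, value))
    return pixels

def decode(pixels):
    """Invert the encoding: nearest index wins."""
    out = []
    for (r, g, b) in pixels:
        index = int(round(r / (255 / 27.0)))  # back to 1..27
        out.append(CHARSET[index - 1])
    return "".join(out)
```

With only 27 symbols spread over 256 byte values there's plenty of slack, so the decoder can round to the nearest index and still recover the text exactly.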

Before that, I'd been working not on anti-cryptography, but on data compression, and had been studying reduced character sets such as Morse Code, Hawaiian, and a phonetic system based on shorthand. Cody also noted that having a randomly generated favicon would be interesting; I wondered whether a randomly generated and reduced fractal, or randomly crawled image from the Web wouldn't be better.

On Useful Redundancy


I always thought that having three different character escaping methods in HTML was a bit excessive. Being able to write &mdash; or &#x2014; or &#8212; means that you have to think about which one you're going to use, and it's usually a pretty arbitrary choice. Most people prefer to use named entities when they can, but there aren't named entities for everything.

But the redundancy actually came in useful whilst I was developing pwyky, and it resulted in a neat hack which I thought I'd mention here.

In ASCII, in email and IRC, I tend to use a double hyphen with no surrounding spaces as an em-dash. In pwyky, I wanted to use the same thing, and have it be displayed as a proper em-dash in HTML. Easy enough. But I also wanted to allow any unicode to be entered, so I'd allow {U+HHHH} for that. This means that there are two ways in pwyky to enter the em-dash: either "foo--bar", or "foo{U+2014}bar"; guess which one is the more popular? But the latter syntax is useful, because the former is only employed when the double hyphens are surrounded by word characters.

A problem arose in that both -- and {U+2014} are converted to the same bit of HTML, "&#x2014;". Pwyky works by compiling your text input to HTML when you save it, and then converting it back to text again when you go to edit it. But on the conversion back to text, it didn't know which form of entry was being used for em-dashes since they were being converted to a single form in HTML. Argh.

The obvious answer: use &#x2014; for "{U+2014}", and &#8212; for "--". I had to hope now that the standard HTML parser module in Python could distinguish between the two, and thankfully it did.
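In sketch form (pwyky's real code differs, and these function names are mine), the trick relies on each input syntax compiling to a different but equivalent HTML escape, so the reverse conversion is unambiguous:

```python
import re

# Compile the two em-dash syntaxes to *different* HTML escapes.
def text_to_html(s):
    # "foo--bar": only when the hyphens sit between word characters.
    s = re.sub(r'(?<=\w)--(?=\w)', '&#8212;', s)
    # "{U+2014}": the explicit unicode entry form.
    s = s.replace('{U+2014}', '&#x2014;')
    return s

# Because the escapes differ, we know which input form produced each.
def html_to_text(s):
    s = s.replace('&#8212;', '--')
    s = s.replace('&#x2014;', '{U+2014}')
    return s
```

Both escapes render identically in the browser, but the round trip back to editable text is now lossless.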

I wonder what I would've done had there not been such a redundancy in HTML? I probably would've had to guess that any situation where a double hyphen could be used should be converted back to a double hyphen, but it's not the ideal solution.

Blogger Admits to Making Posts Up As He Goes Along


New Zealand writer Kevin Hemenway today admitted to creating his entries for the Disobey Nonsense Network as he goes along. "I just didn't think it was a big deal, you know? Like everyone does it", said Hemenway when interviewed for CNN. "It's the sign of our times, and I'm the victim here. I just don't have enough time to write a year's worth of entries in advance".

The New Zealand equivalent of the FBI are investigating the matter, and will be deciding whether or not to prosecute Hemenway in due course.



Dr Michael Brown of Caltech has discovered the largest Kuiper Belt object yet known, now named Sedna. The bad thing is that the media are pushing for it to be called a planet just so that they have something to report on, and people are worried that Pluto will be demoted if it's not. But come on. Next you'll be saying that Ceres and Vesta (and Pallas) are planets too. I think I put it best on Swhack:

<sbp> It's totally a Kuiper Belt object like Quaoar, so it'll probably be dismissed.
<sbp> I mean, really. You're just giving too much for kids to learn.
<sbp> Back in the olden days when it was just Mercury, Venus, Earth, Mars, Jupiter, Saturn, Sun, Moon, people could dig astronomy, y'know?
<sbp> Then it was like, well, the Sun and Moon are too bright to be a planet, but we'll make up Uranus and Neptune to take their place...
<sbp> So everything was way cool for a little while more. But astronomers want fame, and the best way to do that is by discovering another planet. So several dudes discovered Vulcan, between the Sun and Mercury.
<sbp> But Vulcan's a pretty crappy name for a planet, so they're like, well let's name one after Mickey Mouse's dog instead.
<sbp> And kids, you know, they can still dig that because of that cartoon connection.
<sbp> But Sedna? What're they going to use to remember that? Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune, Pluto, Sedna?
<sbp> Sorry, no. Stupid plan. Use another Disney character, K THX.

I want to be in an age of discovery as much as the next guyette or guy, but we're done with the Solar System's large objects. Let's move on to finding stuff around other stars now, unless there's a near Kuiper Belt object that's bigger than Pluto.

Meanderings About Editing


Today's been an interesting day already, with deltab providing the Television Tropes, Idioms, and Devices as a source of entertainment, and the source of a few patterns. One such pattern was up a meta-layer, in thinking about the Tropes/Idioms wiki itself: its titles are often memeworthy, it falls back frequently on a particular hyperbolic phrase, "used in every episode of..." (it's on every page!), and so on.

I had one of my brief but pithy conversations with DanC on #rdfig where I managed to get out an observation on abstract/introductory repetition which I can now cite when I want to bring it up. And there are the results of a musing on Swhack given the premise "rich nut emails you and gives you a practically unlimited budget to come up with your own OS".

But what I'm focussed on generally at the moment is the Seven habits of effective text editing article. Basically, it's thinly-veiled VIM propaganda, but it neatly exposes the problem that it's difficult to talk about editing without resorting to explaining it with a particular instance of an editor.

So I'd been thinking that it would be nice to write an article with the following premise: if you're a good enough programmer, then you ought to be able to systematically analyse the kinds of editing tasks that you're most likely to perform, and then build an editor around your own observations.

I know that listening to someone talk about User Interface is very tedious...

"I'm a bit embarrassed, as everyone has pet ideas about how the UI is frustrating, and listening to them can be tedious, I know! Perhaps this is why I haven't written this down before."

TimBL, Editing User Interface

But such an article could assuage that problem by concentrating on design patterns that can be reused (which is where the wiki names thing from earlier comes in). For example, I recently decided that if I were to work on an editor for myself, I'd like to abstract all the events so that I could have separate GUI (wxPython) and text-mode (ncurses) implementations, ensuring that I wouldn't have to adapt depending on whether I'm using a terminal or a GUI OS.
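A bare-bones version of that abstraction (entirely hypothetical; I haven't written this editor, and the names are illustrative) would put the event vocabulary in the core, leaving each frontend to translate its own keystrokes into dispatch calls:

```python
# Hypothetical sketch: the core editor knows only abstract events.
# A wxPython or ncurses frontend would translate its toolkit's key
# events into dispatch() calls, so the core touches neither toolkit.
class EditorCore:
    def __init__(self):
        self.buffer = []
        self.handlers = {
            "insert": self._insert,
            "delete": self._delete,
        }

    def dispatch(self, event, *args):
        self.handlers[event](*args)

    def _insert(self, text):
        self.buffer.append(text)

    def _delete(self):
        if self.buffer:
            self.buffer.pop()
```

The point of the design is that the terminal/GUI decision becomes a packaging detail rather than something baked into every editing operation.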

The main thing that one has to decide in such circumstances is whether the effort put into such a task is going to be more than offset by the benefits or not. Morbus has talked about this:

I'll drop everything I'm behind deadline on and spend 20 hours automating a task that takes five manual minutes; I know I'll eventually recoup the benefits months down the road, after I've long forgotten the automation exists ("you only notice electricity when it's missing"). I've automated iTunes album listings, video file annotations with AppleScript and Perl, have tried numerous todo and schedule outlines, and generally enjoy having my computer be far more "robot" than those of 90% of the populace

Morbus Iff, Failing Miserably, If Not Inventively

Morbus came up with a rather shocking (or simply catalystic, as he put it) piece of rhetoric the other day about productivity that gave both deltab and I food for thought for some time, but that'd probably best be left to another rant. Funny how it partially resulted in my finding out that the song Lone Green Valley is a derivative of Pretty Polly. According to Google, no one seems to have noticed that before.

Testing Gmail


April Fools turned into Christmas very fast. When you get given a gmail account for testing and told to use it, break it in, love it, report bugs, and talk about it all you like, you do exactly that. Screenshots first, explanations thereafter:

screenshot 1 screenshot 2 screenshot 3

Gmail only innovates where it needs to: threading as default view, the enormous space, use of labels/keywords instead of folders, and the ability to use Google's search power on your email are the main such features. The first screenshot is of the search functionality. Note that when you search for messages, it displays the whole threads that are found, not individual messages. The interface is consistent in that everything is categorized and viewed based on metadata in the emails, or the labels that you assign. You can assign multiple labels per email, but mostly you aren't going to need them: now you can search for "to:(www-talk)" and have it return all of your www-talk mailing list emails. There's no longer really any use for personal folders.

Of course, there are still folders, but they're treated like special labels. "Inbox" is itself a label that's applied to all new mail, "Starred" is a message highlighting feature (essentially it's a label that uses a funky star logo instead of a name), and "Spam" and "Trash" fulfill their obvious functions. When you want to remove an email from your inbox (if you should want to do so), the process for doing so is termed "archiving" in the interface, and the email then becomes available only by clicking the "All Mail" link, or by searching for it.

Consistency and attention to the user's experience are rife: for example, the use of "Newer" and "Older" instead of "Forward" and "Back", and very quick JavaScript interface. Even the signup is clean and smooth, allowing you to write your own secret question for example. Sadly, you can only choose a username between 6 and 30 characters; I really wanted "sbp", but no one is allowed a short username.

The second screenshot shows a single email ready to be replied to. See the input box at the bottom? As soon as you click inside it, it morphs instantly into the view in the third screenshot.

In fact, the system is so straightforward that there isn't really a great deal to comment on. The adverts are very similar to their usual Google search counterparts, but they don't get displayed all the time. They seem to be more common the more text is in an email, which makes sense given that it allows for a more targeted ad-set. Privacy is a valid concern, but having used Google search for years, I'm sure they know enough about me already, and the concerns are no greater than for other Web based email accounts. At least, unlike certain providers, Google won't automatically spam you as soon as you've signed up.

I was worried about information integrity, but it would obviously be a very bad thing indeed if Google were to go losing their customers' emails, so it's likely that they're committed to gmail being fast and reliable. I'm told that there will even be a way for users to export email in case they want to switch to another system; it's not been implemented yet, and I'm not sure what form the export will take, but that it's being provided at all is odd given that it'll circumvent the adverts.

There are still some display bugs, as befits the beta phase, so I can understand why Google would not want to make it public yet, though I do think that having released it before common availability is a bit of a tease for us non-Google employees—especially given the April Fools day metaprank. Nonetheless, there are no bugs that prevent usage, and gmail is already so good that I've no qualms over using it for my regular general email straight away. Looking over the list of issues that I have already, most of them are actually feature requests: for example, it'd be nice to have the ability to remove labels that are different to the label view you're currently in, and I'd like a Thunderbird-esque quickfilter, though that would be hard to implement. But the nice thing about beta testing is that I can just report these things and they may end up in the finished product, so expect the released Gmail to be even better than I'm reporting.

One example of a feature that appears to be under testing is the use of chevrons to indicate whether email is sent to the user directly, or sent for example to an email list. A ">>" indicates the former, a ">" the latter. It's a good idea, but only available by explicitly turning it on in the options; I think that they need to find a better set of labels for distinguishing 'twixt the two. It's the little things like this that have a cumulative effect that'll matter to users, and I think that the nice touches such as email remaining checked even when you move it about or re-search for it are from Google's staff actually using this for their daily email, and for similar reasons to those espoused by Gruber, namely in that with UI you get what you pay for. The joy and genius of gmail is that you pay nothing except having a few unobtrusive ads shown sporadically, and so everybody wins.

If you have any questions, feel free to drop by #swhack on Freenode. It's interesting how quiet Google managed to keep this, but now that the cat's out of the bag, there are other screenshots and comments from Google employees Kevin Fox and Jason Shellen which you should've seen already.

I'll likely have a lot more to say once I've gotten even more use out of it, and the search facilities will only start to become useful with a lot more email, but by then I'm sure plenty of other beta testers will have emerged. Since I feel compelled to provide a summary, I'll just say that unlike Segway it's nowhere near disappointing, and if folders and convenience issues make you as frustrated as they used to make me, you're going to love it.

More Gmail Beta Testing


Details, details. Whilst most people are speculating about privacy concerns, spam handling capabilities, and the length of the beta phase of gmail, I'm still busy examining the minutiae of the service.

I'll wax opinionated about some of the abstract issues in a moment, but enjoy some more screenshots and commentary first:

Screenshot IV Screenshot V

The first screenshot shows one of the fun mini-features, namely the green box and arrow to the bottom right of the window indicating what mails are hidden below on the screen. The Yahoo! mail advertisement is styled grey as it's considered part of my signature given that it's below the "-- ". If you omit a standard signature, however, Gmail considers the Yahoo! advert to be quoted text and hilariously replaces it with a small javascript link that toggles its visibility. It usually does this with text quoted in the normal ">" manner.

The second screenshot is what you get when you click on "Compose email", and shows how relevant entries from your personal contacts lists are displayed in real time, updated with each character you type. The JavaScript that drives all of these features is heavily obfuscated, presumably to deter automated interaction with the interface.

One of the most common classes of questions that I've had is whether the service is suitable for professional use. Commercial business use is forbidden by the Terms of Use, but if you just want to hide the fact that you're using a free service, then the answer is less clear. Gmail doesn't append advertisements to outgoing emails, and seems unlikely ever to do so; moreover, it may not be against the Terms of Use to send email using a local email client, but that's not specifically addressed. There's an option in the "Settings" section that allows you to set the Reply-To header, but nothing that will change the From header—understandably, given the nefarious purpose for which spoofing is most often used: spam.

I'm often asked how Gmail works with spam, but I'm afraid I've nothing to report on its spam handling capabilities since I'm being careful to ensure that I get as little as possible! I've had one false positive so far, which was a "Delivery Status Notification (Failure)" from an email sent to one of my accounts with an intermittent DNS, and I promptly marked it as not spam, which is very easy in the interface.

When the Spam folder is empty, it contains the comment "Hooray, no spam here!", and indeed each empty folder has its own little witty remark in it. Even labels can be empty in a sense, as you can have labels that have not yet been associated with any conversations.

I'm relying on the search box heavily, and finding it very useful indeed already even with my small quantity of email, but given that each person has their own different email likes, dislikes, conventions, and workflow, labels will probably be very useful for some. They're allowed to contain spaces, and they apply to entire conversations/threads, not just single messages. The latter property of labels is a little disturbing to me, but it makes sense given that Gmail is highly thread oriented, which is something that takes a little getting used to. I have a feeling that one of the biggest feature requests at first will be to have an option to turn threading off, but that over time people will come to wonder how they ever did without it. It allows for some elegant little UI features, such as the fact that when you reply to the most recent message in a thread, there are no subject editing capabilities by default, but instead there's a link to "edit subject".

Another common question that I've had is as to when the service is going to come out of beta testing. I'm not a Google employee, so I have no idea. With other services such as Spymac apparently starting to offer 1GB of space too, it may be prudent for them to think about doing so soon, but Google's greater media prowess might conceivably offset other services' trying to steal the limelight.

As to privacy, this all comes back to each user's personal requirements. For some, webmail is just not going to be an option for any number of reasons—privacy perhaps foremost amongst them. It's sad that almost every single article on Gmail has been far from objective, missing the fact that everybody has to make up their own minds about whether Gmail is right for them or not. Personally, as you'll've guessed from the euphoria interspersed between my comments on the service, I'm very much enamoured with Gmail, and I find it both useful and very interesting to beta test—but I can accept that many aren't going to use it.

I hope that my elucidation of its features will help people make up their minds a little more objectively, but I suggest waiting to try it out first-hand before settling staunchly on an opinion. You'll probably find, like me, that the more you use it, the more you'll learn about your personal email habits and not just about Gmail itself.

Pyllbox: A Weblog Publishing System


Announcing the first release of pyllbox. Pyllbox is a lightweight and robust weblog publishing system written in Python, focussing on the most essential aspects of blogging and making sure that the implementation is sound. While not oriented towards the absolute novice, given that some knowledge of the Apache webserver is required, it's nevertheless a straightforward package, consisting of only a few CGIs and template files.

Pyllbox is used as the backend of miscoranda, so you can be sure that it can survive a slashdotting's worth of visitors, and it's released under GPL 2 should you want to add extensions or other augmentations.

Anything You Can Do...


I often get involved in little Secret Cabal email threads, and I thought it'd behoove me to publish the following excerpt from one of them, lightly edited:

Weblogs, syndication, XML, buzzwords, the Movable Type furore, licensing, spam filtering, politics, and the media [...]

My take on that - a lot of this stuff has got dumbed down to lowest common denominators, such as syndication as a hi tech way of delivering personal newsletters.

Right. And my problem isn't with the technology itself, but with Knuth's Nth system syndrome and people's obsession with angels on a pinhead discussions. As David Woolley put it on www-style recently:

the cycle is:

His example was for CSS and SVG, but this applies just as much, or more, to syndication formats. In fact, Atom had this entire cycle all within its own development many times! We had the Super Simple Feed Formats vs. the Extensibility Framework vs. Bray and Pilgrim's incessant domineerings vs. convoluted XHTML and so on. And if it goes to the W3C, it's going to happen all over again.

You could say that weblogs are boring too, but the problem is a larger and more complex one than that. For every blog, there's a different story, and I don't like to give credence to weblogging as an entity when that's the very thing I'm trying to avoid, so I tend not to rail against it in public. But there are certain things that as a general trend I like to steer clear of. Blogrolls, for example, are a throwback to primary school "and Jack's my second best friend, but Jamie is my bestest friend eeevar!"; no matter what the "information discovery" pretense used, I don't think there's any getting around the fact that if you spend more than five minutes devising your list, you're wasting your time.

I've maintained that the most important thing about publishing a weblog is a) the content and b) the presentation, and forgetting a maxim like that is a loss. But again, a thousand blogs, a thousand stories. I write a blog, and I even wrote the software that powers it. I enjoy reading about William's nighttime driving eyesight strangely improving, and that Danny's getting along well with Mozilla Thunderbird. Catching up with friends is alright, and there are many other valid reasons one might have a weblog. I'm sad that William and Libby don't blog nearly as often as they could. I'm sad that Deltab doesn't make his kind of wit available more widely.

So William's mentality that weblogging is conducive to having 10^10 people connected is right on, but the current blogosphere milieu is damaging to that plan. William himself doesn't follow his own espoused views because he's embarrassed by his works and he doesn't think anyone's following—that's what being a single fly in a swarm of millions will do to one, and that's no way to write. And okay I don't have an answer, and noets [sic] and wikis and all of those sorts of collaborative environments aren't quite what we need, but hopefully something will be forthcoming.

As I declared on Meatball: "Wikis are the new blogs!" And blogs are the new television. The question with each is: are the end results, new items found, worth the effort expended in sifting through the noise and signal mixture? Does the end justify the means? That's why Deltab doesn't play Minesweeper anymore, and that's why Morbus implores me to work on more interesting things. But sometimes you'll find a treasure trove ("thesaurus" means treasure, en passant) that's packed to the rafters and filled to the brim (I find it hard choosing between those sayings) with perceptive and novel information, like the T.V. Tropes and Idioms wiki.

The point is that, as the #validator topic declares: "we're people! we have feelings! Some of us!"—Some W3T drone; and as the #swhack topic declares: Better to do things than to meta-do 'em! Go and do stuff, don't meta do it.


Sean B. Palmer, 
"Anything you can do, I can do meta!"
-- Samuel Hahn, via John Cowan

Triternions and Quaternions


When I studied maths, I was interested in many topics not covered on our course. Amongst those topics not covered were complex numbers, about which I'd often ask my teacher, and he'd be happy to oblige. One day, I contrived the notion that if imaginary numbers extended the plane of numbers into two dimensions, it might be possible for another set of numbers to extend the plane out into three, or even an arbitrary number, of dimensions. I thought it'd be rather neat if there were a three dimensional equivalent of the Mandelbrot Set, for example.

Given that complex numbers are introduced via the roots of negative numbers, I asked my teacher what the root of -i was, but he said that, sadly for my scheme, he thought it was solvable within complex space. He gave me the answer next lesson, having researched it, and told me to derive the root myself, which I did. I promptly dropped the idea.
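The root in question does indeed exist in the complex plane, and since the conversation that prompted all this was about complex numbers in Python, it's easy to check there (complex literals take a j suffix):

```python
# One square root of -i is (1 - i)/sqrt(2): squaring gives
# (1 - 2i + i^2)/2 = -2i/2 = -i. The other root is its negation.
root = (1 - 1j) / 2 ** 0.5
assert abs(root * root - (-1j)) < 1e-12
```

So -i yields nothing outside the complex plane, which is why the idea had to wait for Hamilton's extra dimensions.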

Today, after a conversation about complex numbers in Python, I researched some aspects of imaginary numbers, and accidentally stumbled across the notion of quaternions and hypercomplex numbers. I quote Wikipedia's Quaternion page: "Quaternions were discovered by William Rowan Hamilton of Ireland in 1843. Hamilton was looking for ways of extending complex numbers (which can be viewed as points on a plane) to higher spatial dimensions. He could not do so for 3-dimensions, but 4-dimensions produce quaternions." So it is indeed possible after all!

But I still wondered about three-dimensional numbers, so munging the term that Hamilton had come up with, I arrived at Triternions, and Googled for it. There are a few things on Triternions, including a nice piece by Jim Muth, but nothing that explains them to a novice in the field such as myself. For example, I couldn't find an equivalent to Hamilton's "i^2 = j^2 = k^2 = ijk = -1" fundamental equation for Quaternions. If anyone knows anything further about the subject and can explain it to me, please do send me an email (sbp at miscoranda) or leave a comment on this entry. Thanks!

Do Not Adjust Your Browser


Yes, this is a miscoranda post. Good time zone independent greeting of the day or night to you, gentle reader. There's been lots going on recently, so I thought it'd behoove me to write about it somewhat, and wax indescribably on some other matters which are enticing me to write this entry.

I've been active in a lot of disparate fields lately, and bringing them all together has been difficult. I've written a few bits of interesting code—glean.py, jotbot—that if you know me well enough you'll already have seen demonstrated. I've been assisting the One Big Soup project to some extent and getting along well with Lion et al. And I've been writing three thousand word emails to Javier and Morbus about all kinds of topics from the paranormal to literature, teaching Suw python (@@ ask her for shorter archive URIs!), reading lots of interesting things, and generally keeping busy with many other tasks that're either too mundane or too personal to write about here.

So far so good. But from the point of view of a lot of fora, and most especially the Semantic Web and FOAF, I've more or less disappeared. That's something that I keep telling myself that I ought to reverse, but when it comes around to it it's rather difficult: unless you have the time to subscribe to the relevant lists and track down all the current hot topics, attend conferences and write code, you get left behind. It's a bit of a vicious circle since then you get dispirited and do less work in the field, compounding the problem.

Part of the problem is that technology in general is not looking all that attractive to me at the moment, though it always draws me back—just as well given that I'm a computer scientist. Nonetheless, I have an enormous list of seed topics that I want to investigate and write about, and this is an urge that's been increasing throughout the year. Most ironically, however, the main barrier to my doing so has been a technical one. I'm rather a Nelsonite in that I'm displeased with all the forms of knowledge capturing software that we have available to us at the moment. I don't think that wikis or blogs or Amaya etc. are anywhere near optimal for the purpose of capturing the sorts of notes that I like to take, but at the same time any effort that I make to try to invent my way out of the paper bag as it were falls flat.

One of the attempts, jotbot, I've already mentioned. Jotbot, the code for which I haven't published since I'd like a better name for it (ideas solicited!), allows people to edit a wiki via a tumbler-esque/DOM line mode interface. As the -bot suffix betrays, it runs as an IRC bot and allows you to edit from an IRC channel.

It evolved from blogbot, which was a bot that enabled you to publish to a weblog from IRC. Blogbot failed because IRC writing style doesn't translate well to smooth prose, and jotbot failed because it was overly complicated and IRC isn't good for perusing text. John Cowan had an idea about publishing emails that he sends to a blog, which I think is a great idea but it wouldn't work for me since I don't tend to capture too much of my work through email.

One potential solution I'm looking at is using caesura to indicate line boundaries in IRC input and hence reviving blogbot, but I have a feeling that it'll be unsuccessful too.

Random Utilities


I've set up a GET/POST gateway and HTTP HEAD service for people. I've also published the C equivalent of my iso-8859-1 to utf-8 conversion program as iso2utf.c, and my .shellrc file which I link my .bashrc and .zshrc to.
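The transcoding that iso2utf.c does is simple enough to sketch in a few lines of Python (this is my own illustrative sketch, not the published C program): every ISO-8859-1 byte maps to the Unicode code point of the same value, so the whole job is decode-then-encode.

```python
# A hedged sketch (not the author's iso2utf.c) of the same
# conversion: ISO-8859-1 (Latin-1) bytes in, UTF-8 bytes out.

def iso_to_utf8(data: bytes) -> bytes:
    """Transcode ISO-8859-1 bytes to UTF-8."""
    # Latin-1 byte 0xNN is always Unicode code point U+00NN,
    # so the decode step can never fail.
    return data.decode("iso-8859-1").encode("utf-8")

# 0xE9 ("é" in Latin-1) becomes the two-byte UTF-8 sequence C3 A9:
assert iso_to_utf8(b"caf\xe9") == b"caf\xc3\xa9"
```

Wired up to stdin/stdout with `sys.stdin.buffer` and `sys.stdout.buffer`, it behaves as a filter in the same way the C version presumably does.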

Enjoy. Comments welcome.

One Notation3 Tutorial, Extra Terse


The following is a Notation3 tutorial derived from IRC. It's most suited to people who know the basics of RDF and the RDF/XML serialization, but nothing or little of N3.

N3 is rather easy to get the hang of.
URIs: <http://example.org/>
bNodes: _:label
literals: "this is a literal"
qnames: ns:term
variables: ?x
To bind namespaces to prefixes:
@prefix pfx: <http://example.org/namespace#> .
Then it follows these patterns:
s p o .
s p o; p o .
s p o, o .
and by extension:
s p o; p o; p o, o, o, o, o .
The empty URI refers to the current document, hence:
<> rdf:type :Document .
As you can see, you may also use an empty prefix.
That must be explicitly bound, and must be done as follows:
@prefix : <#> .
(or to whatever URI)
bNodes can also be spelled [], in which case they have no label. For example:
[] rdf:type foaf:Person, rdfs:Resource .
The properties and objects can be put inside one of those bNodes—really that's a special case:
[ rdf:type foaf:Person, rdfs:Resource ] .
The main keyword is "a" which means rdf:type, but there's also => which is used in logic.
# Comments use hashes
RDF lists can easily be done using parens, for example:
:subj :prop ("p" "q" "r") .
which is equivalent to:
:subj :prop [ rdf:first "p"; rdf:rest [ rdf:first "q"; rdf:rest [ rdf:first "r"; rdf:rest rdf:nil ] ] ] .
analogous to rdf:parseType="Collection" in RDF/XML.
That's all you need for the rudiments.
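The list shorthand above is the one piece of the tutorial that hides real machinery, so here's a small sketch (mine, not CWM's) of how ("p" "q" "r") expands into the rdf:first/rdf:rest chain shown:

```python
# Expand an N3-style collection into rdf:first/rdf:rest triples
# over generated bNodes, terminating in rdf:nil.

import itertools

_counter = itertools.count(1)

def expand_list(items):
    """Return (head_node, triples) for an N3 collection."""
    if not items:
        return "rdf:nil", []
    head = node = f"_:b{next(_counter)}"
    triples = []
    for i, item in enumerate(items):
        rest = f"_:b{next(_counter)}" if i < len(items) - 1 else "rdf:nil"
        triples.append((node, "rdf:first", item))
        triples.append((node, "rdf:rest", rest))
        node = rest
    return head, triples

head, triples = expand_list(['"p"', '"q"', '"r"'])
for s, p, o in triples:
    print(s, p, o, ".")
```

Three items yield six triples: one rdf:first and one rdf:rest per bNode in the chain.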

I hope that's helpful for anyone approaching the daunting world of N3, SWAP, and CWM.

King Leare


From a Shakespeare Authorship FAQ:

Additionally, the diarist Phillip Henslowe recorded a performance of a King Leare in Easter of 1594. Would it not be the most straightforward conclusion that in fact an early version of Shakespeare's Lear was already kicking around during the 1590s?

No it would not. Even though I'm a mere intelligent (or "intelligent") outsider, I know that the "straightforward conclusion" is absolute bunk. It's well known that there was an (anonymous, but attributed variously to Kyd and others) old version of King Lear knocking around before Shakespeare got to his: it's even online.

As an occidental interdisciplinary enthusiast, it's pretty difficult to avoid The Bard, but I'm glad that I can study the field without having to produce any scholarly work on the subject: the precedent for Shakespearian study is quite overwhelming, with barely a stone unturned and conjecture running rampant, as above. If books on Shakespearian Punctuation are the limit, books on Marxism in Shakespeare are almost certainly over it.

Anyway, today's quote-that-really-ought-to-be-online is from H.H. Furness:

When [...] between every glance we try to comprehend each syllable that is uttered, or strain our ears to catch every measure of the heavenly harmony, or trace the subtle workings of consummate art,—that is a far different matter; therein lies many a lesson for our feeble powers; then we share with Shakespeare the joy of his meaning. But the dates of the plays are purely biographical, and have for me as much relevancy to the plays themselves as has a chemical analysis of the paper of the Folio or of the ink of the Quartos.

From his preface to A Midsummer Night's Dream, variorum edition, 1895.

Linux Community buys out Apple


The meteoric rise of the gallimaufry of Linux users and developers known as the "Linux Coalition" has today peaked with the completion of a takeover bid for Apple. Self-styled CEO of the Coalition, Eric S. Raymond, explained to journalists how the users of a free operating system were able to raise the money: "for a start, we have a lot of money left over from not having to spend out ridiculous amounts on second rate OSes. Secondly, we found quite a bit of spare change down the backs of our sofas".

But why did a community so complacently zealous about its own software need to buy out one of the world's largest computing firms? "Well", commented Bruce Perens, "we had this slight problem in the UI department. I have to go now." One of the biggest obstacles to GNU/Linux uptake was also its greatest advantage, that of its relative diversity and therefore lack of settled UI development. With this bid, it's possible that Linux may have accomplished the biggest step on its way towards world domination.

As one of the first effects of the takeover, Apple, now to be renamed Lapple (properly pronounced "lappel", and with a comical French accent), is to stop experimenting with BSD as the basis of its OS X operating system. Richard Stallman takes up the reins of the horse of explanation: "we noticed that BSD has a Levenshtein distance of only one from LSD, which is proof enough that it causes significant tripping. The Berkeley association need hardly be mentioned." Certainly one would need to be "tripping" to write the copious amounts of documentation inherent to BSD systems, and the excellence of this move cannot therefore be overstated.
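Stallman's "Levenshtein distance of only one" quip is at least checkable. A minimal dynamic-programming edit distance sketch:

```python
# Classic two-row dynamic programming for Levenshtein edit
# distance: minimum insertions, deletions, and substitutions.

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

assert levenshtein("BSD", "LSD") == 1  # one substitution: B -> L
```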

Steve Jobs was too busy pieing himself repeatedly in the face, to show that he's more of a man than Bill Gates, to comment.



Unlike most of my fellow technophilic friends, I'm not enamoured particularly with science fiction and fantasy. I abhor it, in fact, to a great extent: as a person for whom logic, clarity, truth, and process are constant goals, I can't really understand why anyone would like to indulge in reading and creating second-rate sci-fi and fantasy except as a surrogate for a lacking artistic side. Sweeping generalisations, I know, but I do in part believe it, and it sets up a nice irony given what I'm announcing here:

With Morbus Iff, I've founded a new multiplayer world-creation game with a difference: it's based on a wiki, and the aim is the creation of a lexicon for a new world which we will be defining. As the title of this post suggests, it's called Ghyll. If you want to go straight to the action, you can check out the Ghyll wiki. There's an announcement on gamegrene.com too, which provides some more details.

I've decided to dispense with the usual pleasantries of this genre of game, such as the use of a character name. Morbus, though a seasoned practitioner of this kind of thing, is doing likewise.

What I'm hoping is that the scope of Ghyll is broadly defined enough for me to explore various topics that I like to pursue external to technology. In other words, I'm hoping that it can encompass those things that I'd like to write about but don't otherwise have a good peer-review based conduit for. My very first entry is a good example of that: Andelphracian Lights is a clear derivative of my work on Anomalous Luminous Phenomena. Morbus's first entry on the other hand, Agony uncle, exposes his love of narrative and mystery. We've been wanting to produce some literature together for some time now, and Ghyll is hopefully the best possible consummation of that desire.

I'm not too sure why I use the possessive Morbus' on IRC but Morbus's elsewhere. Strictly, I believe the latter is required since the only general exception to the usual possessives rule is for archaic names ending in -es and -is. Enjoy Ghyll, anyway.

Semi-Automatic Programming


William Loughborough often writes open cabal messages in which he exposits various hopes, thoughts, and ambitions and generally tries to stir the recipients into action. He started his own weblog not so long ago after people insisted that he eat his own dog food about publishing information whenever possible, so now you can all share in the things that he has to offer. In sum, the leaves of his ideas are various—from Accessibility to Bucky and Cashmere at the start of the alphabet through Poker and Talking Signs onwards—but they all start to have an obvious trunk in connectivity, and the roots are love—with "Will B. Love" being a common nom de plume of his. (Nom de plume need not, incidentally, be italicised since it's not even used in French anymore; the common phrase now is, so I'm told, nom de guerre.)

So his latest post is about Automatic Programming, and I thought that instead of writing an open cabal letter in response, I'd reply on miscoranda and send him the URI. The obvious starting point is Google which brings one immediately to a good paper on Automatic Programming which I think gives the two main points well, even though it is a bit out of date (SETL and GIST?! bwahaha): Automatic Programming has been happening for a while, so it's getting more automatic all the time; and you can never do away with programming since programs are generally just descriptions of procedures. If you say "I want to know what seven times five is" in natural language, that's just like typing "print 7 * 5" into a Perl/Python interpreter. To use William's example, Seth still has to know what he wants to achieve; you can't just have a vague idea and hope a computer will make something of it.

I wasn't really sure what to add over these obvious points, other than to underline some of them—we should be working on continually higher-level programming languages, William should check out lisp again (great lisp programmers don't write lisp, they write lisp that writes lisp), and so on—but then I came up with an idea: since this is really an AI-complete problem, why not throw real Intelligence at it? It would be rather nice if there were a wiki devoted to having people come and ask for implementations, and have people interpret what is needed, further the ideas, and perhaps contribute code. Even I'd use it: for example, "is there an implementation of the polynomial time prime-number checking algorithm in Python?" The opportunity for diversity would be quite phenomenal, especially if you managed to incorporate decent code-segregation and versioning facilities in the wiki itself. Perhaps it's already been done, but "programming wiki" on Google doesn't yield much...
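For what it's worth, my own example request from above is the kind of thing such a wiki would collect. This sketch is not the AKS polynomial-time algorithm I asked about, just an illustrative stand-in: a Miller-Rabin primality check, which is deterministic for 64-bit integers when run with these fixed witness bases.

```python
# Miller-Rabin primality test; with this fixed base set the
# result is deterministic for all n below 2**64.

BASES = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)

def is_prime(n: int) -> bool:
    if n < 2:
        return False
    for p in BASES:                  # quick trial division
        if n % p == 0:
            return n == p
    d, s = n - 1, 0                  # write n-1 as d * 2**s
    while d % 2 == 0:
        d, s = d // 2, s + 1
    for a in BASES:                  # witness loop
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False             # a proves n composite
    return True
```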

Opportunity in the Field of Almanacs


The al- prefix of the word "almanac" points to a borrowing from Arabic, but, as the OED says, "the word occurs nowhere else as Arabic, has no etymon in the language, and its origin is uncertain." The earliest known dating is from 1267 by Roger Bacon, but the first known use in English is from Chaucer's 1391 Treatise on the Astrolabe, and actually constitutes a very recognisable definition: "A table of the verray Moeuyng of the Mone from howre to howre, every day and in every signe, after thin Almenak."

The well-established almanacs such as The Old Farmer's Almanac in America and Old Moore's Almanac (previously Vox Stellarum) in the United Kingdom and Ireland are curious for the fact that they have stuck apparently closely to their formats since their establishments in 1792 and 1700 respectively. Such has been the popularity of almanacs even in recent times that in 1943 the Dáil Éireann, a house of the Irish Parliament, debated why volumes of Old Moore's Almanac had been officially seized from stationers and defaced:

Mr. O'Donovan: Was the Minister afraid that the predictions in Old Moore's Almanac would affect the morale of our people?

Mr. Davin: Did Old Moore say when the War would end?

Mr. Cogan: Was the Minister's decision affected by the fact that Old Moore forecast the defeat of the Government at the next election?

Mr. O'Donovan: And that the stars are against him?

It's an amazement to me, then, that the Web is not as straightforwardly replete with almanacal—or almanacy, which Google seems to prefer—information as I expect it to be. There are many great sites, e.g. Heavens Above for astronomical data, BBC Weather for meteorological data, and HM Nautical Almanac Office for everything else (I love their motto: Man Is Not Lost), but in these days of info-glut and the clear superiority of sites such as Google that provide unfettered presentation of data, I really would expect there to be a single site that could incorporate all of this whilst perhaps even retaining the scrap-book style of the traditional almanacs, updated for the current millennium.

Moreover, I think that the importance of customisation cannot be overestimated. Whilst I'm not a great fan of Web-services, a Web-service-like interface to these sites would enable anyone with rudimentary programming skills to set up their own interfaces—again to use Google as an analogy, it'd be much like their XML-RPC interface, enabling people to Google from IRC or the command line.

If only such information were static, it would be easy to provide, but one's location on the surface of the planet is the key variable to almanacs. At least it's something that can be handled much more easily online than in print, but even then I doubt oddnesses such as the double sunset at Leek, Staffordshire will oft' be accounted for.

Heretofore Unpublished Whimsy


I dislike writing miscoranda entries. There are always so many choices of what to write and how to phrase it, and I usually end up sounding barely literate next to people like Aaron Swartz, so in fact most of the entries I work on I don't even bother to publish. Some of them have a string of semi-interesting ideas, but I'm rarely happy with the results.

A lot of the time, I'll try to capture a thought chain, and then realise that in the process of trying to capture it you tend to affect it too much. It's like the quantum principle of the observer always having an effect on the outcome. But to cut a long story short, I've decided to collect a handful of fragments from unpublished entries and provide them here:

On Antilinearality:

I used to rank all sorts of things into "favourites" lists, e.g. songs, books, etc., because I thought it'd help me to discover other things that I might be interested in, but it just drove me nuts because I'd find that I'd like P more than Q, Q more than R, and R more than P.

On Learning:

It's a bit of a chicken and egg thing: I have a good idea for teaching a language, but I need to learn the language first, so if only my idea were already in place, I could use it myself! And the further irritating part is that I'll never be able to test my own idea as a learner myself unless someone else does it, since once a language is learned, a language is learned.

On Art:

There's a phrase in Chimes of Freedom that goes "through the wild cathedral evening the rain unraveled tales", and it's always fascinated me because cathedral as an adjective modifying evening just doesn't make any sense at all, and yet it's clear exactly what's meant. [...]

Photorealists are extraordinarily skilled, but the nice thing about art is that it lets you create, not just mirror. [...] I kinda expect holographic painting to become popular at one point (if you've ever looked down a holographic microscope and been startled, you'll know what I mean and why), and I've also got a strange photosonic idea from years ago that I sometimes expect to see come to independent fruition.

On Invention:

I had the idea of making it into a book first of all and calling it the Art of Innovation: I'd've gone through some of the neat innovations through history (perhaps some surprising ones, too), done some of my own, and then maybe written a more general section too. But the thing about books is that they're too much effort to get published, so I'd probably just end up putting it online anyway, and then it'd be ignored for the rest of eternity, and I'd grumble. So yeah, I decided not to do the book thing.

On Semantics:

That there were Five Kings stood around in the ergative case, discussing Boyle's Law with Demetrius. And, I dreamed, the cyllowre that oversprawled them was replete with gems and stars.

There was painted a pastiche of intellectual osmoses, that within every page of a charity shop book there lurked a post-nominal, and that Camelopardus had been reinstated and a constant relation. That prostitutes with left-handed masks could go to the vote; that larchfinches ate tallow provided by the Danish government.

That Kitty's Rambles were conducted in a hermitian conjugate matrix I also dreamed—and Elizabethan underwear was made chairperson of the Operatic Society as I found a T'ai chi t'u on a desk by chance. But ine kan decheinen buochstap anyway.

The Man and his Man


I was listening to Gladys Knight and the Pips, and a line in the song reminded me of the title of this post. The Man and his Man is a rock somewhat off the coast by Perranporth, a name which sounds Celtic in origin—and was probably influenced by it; but is actually derived from "St Piran's Port" according to A Dictionary of British Place-Names (2003, A. D. Mills), and was first recorded in 1810. The use of a semi-colon in place of where an em-dash and comma would logically be required I managed to garner from The Guardian newspaper, whose punctuation algebra should be second to none.

I've published two things of such minor import that they've now been overshadowed by the very titling of this entry, but nonetheless here they are: a rules-writing-rules example in Notation3 on public-cwm-talk; and Notes & Todos: The PIM Saga, being a summary of my early efforts to produce a Personal Information Management program.

Neologistic Classicicity


As usual, I've been coining new words semi-intentionally in the general flow of daily writing, and I've managed to garner an anthology of those that I've produced within the previous few weeks. Many of them are just oddly suffixionalised, but some are quite novel and interesting; most already appear in Google, some in odd contexts, some just because my initial mentionings of them have already found their way into Googlebot's index.

With Googlecounts (as of 2004-10-17) following in parens and omitted where the Googlecount was nil, then, and in alphabetical order: aboundations, accognizanced, alchymerical (1), anticouth (1), applaudablism, assbackwardised, bewailments (16), blammage (265), bumblecrumbs, bytepool (27), classicicity (6), cogited (1), completionary (7), contradynamic, dawntime (217), delimitry (3), dwellances, explosionable (17), fabricationally (17), festivule (6), foreshawl, gesticules (407), haikuicity, hobbledygarians (1), homoeroticicity (3), intertwangled (33), leftermost (57), logichess (13), mavenite (186), messtangular, messular, metachanical (1), mirthmaker (149), obfuscationary (13), pretagnonistic, purificationary (16), renderous (12), ribaldness (130), romanticicity (1), scrapoupage, seemingry (2), suchlikeness, suffixionalised, superfluousity (78), symbraldry, terseless (2), thankfulently, tinctile (12), unreified (248), and weaveries (47).

I think my favourite of the list is "alchymerical", which came from a challenge to redraft a line that Eric Hopper had written with a certain deliberate formality with an even greater level of inherent stodge and whimsy. Here's what I came up with: "my petition is thus: it be beseeched and entreated of you in humble kindness, dear sir, that a navigational device might be forthcoming from your august possession such that I may be led unerringly through uncharted waters unto the procuration of such ineffably aetheric and alchymerical elixir as might prove the very whettance of my voluptuously epicurean inclinations".

I'm quite pleased with "scrapoupage" and "terseless" too.

On a completely unrelated matter (hey, this is miscoranda after all), Paul Mutton's putting his Genuine KiteCam case for kite photography up for sale and I told him I'd mention it since I was the one that suggested he put it up for sale in the first place. I'm thinking about calling Trading Standards since it's not clear until you get two thirds of the way down the page that this is the very same packaging that caused his camera to blaminate itself to an unworkable extent. Don't let that put you off buying though—it's a unique piece of rare memorabilia that's even been mentioned on (gasp) Slashdot!

N-Triples, rdflib, and Pyrple


I need something that'll automatically take keywords from posts and come up with a witty title for me. I turned to uuidgen and very nearly named this one 5c454c85-a989-4535-a064-7ec08065bb6c, but I thought I'd better not.

I released a new ntriples.py module a couple of days ago which parses N-Triples with high levels of efficiency and specification compliance. I wrote it as a semi-drop-in replacement for rdflib's bug-ridden N-Triples parser. I've been rather interested recently in pursuing a merging of rdflib with my own pyrple, and proposed that idea on www-rdf-interest to Daniel "eikeon" Krech.
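As an illustration of why N-Triples is so pleasant to parse (this is a hedged sketch of the general approach, far short of the real ntriples.py's specification compliance), one regular expression per term type is nearly enough for the simple cases:

```python
# Line-oriented N-Triples: one regex per term type, one per triple.

import re

URI = r'<[^>]*>'
BNODE = r'_:[A-Za-z][A-Za-z0-9]*'
LITERAL = (r'"[^"\\]*(?:\\.[^"\\]*)*"'       # quoted string with escapes
           r'(?:\^\^' + URI + r'|@[a-z][a-z0-9-]*)?')  # datatype or lang
TERM = f'(?:{URI}|{BNODE}|{LITERAL})'
TRIPLE = re.compile(rf'\s*({URI}|{BNODE})\s+({URI})\s+({TERM})\s*\.\s*$')

def parse_line(line):
    """Parse one N-Triples statement; return (s, p, o) or None."""
    if not line.strip() or line.lstrip().startswith('#'):
        return None  # blank lines and comments carry no triple
    m = TRIPLE.match(line)
    if not m:
        raise ValueError(f"malformed triple: {line!r}")
    return m.groups()

s, p, o = parse_line(
    '<http://example.org/> <http://xmlns.com/foaf/0.1/name> "miscoranda" .')
```

Escape handling inside literals and strict character-range checks are where the real work (and the bugs I was grumbling about) live.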

In a random search today, I stumbled across sqlwhois.com, on which you can find, amongst other things, which domains alphabetically sort either side of your own. On the left of miscoranda there's the NetSol parked miscnetwork.com, and to its left there's the unreachable miscorbatas.com. A little more interestingly, to inamidst's left I have inamidnightmoon.com, the site of "tenor saxophonist/flutist Jeff Andrew Simpson"; to my right I have the nicely graphiced inamien.com, whose purpose I'm unable to ascertain given that it's in Japanese. Translations welcome.

From Phonaesthemes to Theôria


As often as I invent words, the two subjects of this essay (in the Montaignian sense) are just random ones that I've garnered from studies on philology and neo-platonism. Well, not quite random: they're both phonaesthetical in themselves, and the "ph" and "th" combination is always difficult to avoid. Nor even is my idea for having a prefix denoting elementality entirely unrelated to all of this. (The five Chinese elementals of 金 木 土 水 火 usually translated as metal, wood, earth, water, and fire, might better do with some prefix along the lines of meta- or para-, except describing something closer to the Taoist philosophy.) The fact that I coined, yes there had to be at least one coinage in here, "thoun't" earlier (in "whyfore will thoun't start?") is pretty much unrelated, though, and comes from the fact that the flashblock extension for Mozilla Firefox prevents slogger from working. Gah! Also, just as William pleads you to look up "Hysteresis", I think I'll at least bid you do the same for Phonaesthemes and Theôria. And just to colophonise the miscellany: "The moon, alas, is no drinker of wine." - Li Po.

Stratégies De Langue


I am dissatisfied with writing in English, so I decided to write this article entirely in French. Unfortunately, as you may have gathered from that first sentence, my French is worse than unfathomable, so I had to turn to machine translation instead. My strategy for making it work is to write the English original as unambiguously as possible, hoping that the translation program won't then stumble over idioms, difficult words, and similar obstacles. I suspect, however, that this plan will fail, and that not only will I seem unable to speak French, but also like an idiot who is unable to speak French. If anyone would like to offer me free French lessons, they would be most gratefully received; or failing that, even an interesting bottle of French wine.

This article ought to raise a good number of questions about constructing simple subset languages, but once I start talking about technical linguistics I run into one of the difficulties of the subject: specificity increases complexity. Besides, the translation program would go nuts and make it look as though I were talking about backwards-laughing geese and monolithic cartoons of purple apothecaries, so I'll save myself the effort and not bother. I would really like to go to the Fête des Lumières in Lyon, but I'm too busy, and I don't think my French is really good enough to get me there: I'd end up in Monaco or somewhere.

I would be grateful if a native speaker could let me know how the translation turned out, please!

Classic Enjoyment


Due to my general, perhaps transitory, disdain for things contemporary, I've been tending to avoid anything written after the Second World War as much as possible. The benefit in doing so is that you get to know how it all turned out—like watching a television series on DVD for the first time instead of waiting each week for a new episode. The disadvantage is in not being able to reply to the authors.

One thing that becomes strikingly apparent is just how ludicrously recent some of the great human discoveries and inventions are. Neanderthals were discovered only in 1856, just last month for me, and Darwin only proposed his odd theory of natural selection last week. In that, at least, contemporary society is still aligned with my metaphilosophy since clearly we don't know what to make of it yet. Television is either non-existent or in its infancy, and one has to wonder whether Plutarch would've been improved any had he made the odd allusion to Frasier or South Park.

On the other hand, it also becomes striking just how advanced humanity has been on all matters except for science and invention over the past thousand or two years. Einstein wasn't wrong (of course) when he said that the ancient Greeks and Chinese had a far clearer vista of thought than we could ever hope for. It's also nice to observe patterns of cultural thinking that are predominantly successful, and conversely those things that are now retrograde to the whims of our nature. Trying to recreate leaps is better than trying to recreate or revenerate their effects, but even better is trying to recreate the seriousnessless of it.

Complaining About Explaining


People these days are very bad at explaining what they mean. I'll attempt to explain what I mean by this, but obviously you shouldn't be too harsh if I'm unable to pull it off.

Today I saw the name "Meredith" mentioned as a great writer, and since I'm always looking for interesting new folk to read, I decided to chase up the reference. Google wasn't particularly helpful on just the surname, so I thought I'd get out the heavy artillery and plump for Wikipedia instead. It didn't let me down, of course, and I got through the disambiguation page with ease to land finally upon George Meredith.

Sadly, the article itself has proven terrible for my purposes. It has a summary, a brief biography, and then a list of his prosaic and poetical works; there's nothing at all to suggest that it's a stub article, so I assume that it's not going to be tended to soon. What I want to know is the style in which he writes, which of his works are considered the greatest, how he relates to other authors in the same period, and so on. It doesn't have to go beyond the scope of an encyclopaedia entry exactly, but it should at least come up to the standards of some other entries: Alexander Pope for example gets about double the explanation. The real grumbleworthy part is that two of Meredith's novels are linked to existing articles. I presume that they're his most popular since they're the only ones to have been written about so far, but upon chasing the links I find that one (Farina) only mentions the novel as a bullet-point and contains no extra information, and the other (Vittoria) doesn't even mention the novel at all!

The underlying problem, I believe, is that people aren't taught to explain—they're taught to learn. To learn something, it's true that you have to deconstruct it to a point where you understand it, but once you've done so you forget how you picked up on that information. Often, you'll pick up something intuitively, only to find later on that you can gain a fuller appreciation of it by trying to understand it more overtly. To furnish an example, I was explaining Notation3 to a friend who was wondering about blank bNodes. Without going too much into the particulars of the situation, I had actually been overtly missing the very purpose of blank bNodes which are able to hold material inside themselves syntactically, and in explaining it to Cody I managed to convince not only him (hopefully) but myself as to the rationale of the damn thing. The allusion of saying that the bNode is spread out amongst triples helped not only him to use the language, but me to understand why it was invented that way.

But back to the root causes. University education, probably most in those sorts of courses where you're taught by a faculty of researchers and not of tutors especially, seems as good a place as any to lay some of the blame: in many fields, technical terms change depending not only on the specific area in which you're working but also in the area itself depending on by whom you're being tutored. The word "predicate" is a wonderful example, meaning as it does different things in computer science and linguistics, and in fact two things in linguistics. Of course, this is also part of the challenge of university education, for it's a time to learn independent study skills, and to contrive all kinds of ludicrous mnemonic devices for what you have to learn. My complaint is that the compulsion to elucidate, to explain, to summarise, is never taught as a separate discipline, and rarely ever taught within courses themselves. But the art of composition, of logical argument and rhetoric and such, is something that we can always do with bettering ourselves in, and it'd be nice if it were treated as such.

Astronomical Twilight


It's really tiresome, writing. Astronomical twilight must've long passed, and the Meteorological Office's forecast for my area says that the stars should be visible, but I can see the sky and no stars from where I write. Perhaps they've been superseded by a new RFC. Perhaps they're on holiday visiting the Large and Small Magellanic Clouds in the southern hemisphere. Or perhaps the Meteorological Office have, not for the first time, failed to predict the present. There are some mysteries that for a while you have to let stand.

Hypocrisies in history! A dramatic bit of alliteration for what is actually a fairly laid back subject, and I've been reading about it quite a lot recently with some delight. The main tenet is that the tension between documentary and archaeological evidence has led, and continues to lead, to some grand conjectures that prove wholly wrong even though they're supposedly based on "sound" evidence. Astronomy has had the same kind of problems, and I suppose every field of scientific discovery suffers from it to some degree. But there are some fundamental constants which you can be sure of: e.g. leisure leads to cultural development. When hunting and gathering stopped and agriculture began, one of the first things that farmers harvested were words. Eventually some bright fellow called Michel Eyquem de Montaigne, who was not a farmer, invented the essay, to my utter satisfaction and woe. Sometimes I think I'd prefer the kind of cultures you get on agar jelly to the ones to which I'm meant to belong.

Hypocrisies of science! Not as alliterative, but much more scandalous: I've been reading science of art pieces; I tire quickly at being asked to trust people writing about inspiration, masterpieces, and genius when they clearly display none of those qualities themselves. I go back to the masters instead.

Cody challenged me to work the word crapbit into this piece somewhere, and I must just congratulate myself for having done so in such a radically overt manner that he'll probably accuse me of cheating and get me to work it furtively into the next ten. "Oh", he may say, "but you were working on a piece of such magnificent and beautiful cohesion! It's such a shame to have seen it destroyed so by my simple challenge. A feat of startling grotesqueness!" On the other hand, he'll probably just plunge his head into a bowl of soup as is his wont and trademark.

Metaphysics and epistemologies may be the only things worth arguing over, but they dunarf get tedious quickly when writers upon the subject take to squabbling over the semantics of categorisation. "Metaphysics" is now a hideous misnomer. Aristotle didn't, in fact, coin it; he left the titling of his stunning sequel to the earlier Physics series in the hands of some later hacks. If he had deigned to title the work himself, he probably would've called it the "μικρή τυπωμένη ύλη", which is Greek for "Small Print". Modern metaphysics tomes really do read like the things you see in the back of cooker manuals.

Upon a very close inspection of the sky, it turns out that the stars have indeed decided that it's too cold out there and have instead tucked themselves in under a blanket of cottony cloud. Yet the Meteorological Office still states, with an air of utter authority, that the stars have done no such thing: it is clear outside. It bugs me that the people running the country probably put on odd socks most days, but at least that's a fair representation of people in general.

Findall Utility


There are some things that grep just can't do. If you want to count how many occurrences of a name are in a document, for example, grep won't work since that name may appear more than once on the same line. Moreover, there are many things for which it's useful to actually extract the content of what you're searching for, not just the line: you can scrape for URIs, emails, server requests in logs, and so on. To that end, I've had a little script for doing just that lying around, and I rewrote it last night and published it: findall.
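The difference is easy to demonstrate with Python's re module, which is presumably the sort of thing findall is built on (a toy illustration, not the findall source; the sample text and the name "Cody" are invented for the example):

```python
import re

text = "Cody said hi to Cody; Cody waved.\nThen Cody left.\n"

# What grep -c 'Cody' would report: the number of matching lines.
line_count = sum(1 for line in text.splitlines() if re.search('Cody', line))

# What findall reports: every occurrence, several per line included.
match_count = len(re.findall('Cody', text))

# line_count is 2, but match_count is 4.
```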

It's a simple tool, and would probably be better written in Perl, but anyway it allows you to do some interesting stuff. For example, if you want to count the number of words in a document you can do something like:

$ findall '\w+' document.txt | wc -l

You can also use one of several built-in regular expressions selectable using the -t flag, and get a list of the tokens available using --list. So for example, to get all the URLs from your server logs:

$ findall -t URL *.log | sort -u

It's also good for doing quick hack server statistics, for example:

$ findall 'GET \S+' *.log | sort | uniq -c | sort -rn | head -20

And, used in conjunction with the repres tool that I also published last night, for getting non-ASCII byte ranges out of documents:

$ GET http://www.w3.org/ | findall -t hi-bytes | \
   sort -u | findall -s ' ' '.*' | repres
'\xc2\xa0  \xc2\xa9  \xc2\xae  \xc2\xb7  \xe2\x80\x94  \n'

Note the use of the -s flag therein, which is the sepchar used to stick everything back together again; by default it's a newline character. This means you can use it to rip the newlines out of files quite easily, or even do crude CRLF conversions:

$ echo 'blargh\nhmm\nsomething\n' | findall -g -s '' . | repres

$ echo 'blargh\nhmm\nsomething\n' | findall -g -s '\r\n' . | repres

The -g flag enables it to simulate grep. The trailing newline is always output, though it would be easy to add a flag to suppress that. If you want to do a regexp over multiple lines, you have to use the -m flag:

$ findall -m -t html-comment <<EOF
> <html>
> <!--
> blargh
> -->
> </html>
> EOF

For such a simple tool, it's rather uncannily useful.



I've just released a new Notation3 parser called n3proc. It comes with a fairly extensive test suite, and is divided into three parts: a metaparser (n3mp.py), a parser (n3p.py), and a processor (n3proc.py).

The whole suite is based on TimBL's N3 Grammar work, so the parser is able to generate an event stream from N3 files using the standard RDF BNF. Based on that stream, n3proc.py can convert N3 into triples, and in its default command-line mode it prints out NTriples on stdout. If you want to try it out, you can get the tarball (n3p.tar.gz), and try one of the things listed in the release announcement on public-cwm-talk.

It requires rdflib for the metaparser and pyrple for the test suite, but n3proc itself only requires Python 2.3 or later. Talking about pyrple, apparently Pytypus is using it; one of the first applications of it not made by me that I know of.



The n3proc test suite is currently up to sixty-five test cases, and now covers some particularly tricky corner cases (especially formulae-08 and paths-06). At the moment, there's even a test for "@this" which uses it in the sense of referring to the current formula, though that functionality doesn't really exist; it could be fulfilled using either <>!log:semantics or a new keyword such as @current, but that has a lot of potential associated discussion.

In the Wikipedia article on Thomas Gainsborough, there's a wonderful quote of his that runs:

I'm sick of Portraits, and wish very much to take my viol-da-gam and walk off to some sweet village, where I can paint landskips and enjoy the fag end of life in quietness and ease.

"Viol-da-gam" and "fag end" kept me amused for quite a while in figuring out their meanings. The ever-magnificent Jon Hanna provided me with the meaning of the latter, saying that it constitutes the "frayed bit of cloth or rope". On the other hand, viol-da-gam doesn't appear anywhere else on Google but that quote. I had the idea of search for viol-de-gam instead, and the only usage that came back was from Shakespeare; from Twelfth Night in the First Folio, thanks to the line breaks:

 To. Fie, that you'l say so: he playes o'th Viol-de-gam-
boys, and speaks three or four languages word for word
without booke, & hath all the good gifts of nature.

Long story short, it's from the Italian "viola da gamba", a leg-viol, and the OED defines it as a "viol held between the legs of the player while being played; in later use restricted to the bass viol corresponding to the modern violoncello." First usage? 1597.

Also, I'd like to apologise to Cody for thinking for so long that his wonderful jottage logo was in fact just a USELESS BLACK BLOB. Sorry!

On Toki Pona


As much respect as I have for Sonja Kisa, I have to say that one particular feature of Toki Pona has violated my expectations of a language so much that I have managed to devise a language law or aphorism for it: that any language which requires more than five words to say the noun "duck" is clearly quite dazzlingly broken.

Rather like Old English, Toki Pona is limited in its range of nouns—in fact, there are only 118 words overall—which means that you have to use a phrasal circumlocution to denote most objects. For example, a housemate is jan pi tomo sama, i.e. person-of-habitation-same. I can see how such a word doesn't deserve a radical, and I can fathom the rationale and the free philosophy behind a language based on the odd juxtaposition of Taoism and the Sapir-Whorf Hypothesis (there's a-whole-nother essay), but what I cannot in the least bit comprehend or find in the least defensible is that startling omission from the core vocabulary of so primitive and ubiquitous a thing as the common duck.

It took me a while to find out how Toki Pona speakers talk about ducks. It was first suggested that telo waso (water-bird) be used, but how poor is a language that cannot distinguish 'twixt duck and swan. Soon enough however, thanks to Google, I was led to a Toki Pona translation of Monty Python and the Holy Grail, which includes the following:

ARTHUR: A duck.
jan Asa: waso li awen lon sewi telo.

Therefore... Toki Pona is witchcraft! Er, wait, no. Therefore, duck in this translator's bemuddlefraught mind is waso li awen lon sewi telo—literally bird-that-resides-in-above-water, or, as Cody put it, "bird that stays on the water". Which still doesn't differentiate it from a swan, and moreover doesn't provide any more information than telo waso. I must say that the language's free-flowing fun philosophy comes over more as a desperate grasping for War and Peace sized phrases to describe what should be endearingly simple concepts. This simple language, by an overly forced simplicity, is anything but.

On the other hand, it isn't all doom and gloom: for example, there are two separate forms for saying "goodbye"; one for the person that is leaving, and one for the person that's staying behind. This seems to be a problem more pertinent for users of realtime chat than anyone else, since often one person's leaving will inspire others to do the same, prompting no one to know who's still around and who isn't. It's a bit like how some eastern languages, by structuring numbers as "two-ten-one" for 21, allow children to learn to manipulate numbers more easily, since the numbers are already divided up into neat little sections. Seventeen minus nine is a calculator job, whereas one-ten-seven minus nine can be done by taking the nine from the ten and adding to the seven: eight. But just try adding ducks up in Toki Pona!



Three new pieces of code, each independent of one another: FOAFCite, datauri.py, and rdfdb.py.

Oddly enough, these hacks do find use from time to time. For example, crschmidt is using ntriples.py, and JimH is using wypy. So much Symbian hacking.

The term "bric-à-brac" was first used by William Makepeace Thackeray in 1840. According to the famous French lexicographer Paul-Émile Littré, it's derived from the phrase "de bric et de broc", i.e. by hook or by crook. Nobody's sure what by hook or crook originally meant, but it was first used by John Wycliffe in his Controversial Tracts around 1380.

Incidentally, this entry was written over two consecutive days, and was originally only meant to announce FOAFCite and datauri.py, but since I took so long in writing it I was able to develop rdfdb.py meanwhile.

Stealth RDF


It turns out that I'm a habitual RDF API component writer, even when I'm trying to avoid it. I'm meant to be working out the details of a pyrple/rdflib chimaera with Daniel Krech, but a couple of RDF parsers here, a database module and isomorphism utility there, and suddenly there's enough components from which to assemble an entirelyish new package. Monty would probably say that it's a bit of a MontyHasBeenHackedBySBP thing to do, but then that's just him.

The isomorphism utility (rdfdiff.py) hasn't actually been incorporated yet, which is odd because it's my favourite pick of the crop so far—especially since I'm still waiting for a reply from Jeremy Carroll. It's only 1618 bytes, but takes a little over a minute on this box to parse and compare two 100,000-triple, eleven-megabyte files. By comparison, multistep.sh takes four minutes and doesn't yet pass the ntc tests. If somebody could port rdfdiff.py to Redland/C, that'd be wonderful.
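The gist of the technique can be shown in miniature (a toy sketch, not rdfdiff.py itself, and graphs with perfectly symmetric bNodes need more care than this): bNode labels mustn't count, so each bNode gets a signature computed purely from its structural surroundings, refined over a few rounds, and two graphs compare equal iff their relabelled triples do.

```python
# Toy bNode-insensitive comparison of triple sets (not rdfdiff.py itself).

def is_bnode(term):
   return term.startswith('_:')

def signatures(triples, rounds=4):
   # Each bNode's signature is built only from the terms around it,
   # never from its own label, and refined over a few rounds.
   sig = dict((t, '_') for triple in triples for t in triple if is_bnode(t))
   for _ in range(rounds):
      new = {}
      for b in sig:
         env = sorted(repr((sig.get(s, s), sig.get(p, p), sig.get(o, o),
                            s == b, p == b, o == b))
                      for (s, p, o) in triples if b in (s, p, o))
         new[b] = 'sig%d' % hash(tuple(env))
      sig = new
   return sig

def canonical(triples):
   # Relabel bNodes by signature, then sort into a comparable form.
   sig = signatures(triples)
   return sorted((sig.get(s, s), sig.get(p, p), sig.get(o, o))
                 for (s, p, o) in triples)

def isomorphic(g, h):
   return canonical(g) == canonical(h)
```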

I've also managed to coin yet another syntactically consistent English word that has three e's in a row, though the first, agreeeth, has been used a handful of times before. The new one is ergativeee, which is fairly comical if you're into grammatical case jokes—and since you're reading miscoranda I'll assume that there's a fair chance that you are. (Update: I later realised that ergative + -ee = ergativee, and John Cowan and Kevin P. Reid also pointed out the mistake—thanks! I'm leaving it in so as not to be a revisionist, but also because I was close: if the -e on the end of ergative wasn't silent...)

Flibbertigibbet and Purre


I've just published Flibbertigibbet & Purre, which is "the chronicle of how I started out researching the word 'flibbertigibbet' and ended up finding a selcouth pun of Shakespeare's from King Lear that's lain undiscovered by all but one or two people since 1603, amongst other things". Enjoy.

RSS 1.1


I'm pleased to announce the release of the RSS 1.1 Specification. This is a bugfix version of RSS 1.0 that I've developed with Christopher Schmidt and Cody Woodard which utilises more up-to-date features of RDF and XML, as well as implementing sundry architectural and i18n improvements.

To coincide with the release, we're also providing a number of implementations including an RSS 1.1 Feed Validator; an online RSS Converter which can turn RSS 1.0 into 1.1, and 1.1 into XHTML; and a Rough Guide to RSS 1.1 which contains links to various other materials including our test suite and implementations for Movable Type and Wordpress.

Feedback would be very much appreciated, and you can contact the authors at either of the addresses given in the specification, by joining #rss1.1 on the irc.freenode.net network, or by emailing the rss-dev mailing list if you want to talk about RSS 1.1 in a general context. You can also leave a comment on this post.

Keeping up with the Swartzes


2000 Aaron Swartz starts working on RSS 1.0.
2005 Sean B. Palmer starts working on RSS 1.1.

2001: Aaron Swartz joins RDF development team.
2001: Sean B. Palmer joins WAI PF cross-review team.

2001: Aaron Swartz co-founds Swhack with Sean B. Palmer.
2001: Sean B. Palmer co-founds Swhack with Aaron Swartz.

2002: Aaron Swartz begins work on Creative Commons.
2002: Sean B. Palmer completes work on EARL.

2002: Aaron Swartz starts darkly humourous diary-based weblog.
2003: Sean B. Palmer starts lightly boring dairy-based weblog.

2004: Aaron Swartz starts at Stanford.
2004: Sean B. Palmer eats a packet of crisps. (At university.)

2004: Aaron Swartz receives profile with photo in Wired.

(Yes, this is a parody.)

Toki Pona Translation


Toki Pona is a small invented language that I wrote about recently in a fairly critical manner. I therefore had to expect that the Law of Criticising Invented Languages, which dictates that someone will reply to you in that language in the attempt to force you to learn it, would be upheld, and it was. Someone calling themselves Jan Wasolitawa replied to my essay with the following:

mi pilin e ni: sina toki ike e toki pona. ni li ike.
"waso telo" li nimi pona. tan seme la sina wile e ni: nimi mute pi waso? ni li pakala.
mi pilin e ni: sina sona ala e nasin pi toki pona. mi toki e ni: nasin pona li pona. sina sona ala sona e toki mi?
P.S. toki pi nanpa suli li ike mute. o weka e nanpa suli. ni li nasin mani! nasin mani li ike! ni li pakala!
Jan Wasolitawa

Thankfully I have a friend who's learning Toki Pona, and this was a good opportunity for him to test his skills. At the same time, I decided to race him by attempting an automatic translation using a couple of Python scripts. He beat me both in speed and quality, but we still think that the script is close enough to Babelfish quality to warrant release, so I've made a Toki Pona to English Translation Service using it (with source and nimi). Here's one of the ways in which it translates the comment I got:

I emotion such that: you your language negative such language good. this is negative.
"bird liquid" is name simplicity. from which it's said you your to want such that: word very belonging to winged animal? that is blunder.
I emotion such this: you your knowledge no such way belonging to talking good. I language such that: manner good is simplicity. you your wisdom not knowledge such language I?
P.S. language belonging to number -th tall is negative many. O hey! away such number -th tall. that is way material wealth! way money is bad! that is blunder!
People Wasolitawa
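Roughly speaking, that sort of output falls out of glossing word by word against the nimi table (this is a toy reconstruction for illustration, not the published scripts, and the gloss table is a hand-picked fragment):

```python
# A toy word-for-word glossing pass (not the published scripts; the
# table below is a tiny hand-picked fragment of the nimi word list).
GLOSSES = {
   'mi': 'I', 'sina': 'you', 'li': 'is', 'e': 'such',
   'ni': 'this', 'ike': 'negative', 'pona': 'good',
   'toki': 'language', 'nimi': 'name', 'waso': 'bird',
   'telo': 'liquid', 'pilin': 'emotion', 'pakala': 'blunder',
}

def gloss(sentence):
   # Substitute known words; anything unknown passes through unchanged,
   # which is roughly how proper nouns like "Wasolitawa" survive.
   words = sentence.strip('.').split()
   return ' '.join(GLOSSES.get(w, w) for w in words) + '.'
```

So gloss('ni li ike.') comes out as "this is negative.", just as above; resolving Toki Pona's ambiguity is left as an exercise for the reader.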

It actually came out much better than I expected. Toki Pona's ambiguity is through the roof, so it's only possible to get a summary understanding even if you speak the language. That's the nature of the beast. As Cody put it:

In order to speak Toki Pona correctly, you have to imagine yourself on an island, totally separate from all civilization, roughing it jungle style and such. Thus a flashlight would really be a "box of light".
Jan Nasacody (Cody Woodard)

As for the comment itself, our combined understanding of it from both manual and automatic translations is that J. Wasolitawa doesn't like my criticism of the language and thinks that I want too much from it. One of the best bits of translation was Cody's "there's been a clashing of zens"; my zen isn't the same as Toki Pona's zen. Well I refute that: I think that the concept is fine, but the implementation leaves a lot to be desired, as can hopefully be seen from our combined translation effort. The fact remains that English allows one to get much closer to Toki Pona nature via its expansive vocabulary, because in Toki Pona you have to be circumlocutionary, whereas in English you can use words such as "circumlocutionary" to sum up what'd be a whole paragraph of repeated words in Toki Pona whilst still missing the point.

And we think the P.S. is something to do with numbers being a nasty side-effect of commerce. Which is just wonderful. How else can one count how many butterflies have been seen on a particular day but with numbers? If I've misunderstood the Toki Pona way and that way is actually to eradicate every single thing of meaning and value to us, then perhaps there is a clashing of zens after all. After all, as D.T. Suzuki said, "Zen is not nihilism". But I don't think Toki Pona is either: it's quite pretty sometimes.

Python Meta-Decorators


It's a common idiom in Python's decorator syntax to return a nested wrapper function from the decorator. The following decorator function, for example, wraps the function it decorates in a simple exception catcher:

def safenise(func):
   def wrapper(*args):
      try: func(*args)
      except Exception, e:
         print 'Error: %s (%s)' % (e.__class__, e)
   return wrapper

But that's messy. The whole point of decorator functions is that you can extract common idioms to be shared by a suite of functions, and since decorator functions themselves have this common idiom, why not create a meta-decorator for them? So I spent a while puzzling it over and created exactly that:

def decorfunc(decorator):
   def wrapper(func):
      return lambda *args: decorator(func, *args)
   return wrapper

Essentially what it'll do when you apply it to a decorator function is change the argument structure from (func) to (func, *args), and allow you to omit the wrapper function. Here's how it'd be used with the safenise example:

>>> @decorfunc
... def safenise(func, *args):
...    try: func(*args)
...    except Exception, e:
...       print '%s (%s)' % (e.__class__, e)
>>> @safenise
... def barf():
...    print 1/0
>>> barf()
exceptions.ZeroDivisionError (integer division or modulo by zero)

Now, the cool thing is that since decorfunc itself follows the decorfunc idiom, we can rewrite it using itself:

@decorfunc
def decorfunc(func, *args):
   return lambda *wargs: func(*(args + wargs))

Then safenise and barf can be declared in exactly the same way as above, with the same results. This could, of course, be done an unbounded number of times, but all you'd get is a lot of pointless recursion, so it's just a demonstration of its meta-decorator nature. Writing it using itself is a bit like when the C compiler was first able to compile its own source code, or when you start using an editor you've written to edit its own source.
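To check concretely that the two versions are interchangeable (the v1/v2 names and the doubled decorator here are invented purely for the demonstration):

```python
# The original meta-decorator...
def decorfunc_v1(decorator):
   def wrapper(func):
      return lambda *args: decorator(func, *args)
   return wrapper

# ...and the self-hosted rewrite, decorated by the original.
@decorfunc_v1
def decorfunc_v2(func, *args):
   return lambda *wargs: func(*(args + wargs))

# A toy (func, *args)-style decorator to exercise them both.
def doubled(func, *args):
   return func(*args) * 2

f1 = decorfunc_v1(doubled)(lambda x: x + 1)
f2 = decorfunc_v2(doubled)(lambda x: x + 1)

# Both give (3 + 1) * 2 == 8.
```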

Here's how the example in the decorator syntax documentation would be written using the meta-decorator:

>>> @decorfunc
... def require_int(func, arg):
...    assert isinstance(arg, int)
...    return func(arg)
>>> @require_int
... def p1(arg):
...    print arg
>>> p1(1)
1
>>> p1('1')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "<stdin>", line 3, in <lambda>
  File "<stdin>", line 3, in require_int
AssertionError

Perhaps it'd be even better using the proposed Optimal Syntax for Python Decorators, but it's easy to think of decorators as inheritance for functions, and "using" mightn't be the best keyword to reflect that.



No matter how miscellaneaful this wotsit is (you can tell this is going to be a post replete with awesome), I still feel that chatting about dodo cloning techniques is a bit beyond its remit and calibre, so I've made a new weblog/wiki called eph to house that sort of thing.

The interface is both IRC and HTTP based to enable me to write material as I go along and then polish it up later—as I often want to do when chatting with someone about an idea, for example. The eph backend, pwk, is mostly stolen from my own pwyky, though the script that produces the RSS feed is new and filled to the brim with the DOM Technologies™.

The idea for posting to a weblog from IRC has been around for a while, and Christopher Schmidt is even using my old noets code, which I've now written about on eph, to drive his own little IRC-powered miniblog; but combining it with a wiki so that posts can then be easily appended to and edited later is unique as far as I know.

Highlights, if you can call them that, of eph at the moment include Heroes and Heroines, The Project Name Problem, and The Magical Snoobird. Though the thing has an RSS feed, I really wouldn't bother subscribing to it, quite frankly. If I write anything half decent there, I'll mention it here too.

Python Method Meta-Decorators


Decorators in Python allow you to do something similar to class inheritance but for functions. One of the most common decorator idioms you'll come across is that of wrapping a function to provide some functionality such as trapping exceptions, or type checking the arguments; so common, in fact, that I wrote recently about a decorator for it which I've termed a meta-decorator.

The meta-decorator I wrote before only worked for functions, however, not methods. So you couldn't do calls to self, which is quite a pain sometimes. If you're wanting to trap exceptions, for example, in a class you might want to call an error(...) method that makes an appropriate callback. To that end, I've modified the original decorfunc meta-decorator to produce one that will work for methods:

def decormethod(decorator):
   def wrapper(method):
      return (lambda self, *args:
              decorator(self, method, *args))
   return wrapper

This is all a bit abstract, but hopefully the following example will make it very clear how it can be used to good effect:

class Blargh(object):
   @decormethod
   def safenise(self, func, *args):
      try: func(self, *args)
      except Exception, e: self.error(e)

   def error(self, e):
      print 'E: %s' % e

   @safenise
   def barf(self):
      print 1/0

To give the simple explanation again, it's like the barf method is inheriting from the safenise method: any time it's called, if there's an error then a call will be made to self.error; and that's every time barf is called, since 1/0 can't be done in Python:

>>> b = Blargh()
>>> b.barf()
E: integer division or modulo by zero

So just like the function meta-decorator, decormethod modifies the argument structure, this time to receive (self, func, *args), so that you can make calls to self.

If only the decorator syntax were a little clearer, this might be easier to write about...



Most large editors are very configurable, and ones as large as emacs even have their own built-in programming languages to allow people to make extensions. But this isn't the ideal design. It's just a consequence of a small project becoming larger and larger, with more and more post facto features and bloat added on. This means that the wheel is being reinvented big time whenever yet another large editor project is started.

So what about an editor that's properly designed from the start? What about an editor that's modularised for extensibility in a real language instead of having a nonce language written into it? What about having a stable core, and then a range of peripheral modules catering for extensibility and major project forks?

It's with this in mind that I created the emano editor, and completely and purposely failed to meet the aims stated above. Instead of being a core of C with SWIG bindings for Python, Perl, and Ruby as I envisage for the larger project, emano is a flexible terminal editor written purely in Python. But it's still a very useful step on the way to that larger vision.

The idea behind emano is to marry nano's elegance with emacs's power, and so far it beats nano for elegance whilst failing of course to beat emacs for power. But the secondary idea behind emano is that it forms a proving ground for experimental features that should take a greater role in all editors. For example, smart commenting, easy selection, and grepping for search and navigation—all of these are already implemented in emano.

The documentation is sparse, but a few people are using and enjoying it. It's quite stable, but be wary of editing important documents. It will back up documents for you, a feature which will later become optional, but it would be a good idea to use keylogging too.

Please do try it out, and let me know how you get on with it. I'd also like to hear of any other editors that are either similar to emano, or conform more closely to the idea that I've sketched above.

Euphonic English Campaign


‘Publicly’ is surpassed only by ‘subtlety’ as English's most abominable word. But whilst we're lumbered with the latter, for the former we have a fine and upstanding alternative in ‘publically’. Or we would do if the word wasn't so unfittingly maligned—a fate we can, thankfully, spare it with just a brief look at the history of the word and the logic behind its use.

In 1567 the adverb ‘publikely’ first made its appearance. It stayed that way for about a century until ‘publiquely’ and ‘publickly’ arose in the late 17th and early 18th centuries. The form ‘publicly’ didn't make its appearance until 1855, nearly three hundred years after the first form. Sixty-five years later, we shifted once again to ‘publically’. It was Edith Sitwell who gifted it to us.

But in so-called edited English, ‘publicly’ remains standard. Why? Because of the usual pernicious mixture of prescriptivism and ignorance. Spelling became fixed in English when ‘publicly’ was at the zenith in its vogue, and the simple combination of adjective + "-ly" suffix seems eminently logical. But neither spellings nor meanings are ever entirely fixed in language, and moreover ‘publicly’ isn't as logically consistent as it may first appear.

The OED, after stating that "-al" is often used to form secondary adjectives, notes a cornucopia of adverbs that no longer have their "-al" counterparts. The adverb, it says, "is almost always in -ically even when only the adj. in -ic is in current use, as in athletically, hypnotically, phlegmatically, rustically, scenically." And you can add to that asyndeton my own canonical retort of ‘basically’, as in "publicly is basicly an abomination". Please, let publically be.

Everybody interested in linguistics has their own pet hates, and whilst ‘publicly’ is the most prominent of mine, I do have some others that are perhaps less well founded. One is of pronunciation: when ‘secreted’ and ‘appendix’ are spoken in their senses of hide away and section at the end of a book, I prefer SEE-kruht-id and APP-uhn-dix over the usual seh-KREE-tid and app-EHN-dix. Sometimes my tastes even change. I used to prefer ‘dipthong’ as a descriptivist, but now I prefer ‘diphthong’ as a euphonist.

So I think the linguists of the world should unite and form a counterpart to the Plain English Campaign: the Euphonic English Campaign. And its slogan, I suggest with apologies, should be "sounds good to us".

Python Sonnet


Michael Spencer recently wrote a limerick in Python on the comp.lang.python newsgroup. I was challenged to write one in sonnet form, which I did in about an hour. I've used Shakespearean rhyme scheme, and the result is an actually useful program that gives the wordcount for some HTML input (usage: python wcsonnet.py filename). Only the alphanumerics in each line count towards the syllable total and metre:

from re import compile as regess
from re import sub as regsubstitute
doc = r'(?m)(<!DOCTYPE[\t\n\r ]+\S+[^\[]+?(\[[^\]]+?\])?\s*>)'
pin = regess(r'<\?(\S+)[\t\n\r ]+(([^\?]+|\?(?!>))*)\?>(?P<mute>)')

def stripPI(up): return pin.sub(' ', up)
def strid(down): return regess(doc).sub(' ', down)
def stric(sup): return regsubstitute('<!--(?:[^-]+|-(?!-))+-->', '', sup)
def strit(frown): return regsubstitute('<[^>]+>', '', frown)

def doTheThingWeHaveToDo(*perplexed):
   import sys; text = open(sys.argv[1]).read()
   WithThisWeProceed = stripPI(stric(text))
   BazBarFoo = strit(strid(WithThisWeProceed))

   print >> sys.stdout, len(str.split(BazBarFoo))
if __name__=="__main__": doTheThingWeHaveToDo()

Five hundred bonus points for anyone who can make an even more useful program using Petrarchan rhyme scheme.

Points of Interest


Shakespeare's Birthday (More or Less)


Today, April 23rd, is probably Shakespeare's birthday. There's no direct record of his birth, but his baptism was recorded on the 26th and this usually occurred three days after birth. This is also the traditional day upon which it's celebrated, and were his birthday to become an international holiday as it should be, then I'm sure the 23rd would be chosen (which means we'd get St. George's Day off in England). Still, it just goes to show that there are many enduring mysteries in the chronology of Shakespeare.

A lot is made about Shakespeare's "lost years", the period between 1583 and 1589 where we have little or no record about what he did, but I think this mystery is oversold. We want to know what he was doing in those years mainly because we want to know the genesis of his talents, but to assume that there was some tremendous event that occurred in those years which made him into the genius we all know is likely just wrong.

There's an anonymous pamphlet of 1605 that speaks of "some that have gone to London very meanly", the wider context of which, including a reference to Hamlet, shows that this pertains to Shakespeare. Whilst we shouldn't read into it too much, it's easy to derive from this swipe that Shakespeare merely came to London to seek his fortune—we know that he was a very astute and successful businessman—so for example the story that he had to flee because he stole deer from Sir Thomas Lucy is almost certainly out the window, and I think anything else that's of similar scandalousness is unlikely too.

That isn't to say that Shakespeare wasn't... somewhat of a goer, and the only decent contemporary anecdote we have of him was of him entertaining a lady in the city, but again he doesn't seem to have ever stayed away from Stratford entirely for any large period at all. He frequently made trips back to buy property and visit his family, so if he were forced out he would've stayed out. Sir William d'Avenant claimed that Shakespeare would often stay in his family's inn, the Crown Tavern, in Oxford on the journeys.

The Hoghton theory, that Shakespeare was a schoolmaster in his very young years, strikes me as odd but plausible, and may well account for the genesis that people are looking for: but even then you can't take it as accounting for the period directly after his marriage and directly before coming to London. That Anne was pregnant as they were getting married, and again with twins a couple of years later, points to the fact that he may have been kept busy in those years by quite the most obvious cause.

Google History Permalinker


Google's Search History facility stops you from being able to copy search result links for pasting into emails, code, and so on, because the links now have extra tracking information in them to record which links you've followed. I now find myself having to paste each link into an editor and chop it down, which is considerably irritating.

So I fixed the problem by writing plinker.user.js, a piece of Javascript utilising the Greasemonkey (see? copied that URI from Google) Firefox extension. There are quick instructions at the top of the script, and you'll need, obviously, to have Firefox and Greasemonkey installed to use it. What it does is add a little permalink, containing the actual URI, before each of the search results; you can then right-click and copy it, or even left-click it should you want to temporarily bypass Google's tracking for some reason.

It might seem like too heavy-duty a solution, but the lack of a normal link is significantly jarring, as if someone made you sing Agadoo each time before you turned on a lightbulb. Eventually you'd just give up or go insane. The script should work on local Googles too, though I haven't tested that; Search History doesn't even seem to work there yet. Feel free to email me or leave a comment on this post if you have any questions or feature ideas.

(Third update: This behaviour still hasn't been fully fixed by Google, though they did change to using Javascript to track the links. David James has written an XMLHttpRequest version of the link hider, which is what I'd expect Google to implement eventually.)

Persistent Modular Storage


For a quick and easy way to save interpreter state across sessions in Python, wouldn't it be nice if module attributes set within the session persisted? With the permod.py module, they do:

>>> import permod, time
>>> permod.now = time.time()
>>> reload(permod)
<module 'permod' (built-in)>
>>> permod.now

It works by storing attributes written to it in a shelve database, which by default is housed at permod.data in the same directory as the module itself—though you can easily change that behaviour. Since it uses shelve, you can store anything that the pickle module can handle.

Something I noticed when writing permod.py was that of all the nice new-style types that Python has in its builtins list, it's missing one strange exception: the module type. You still have to use types.ModuleType at the moment, which seems odd when the types module could otherwise probably be deprecated.

The Oxyrhynchus Misreportings


The Independent are peddling more badly-researched, even deliberately misleading, material on the Oxyrhynchus finds, this time from Tom Anderson:

A newly discovered fragment of the oldest surviving copy of the New Testament indicates that, as far as the Antichrist goes, theologians, scholars, heavy metal groups, and television evangelists have got the wrong number. Instead of 666, it's actually the far less ominous 616.
Revelation! 666 is not the number of the beast

There are two levels to the misreporting malefaction here: i) the fragment discovered was published way back in 1999; and ii) the number 616 has actually been known as an alternative for the number of the beast since its origination, and never forgotten. The discovery of the papyrus, catalogue number P. Oxy. LXVI 4499, over five years ago merely added another important early reference.

The evidence for the first point is all over the web. Peter M. Head reviewed the document in the 51st Tyndale Bulletin in 2000, and the official Oxyrhynchus project site at the University of Oxford even has a page about the 616 mention, last modified 20th August 2004, containing an image of the papyrus. There are a handful of other mentions strewn across the web too. Even if yet another fragment has been found with the same alternate number, which I find no evidence of, the omission of facts is still contemptible.

For evidence of the second point, we can turn to A Key to Christian Origins written by Dr. Paul Lewis Couchoud back in 1932, and as requoted by Christopher C. Warren:

The figure 616 is given in one of the two best manuscripts, C (Codex Ephraimi Rescriptus, Paris), by the Latin version of Tyconius (DCXVI, ed. Souter in the Journal of Theology, SE, April 1913), and by an ancient Armenian version (ed. Conybaere, 1907). Irenaeus knew about it [the 616 reading], but did not adopt it (Haer. v.30,3), Jerome adopted it (De Monogramm., ed. Dom G Morin in the Rev. Benedictine, 1903).

Google uncovered that much, but there's plenty more. Jon Hanna mentions, for instance, that Robert Graves wrote about the number in his book, The White Goddess, published in 1948. For an obscure fact on the role of gematria in Christian eschatology, it sure seems to get around.

Jon Hanna argues that the bad reporting is “to allow those who hold no significance to 666 to laugh at the expense of those who do”, but under-reporting or lying doesn't seem like the best way to go about that. If this were just an isolated incident of five-year-old news being reported as new, we could perhaps brush it aside. But it's just the latest in a series of bad reporting on Oxyrhynchus by the Independent, for which the paper has been chided by many papyrologists, and on which the best exposé so far written is that of a New York Sun article, with further coverage from Ars Technica.

That the Independent have jazzed up stories to make them sell is nothing to write home about, but to have plainly misrepresented such an academic topic when those who are most interested in it can and have easily found out the truth of the matter is at least rather baffling.

Blue Moon


Ever get tired of high-volume, low-quality news? Surely everyone does: you open up your news aggregator every day, then manually sort through the 90% or more of dross predicted by Sturgeon's Law just to find something that you're really interested in. Sites are starting to get a little smarter at offering what you want—Google News lets you put custom news searches on their news page, and weblogs allow you to catch up on specific friends' activities—but a huge part of the problem is that common media is generally just churned out as a low-grade commodity.

So Terje Bless writes about a new site that he has in mind, a site which by its very nature would provide a focal point for high quality news items, reversing the trend of low-grade crap. The general tenets are: i) the field doesn't actually matter as much as people think it does, as long as the material is both well investigated and well presented; ii) it's the people reporting that matter, and the people will naturally congregate around certain subject areas anyway; iii) it's the quality of the output, not the quantity, that matters. And point three gives rise to the name of Blue Moon, and the particulars that Terje goes into in his article as linked above.

I'm not particularly concerned about the technical workings at the moment; I'm looking for large societal backing. So I'm mainly looking at you, Kevin Reid, you, John Cowan, and all the other great people I know to be reading; but if anyone is interested in the ideas expressed herein, we'd like to hear from you too. Either leave a comment on this post, or visit the swhack IRC channel.

It's the Oxyrhynchus story previously covered here on miscoranda that set this into gear, but really anything could have done it. In fact, Terje was musing just a few months ago whether he "could persuade sbp to run an 'Etymology/Source/Trivia of The Day' blog for all the wonderful nuggets that get alluded to on Miscoranda..." From that nudge, I started writing a post about guillemets which I haven't yet finished, but the principle has been carried through to Blue Moon.

Blue Moon would combine the best aspects of wikis, weblogs, periodicals, and ezines by stealing editability from wikis, ease of posting from weblogs, identity and careful composition from periodicals, and the wide distribution in various formats from zines that weblogs are slowly starting to make redundant. If there is critical mass of a group of people dedicated to interesting linguistics and computer science trivia, why shouldn't we package up what we're discussing on a day to day basis in a more public-friendly format, and then cherry pick the best parts for a single site?

If there's one thing I've learned from TNS, miscoranda, eph, numerous wiki installations, &c., it's that it's not the technology behind a site that counts, but the amount of interest there is and the volition of the publisher or publishers to keep coming up with good material. John Cowan still tells many wonderful stories on IRC that wouldn't go amiss on his weblog, and Kevin Reid went from December to April without posting in his—worse than me! But the community feeling of sharing anecdotes is such that we'll talk endlessly about these topics on IRC, and Terje seems to be wondering (and I'm certainly wondering) why we can't translate that into a more static showcase, something that others could enjoy on a periodic basis far more easily than trawling through thousands of lines of logged conversation.

A blue moon can mean many things, but most commonly it's taken to be the third full moon in a season, or the second full moon in a month. The term was first used in 1528, according to Wikipedia, and we all know the great Rodgers and Hart composition as sung by Billie Holiday, of course. I don't really expect that we'll get the levels of involvement required for such a project to succeed, but I think the idea is certainly worth pitching out there just in case.



The hard drive of the server that miscoranda.com was previously hosted on, vorpal, hit a slight technical snag recently, causing it to grind itself into dust. It has seriously quite pulverised itself. All of the newly recovered material is, therefore, temporarily being hosted courtesy of the eleemosynary Christopher Schmidt on his server athena, which he also let me name.

As for poor vorpal, the name has been retired and The Benefactor is naming its replacement manxome, for obvious reasons. Though the name is rather good and there's really not much other choice (such a shame we're not up to galumphing in the poem), it's a bit hard to remember. Indeed, the person who wrote the server's MOTD seems to have thought it was called "manxone". I'm currently using the mnemonic of a house on the Isle of Man.

The sites inamidst.com and swhack.com were also hosted on vorpal—along with aaronsw.com, notabug.com, blogspace.com, zpedia.com, and probably others that I don't know about—so there's been quite a bit of hmming all round. For the sites I maintain, I've managed to recover all of the important parts from plenteous local backups, Google Cache, and the Web Archive, but I think Aaron's having a much more difficult job with his. There's a plea on his newly restored NYTimes Link Generator that you can check out if you want to help.

This is the first drive that Aaron's ever had fail, and though I doubt that's any consolation, it's pretty impressive given the number of drives he's got.

Anyway, there may be a little miscoranda.com downtime as I migrate from athena to manxome, though that won't happen for a while yet as Aaron's still setting up the box. Having vorpal go is rather a sad event, but it's reminded me that having practically unlimited space and bandwidth for free from friends is something I really should be thanking them for more often, hence the following very public thanking: Thanks Aaron! Thanks Christopher!

Ice and Snow


Various news items, none of which are particularly related to ice and snow. This is late springtime, after all (Daniel Biddle, the Australian Government, and I all agree that Spring is Mar-May, Summer Jun-Aug, Autumn Sep-Nov, and Winter Dec-Feb. Well, the opposite in Australia, but you know what I mean).

Serving Planet Swhack:

Though the Blue Moon discussions haven't amounted to much yet, one thing that did come out of it was planet.swhack.com. This is an aggregate of all the Swhackers' weblogs, and, as such, kicks all kinds of ass. If you've been following the drive situation, you'll be interested to know that Planet Swhack is on athena. The rest of swhack.com is, however, on manxome. Both miscoranda.com and phenny are back on manxome too, whereas inamidst.com is to be load balanced across both manxome and athena. Is that all clear?

One of my strict rules for inamidst.com is that it's entirely static: the copy that I have locally and the copy on the server should match exactly, and that meant that recovering it was a simple case of transferring the local copy to the new servers. On the other hand, it does mean that various services I'd like to run dynamically would require me to run unison instead of rsync, which is a bit too much hassle when you're syncing between three drives as it is (and more when manxome goes RAID).

Keywords and Meta-Databases:

I wrote a 2000 word rambling post recently on Christopher's noets installation about keywords and databases for metainformation. It's a bit of a painful podcast-like piece of crap to read, but may be worth it if you're as fed up with hierarchical file systems as I am.

The idea is that you impose an abstract filesystem layer, based on RFC 822-style header tagging, over the normal filesystem. Then you create interfaces that enable people to search for data using a conjunction of the two. This would work really well if more filesystems supported extended attributes out of the box.
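As a rough sketch of what I mean (the field names here are invented for illustration, and this isn't the noets code), the metadata layer is little more than RFC 822 parsing plus a conjunctive filter:

```python
# Sketch: file metadata lives in RFC 822-style headers, and a query
# is a conjunction of header constraints. Hypothetical field names.
from email.parser import Parser

def parse_meta(text):
    """Parse RFC 822-style headers into a dict of lowercased fields."""
    msg = Parser().parsestr(text)
    return {k.lower(): v for k, v in msg.items()}

def matches(meta, **constraints):
    """True if every constraint value appears among the comma-separated
    values of the corresponding header field."""
    return all(value in meta.get(field, '').split(', ')
               for field, value in constraints.items())
```

So a file tagged "Keywords: python, rdf" would satisfy matches(meta, keywords='rdf') but not matches(meta, keywords='perl'), and the interface is free to combine that with ordinary path-based lookup.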

Graphing Emails:

I'm pretty bad when it comes to emailing people, half the problem being that I don't have a good grasp of email etiquette. That's especially true for knowing how long to wait before sending a follow-up email to someone who hasn't replied—if you should send a follow-up at all.

So I thought about graphing the response times that I get from people that I've already sent emails to. It should be easy to extract from an email database all of the threads and dates, and then for each person that I know I'd have a nice probability graph. For people that I've not emailed before, I could average all the graphs I currently have, or perhaps create a new graph out of all the first emails that I've sent to other people.

Then it'd just be a case of setting a particular threshold and trying to adjust that based on how comfortable people seem with the follow-up time. It'll likely vary wildly, but it'd still be better than the miss-and-miss system I've got going at the moment.
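The threshold step might be as simple as taking a high percentile of a correspondent's past reply delays, something like the following sketch (the data shape is hypothetical; extracting the delays from a real mailbox is the actual work):

```python
def followup_threshold(delays_days, percentile=0.9):
    """How long to wait before following up: the given percentile of
    the delays (in days) this correspondent has taken to reply before."""
    ordered = sorted(delays_days)
    # Clamp the index so a small sample still yields a sensible answer
    index = min(int(len(ordered) * percentile), len(ordered) - 1)
    return ordered[index]
```

For someone whose past replies took 1, 2, 3, 4, and 10 days, the 90th percentile says to wait ten days, which already feels more principled than guessing.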

[Insert standard plea for anyone who already knows of such a system to let me know here.]

Imkrozh - An English Cipher


I recently came up with a cipher of English that, whilst still being very obfuscated, looks as though it's an Indo-European language. The idea was to map voiced phonemes to unvoiced ones and vice versa, and to rshift vowels along the sequence "aeiou".

If English had a more regular orthography it would work a lot better, of course, so it might be fun to port the cipher to another language. In any case, the first sentence I said in Imkrozh, "Hirru thili, cem emyputy dirr whed o'n zeyomk?", was successfully deciphered by Morbus Iff; and the person who inspired it to some extent, who sometimes conflates voiced and voiceless phonemes, was also the first to fully decipher it.

I wrote a small script, imkrozh.py, that can convert from English to Imkrozh and vice versa. Though it's nothing more than just a play-cipher, it might prove to be a good puzzler to spring on linguist-type friends.

For reference, here's the first paragraph of this post in Imkrozh: O licimdry ceni ab woth e cobhil uv imkrozh thed, whorzd zdorr piomk fily upvazcedit, ruugz ez thuakh od'z em omtu ialubiem remkaeki. Thi otie wez du neb fuocit bhuminiz du amfuocit umiz emt foci filze; emt du lzhovd fuwirz erumk thi zikwimci "eioua".
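For the curious, the core of the cipher fits in a few lines. This letter-level table is reverse-engineered from the sample above rather than taken from imkrozh.py, and it ignores the digraphs (the real cipher works on phonemes, which is why "th" maps to "th"):

```python
# Swap voiced and unvoiced consonant letters, pair the nasals and the
# liquids, and rshift vowels along "aeiou", wrapping u back to a.
PAIRS = [('b', 'p'), ('d', 't'), ('g', 'k'), ('v', 'f'),
         ('z', 's'), ('m', 'n'), ('l', 'r')]
VOWELS = "aeiou"

TABLE = {}
for x, y in PAIRS:
    TABLE[x], TABLE[y] = y, x
for i, v in enumerate(VOWELS):
    TABLE[v] = VOWELS[(i + 1) % len(VOWELS)]

def encipher(text):
    """Apply the table letter by letter. The consonant swaps are their
    own inverse; deciphering the vowels means shifting the other way."""
    return ''.join(TABLE.get(c, c) for c in text.lower())
```

Checked against the sample above, encipher("map voiced") does give "neb fuocit".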

List-Dictionary Hybrid


The Listdict class implemented in listdict.py is a port of the array-dict type from Javascript to Python. It has the properties of both Python lists and Python dicts, as is illustrated by the following code:

>>> from listdict import Listdict
>>> arr = Listdict({5: 'chicken', 7: 'chickette'})
>>> len(arr)
>>> arr[5], arr[15]
('chicken', None)
>>> arr[5:8]
{0: 'chicken', 1: None, 2: 'chickette'}
>>> arr.pop()
>>> len(arr)
>>> arr['hmm'] = 'heh'
>>> arr.has_key(5)
>>> 'chicken' in arr
>>> 'heh' in arr
>>> arr.sort()
>>> arr.reverse()
>>> arr.truncate(2)
>>> print arr
{0: 'chicken', 1: None, 'hmm': 'heh'}

There's a test suite in the module itself which displays some of the other functionality of the Listdict class. The original idea was inspired by Jim Ley, who pointed out the lack of this useful feature, prompting me to implement it, which I did in about an hour with commentary to the #svg channel.

Update: the code previously required Python 2.4, but it now works in Python 2.3 with the exception of sorting and reversing. The code is in the same place. Many thanks to Dave Pawson for the nudge.
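For anyone who just wants the flavour of it without downloading the module, the essential Javascript-array behaviour can be approximated in a few lines. This is my sketch, not listdict.py itself, and it leaves out pop, sort, truncate, and the rest:

```python
class ListdictSketch(dict):
    """Sketch of a dict with Javascript-array-like integer keys."""

    def __len__(self):
        # Like a Javascript array's .length: one more than the highest
        # integer key, however sparse the contents are
        ints = [k for k in self.keys() if isinstance(k, int)]
        return max(ints) + 1 if ints else 0

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Slices come back renumbered from zero, like a fresh array
            return ListdictSketch(
                (i, self.get(k))
                for i, k in enumerate(range(key.start, key.stop)))
        return self.get(key)  # missing indices read as None
```

With that, ListdictSketch({5: 'chicken', 7: 'chickette'}) has length 8, index 15 reads as None, and a [5:8] slice gives {0: 'chicken', 1: None, 2: 'chickette'}, as in the transcript above.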

Origins of Jabberwocky


Lewis Carroll wrote the first stanza of Jabberwocky for his family periodical, Mischmasch, in the late 1850s—some twenty years before the Alice books. It's thought that Menella Bute Smedley's translation of The Shepherd of the Giant Mountains in 1846 forms part of the inspiration for the poem, but something that appears prominent to me is the connection with two lines from the opening scene of Hamlet. The similarity is remarkable:

The graves stood tenantless, and the sheeted dead
Did squeak and gibber in the Roman streets:
Hamlet; Act I, Scene i
'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
Jabberwocky; Stanza I

But oddly enough, this similarity appears not to be mentioned anywhere else on the entire web. Does anyone know of any comparison drawn between these two elsewhere? Really the only difference is that the terms have been Carrollised and the verse is in tetrameter instead of pentameter. Perhaps this is just so obvious that it doesn't bear pointing out?

Link in a Soupstack


The problem with getting links from HTML is that the HTML you find lying about on the web is often quite broken—with broken being here defined as "that which Python's sgmllib can't parse". I wrote a little script called getlinks.py that extracts all of the links from an HTML file, but had to rewrite it almost immediately to take care of a page which had a comma after a <meta> element attribute value. I would've thought that sgmllib could cope with that, but I had to write a regular expression screen scraper instead. It's a pretty good screen scraper, though: it even properly ignores comments and CDATA sections.
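The heart of such a scraper is small. This isn't getlinks.py itself, just a sketch of the approach: strip comments and CDATA sections first so that commented-out markup can't contribute links, then pull out href values with a deliberately forgiving pattern:

```python
import re

# Remove regions that shouldn't yield links before scanning for them
COMMENT = re.compile(r'<!--.*?-->', re.S)
CDATA = re.compile(r'<!\[CDATA\[.*?\]\]>', re.S)
# Forgiving href matcher: quoted or unquoted attribute values
HREF = re.compile(r'''<a\s[^>]*?href\s*=\s*["']?([^"'\s>]+)''', re.I)

def getlinks(html):
    """Extract href values from possibly broken HTML."""
    html = COMMENT.sub('', html)
    html = CDATA.sub('', html)
    return HREF.findall(html)
```

Being regex-based, it happily swallows stray commas and other debris that would derail a real parser, which is exactly the point.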

John Cowan's TagSoup software is meant to do something like the above. It processes the HTML input, no matter how bad it is, and provides a series of regular SAX events. Its one problem is that it's written in Java, and JC seems to be in the not-quite-but-almost soliciting a port stage. I'm thinking about it, using a variant of the approach that getlinks.py takes. It's in its very early stages at the moment, but it would be nice to compare with TagSoup; and if it doesn't compare favourably, I may even just port TagSoup as it is, presuming that Terje doesn't beat me to it in Perl.

I wrote getlinks.py some time ago, but it's been lurking in my development folder until now, waiting to be published. Quite a few other files are lurking in that development area, though I've been going through it quite a bit today. One of the problems is that I like to be sure that I'm happy with the URI I'm publishing to, so that it'll be appropriately cool in the TimBLian sense. But I don't like hierarchical filesystems, so one of the things that I've been developing is a meta-database that lets me add arbitrary metadata to all of my published files. One of the properties is "keywords", which means I can sort my files using a kind of virtual folder setup.

The benefits that it's brought about already, such as being able to automatically generate sitemaps etc., are enough to make me think that it's a valuable system, but it's got a way to go yet. I've even written a little shell interface to it, so that I can augment the metadata properties of the file in a fairly transparent manner. All of the actual meta-database content is just RFC 822-style headers in regular files, anyway, so it's all recoverable in the normal way.

I know that I already wrote about this a little in a previous entry, but the benefits are incrementally obvious and have been building quite a bit since then. For example, I was able to assign <priority> values to my Google Sitemap based partly on which tags I'd assigned to a document. I suppose it's somewhat folksonomical, though folksonomies are quite arbitrary, community-based, and useless, whereas with this system each keyword that I assign has a specific effect around the site: from altering the sitemap priority to appearing in some index to having highlighting added, and so on.

Duck Egg Blue


The Old World robin, of the family Muscicapidae, lays eggs of various colours, but according to the BBC, all have "a whitish ground colour, [and] a speckling of reddish or reddish-brown spots". The New World robin, on the other hand, of the amusingly named Turdus genus, lays eggs with a very striking turquoise colour. So much so that the crayon company Crayola has, since around 1993, produced a crayon called Robin's Egg Blue. Crayola's robin's egg blue (#00CCCC) is much darker than the robin's egg blue I've used for miscoranda's redesign (#E6F2F2), since the darker colour is much too overpowering on the screen.

According to Wikipedia's article on the name Robin, "In England it is generally regarded as a male name, although it is sometimes given to females. In the U.S., it is more popular as a female name than a male one." This is somewhat surprising, since the European Robin is much smaller, sleeker, and cuter; one would think it'd be a female name in England. Perhaps in some other parts of Europe it is.

John Cowan wonders whether the connection of Robin to Robert (etymology of robin: 1549, shortening of Robin Redbreast (c.1450), from O.Fr. Robin, personal name, dim. of Robert) is remembered in the U.K. to a greater extent than the U.S.

John also noted that he's "fond of the bit in The Once and Future King where the Wart (the eponymous hero) hears of 'Robin 'ood', asks who he is, and is told, 'Nay, since th'art a scollard, tha must speak his name scollardly', i.e. 'Robin Wood'", which got us wondering about the word scollard. The OED doesn't list the word, but does contain "scollardicall", which then references sense 3.c. of "scholar", which is "Freq. in vulgar or dial. form scholard, schollard, etc."

In the United Kingdom, it is more traditional to refer to robin's egg blue as duck egg blue. One such duck that lays blue eggs is the Dutch Hookbill, which was referred to "in Willughby's Ornithologie in 1678", but may have an eastern origin. I wonder what kinds of avian blues there are in Chinese?

The Nibtrick Penomenon


Earlier this year I set up a new weblog to write on various antiquarian, astronomical, linguistic, and historical matters, leaving miscoranda static until I figured out what to do with it. Meanwhile the registration for miscoranda.com was lapsing, and an unknown benefactor paid for it, preventing me from shutting the domain down, as I had been planning.

I'm quite glad now that it's still around, because the other weblog has spawned a further, much more involved project that's becoming increasingly difficult to coördinate, with the result that I'm not really writing prose on a consistent basis. So I'm hoping that the new duck egg blue theme for miscoranda will inspire many insipidlessnesses, notwithstanding the rather convoluted history of technological and linguistic mishmashing that this weblog now has.

My latest schema or manifesto for what to produce here is converging upon the thought of just getting back into writing, so the technological and linguistic mishmashing will undoubtedly continue at least for a little while. All the same, the Other Place has been such an interesting project that it will probably have some influence here, as it already has done with the previous duck egg blue entry. Indeed, the whole cleanliness of the design and backend is based on the lessons learned from the Other Place, though the new design itself is based upon the Pyllbox homepage design—Pyllbox being, somewhat ironically, the software that used to power miscoranda. Now I've just cobbled a few Python scripts together and I run them on the server side using a Makefile so that the pages are basically static.

So, to not oversystematise too much, expect to see more of the same crap for a while until something else interests me. I'm rather hoping to continue the antiquarian themes through to miscoranda, though my writing on that subject tends to be highly information rich and therefore hard to digest. I could learn a thing or two from Schoenbaum &c.

Every so often I like to throw in some words of the day too, so today's words are: neorxenawange (Old English), penomenon (by Mina Loy), nibtricks (by Chris Onstad), unblammoable, and the twinling acronyms AIYCA and AIOCO. Oh, and twinling, which of course harks to twyndyllyng. And symphysy, for that matter.

Emano and Editing Trends


It's been just over a year since I wrote about emano, the text editor that I'm hacking on, but I'm still editing this entry in nano and not entirely enjoying the experience. Indeed, my development process for emano is such that I only work on it when I'm sufficiently annoyed with whatever editor I happen to be using at the moment—and I feel myself getting close to having another emano coding stint.

The next stage that I want to work on is organising the code roughly as it'll be when I release it as a package, doing more test-driven coding, and writing more documentation. In other words, the things usually tacked on as an afterthought, but that really should be done whilst the coding is going on. The latest code isn't public because I haven't done the reorganisation yet, so that gives me more incentive to clean it up.

I've been using vi, or rather vim, a lot recently and working on some scripts for it—and it hasn't dissuaded me in the least from carrying on with emano. It has made me think about key bindings. The aim is to have the most useful set of key bindings, but since this varies across both person and even time it's nigh on impossible to reach the optimum.

But at least there seems to be a general tendency (or is this just me?) to invest too much time into setting up key bindings, macros, or other shortcuts that then deliver less than the investment. It's easy to learn a key combination, but it's not easy to have it become a reflex—it doesn't just take a long time, but messes you up until you've learned it.

I'd like to set up some kind of task analyser that measures the most frequent editing procedures that an individual takes, and then identifies the trends that, over time, would be the most effective to make shortcuts from. That way at least it wouldn't be random, and the next step would be to measure what the all-important frequency threshold is for shortcutising a procedure.
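The counting half of that analyser is the easy bit; here's a toy sketch (command names invented) that tallies short command sequences and returns the ones frequent enough to be shortcut candidates:

```python
from collections import Counter

def frequent_sequences(commands, length=2, threshold=3):
    """Return command sequences of the given length occurring at least
    `threshold` times, most frequent first."""
    grams = Counter(tuple(commands[i:i + length])
                    for i in range(len(commands) - length + 1))
    return [seq for seq, n in grams.most_common() if n >= threshold]
```

The hard ninety percent, of course, is instrumenting the editor to log the commands in the first place, and then finding out empirically where the threshold should sit.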

But that's a lot of work. The ghetto solution is to incorporate lots of tried and trusted key combinations from other editors. In nano, Ctrl+K is probably my most frequently used command (cut line to clipboard, appending to the clipboard if the line's been taken in sequence). In vi, I tend to use "Go" in normal mode quite often (go to the end of the document and then insert a new line below the current, i.e. last, line).

I prefer nano's reliance on a strong set of simple Ctrl mappings, but having one of those be a Ctrl+O vi simulation (i.e. do a single command in normal mode), might be a good idea. It's a shame that Ctrl is a bit awkward on Apple laptops—perhaps there's a way to bind to the Apple command key. As for emacs's Ctrl+backsplat+wacky+boombox combinations, I figure I'd like to foist various stuff onto programmable Alt combinations. So you'd have a set of Alt+keyname combinations, and then you could use Alt+number to switch between them, or some such contrivance.

I'd also investigate having the default Alt bindings be slightly vi like. For example, Alt+l could be yank line; Alt+p yank paragraph; Alt+o append a new line below the current line. Obviously, I like the vi mentality but I really despise modes. And I like emacs's power but I really despise clutter. And I really like nano's robustness, but I really despise its spartanity. I'd really like emano to sit in the middle of those three editors, perhaps with some throw-in features from the less well-known, more exotic editors.

The Floredelise


The first recorded use of the word fleur-de-lis in English was in 752 as "flour-de-lys", and refers to the Iris rather than the heraldic device. Both the word and the device have a long and interesting history, including over a dozen variants of the spelling, and the fact that the fleur-de-lis as a sign of England's claim to the nation of France was taken out of the Royal Standard as late as 1801, by George III. I had decided to use a Unicode symbol as miscoranda's latest logo, and U+269C (also known as "FLEUR-DE-LIS", to give it its full Unicode name) fit the bill wonderfully, notwithstanding strong competition from U+3020, U+2604, and, since it's winter, U+2603.

Though fleur-de-lys has quite a few spelling variants, it doesn't come anywhere near to popinjay, the old word for a parrot, which could variously be papageye, papeiai, papeiaie, papeiay, papeigai, papeioy, papengay, papenioye, papgay, papiaye, papingeay, papiniay, papyniay, popegaye, popeiay, popeiaye, popengay, popingay, popyngay, and so on. The form "popyngay" looks rather close to puffin, which seems like a reasonable cognate given that puffins are the parrots of the sea, but the OED is rather unsure about the word puffin. It first appeared in 1337 as poffoun, but since the word is from Cornwall and the Isles of Scilly there is some hint of a Celtic origin for the word. Indeed, Andrew Breeze noted recently in Notes and Queries that Breton has "pochan" as a cognate, reinforcing that theory.

Both popinjay and puffin are, this week (of 2006-01-02), free to look up in the OED through the BBC's Balderdash and Piffle series. On the back of that, Kragen Sitaker mentioned his OED Interface using the old public domain version of the dictionary from back when it was called "A new English dictionary on historical principles; founded mainly on the materials collected by the Philological society". Daniel Biddle found the term ablewhackets, also spelled abelwhackets, therein, which is a kind of card game that sailors used to play where the loser was beaten by a knotted handkerchief.

Speaking of sailors' knots, Mark Shoulson discovered the word "cuntlines" in Ashley's Book of Knots (it's also in Steel's Elements and Practice of Rigging and Seamanship), which refers to the grooves in a rope as it's twisted around. This word is of uncertain derivation, though the prefix cunt- is a variant of cont-, which may be a variant of cant in its original sense of nook, which gave rise to "one of the side-pieces in the head of a cask" and some other nautical derivations.

To complete the symbolry, the current favicon for miscoranda is taken from the header of page 149 of the First Folio of Shakespeare, i.e. the italic M in "A Midſommer nights Dreame." It's been enhanced a little to change the background from beige to white and adjust the weight of the character, but otherwise it's per the original. The odd thing about it is that it looks very modern, even when you try to forget that it was cut in 1623 or before. So I uploaded the heading of page 149 from the First Folio, and then ran it through What The Font?, the online font identifier program. At first it replied "Bad WTF response: (empty) Last request: SEARCH 0000044f43bcbb83000c562b00004316 30 200 2 ? 10 [etc.]", so I thought that perhaps the font was too ancient after all, but on a retry it came up with the fairly close Van Dijck MT Italic, the too modern but still beautiful Baskerville Nr1 SB-Ita, and the not so good match but still very antiquarian looking P22 Mayflower Italic Regular.



Thanks to some character encoding and translation fun (okay, rigmarole), I was able to chat with a Russian guy today in his native language, even though he knew very little English and I knew only two or three words of Russian.

My IRC client interprets all incoming bytes as cp1252, the standard Windows character set. This is usually annoying, but on this occasion it proved helpful since the guy dropped by unannounced in a public channel and started typing words such as "ÎÁÄÅÀÓØ". I've been through this sort of thing before in other channels, so I was able to guess from experience that the character encoding was iso-8859-5, a Cyrillic charset. In fact, he was using the other popular Cyrillic charset, koi8-r, and he understood enough of what I was saying in English to inform me of that.

Then I recoded his koi8-r as utf-8 and spat it back at him. He realised that I was suggesting we use utf-8, and he switched to utf-8 for us. I was at least able then to read the Cyrillic in the HTML logs, but I still had no idea what it meant. When this has happened in the past, I've Googled for Russian-to-English translation services but not found any decent ones. But I thought I might as well give it another go, and I came across PROMPT-Online, which turned out to be awesome.

So then I was able to conduct the conversation as follows. The Russian guy would type a line to me. I would copy that line, in utf-8 wrongly interpreted as cp1252, and paste it into my Encoding Normaliser service with the option u-cp1252 to utf-8. Then I'd copy the result and paste it into the Russian-to-English translation form to find out what he'd said. When I wanted to paste a reply to him, I'd type my response into the English-to-Russian conversion form, and copy the result into the Encoding Normaliser again, this time selecting - to pyraw, which prints out escaped byte sequences such as \xc3\x8e\xc3, etc. Then I'd take that and put it into Python, saying something like "print '\xc3\x8e\xc3[...]'", which would print out the bytes for me, ready to copy into IRC and send.
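
In modern Python the two recoding steps can be sketched directly, without the web services; the function names here are my own, not those of any service mentioned above:

```python
# A sketch, in modern Python, of the two recoding steps: undoing
# UTF-8 misread as cp1252, and printing escaped UTF-8 byte sequences.
# (Function names are mine, invented for illustration.)

def fix_mojibake(garbled):
    """Undo UTF-8 text that was wrongly interpreted as cp1252."""
    return garbled.encode("cp1252").decode("utf-8")

def escape_bytes(text):
    """Render text as escaped UTF-8 byte sequences, like \\xc3\\x8e..."""
    return "".join("\\x%02x" % b for b in text.encode("utf-8"))

# Example: "прикольно" sent as UTF-8 but displayed as cp1252
garbled = "прикольно".encode("utf-8").decode("cp1252")
restored = fix_mojibake(garbled)
```

The round trip works because every byte in the UTF-8 encoding of the Cyrillic text happens to be defined in cp1252, so the misreading is losslessly reversible.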

This made for a rather slow conversation, but it worked! Barely a word of one another's language understood, and we still managed to communicate pretty well—to about the level that two 10-year-old pen pals would communicate, I would say. One of the factors is that you have to phrase things in extremely simple terms for the translation service so that your meaning doesn't get mangled; but not only do you have to be simple, you have to be translation-unambiguous. In other words, you have to avoid words that have multiple senses. One additional technique that I used was to put synonyms in parentheses after particularly difficult or ambiguous words.

I suppose for French and similar languages, this process would be no big deal, something not even worth commenting on, and so really the main barrier was the accessibility problems that Russian poses to non-Cyrillic language speakers. I'm really glad that it was possible to overcome it. Incidentally, прикольно means "cool". The translation service didn't know that, so I Googled for it instead.

Chronologies of Shakespeare's Plays


The exact chronology of Shakespeare's plays is unknown, but that hasn't stopped plenty of scholars over the years from refining our knowledge of the publication dates. Sadly, even after centuries of effort by the most learned historians, our knowledge remains fuzzy. Each chronology is an approximation, and as such, editions of Shakespeare's works ordered by date vary. But although we'll never be sure about the exact dates, not least because many of the plays were rewritten and revised through the years, we could do much better in presenting the knowledge that we do have.

Chronologies of Shakespeare's plays tend to take one of two forms: explicit lists of when the plays were written, normally with date ranges; and implicit lists of plays, sometimes with dates in prefatory notes. When I wanted to illustrate my chronology of Shakespeare's life with a chronology of his plays, I commissioned Cody Woodard to make a graph of the plays for me, with the spans running from the terminus a quo (earliest possible date of composition) to the terminus ad quem (latest possible date). This turned out pretty well, but it's possible to go further.

Not only were the plays rewritten and revised throughout the years, but some of the plays were almost certainly collaborations, and even the plays entirely by Shakespeare took time to write. Some may have been unfinished projects from many years earlier that he then reached a breakthrough on. Nobody really knows, but there is usually a body of evidence that can pin down the dates of the plays, internal and external.

Internal evidence falls into two categories, that of internal style and internal references. Style refers to the kind of play that is being written, and its rhetoric; you can generally tell, for example, that A Midsummer Night's Dream and Romeo and Juliet are of the same period because their language and their plots are so similar. Same with, say, the great tragedies later on, and the historical series. Internal references are allusions to current events, ideas, people, and other things of the time. Some may be extremely brief and tenuous allusions, and of course it's difficult to tell whether some are later additions. But they can, nonetheless, be very helpful.

External evidence is mainly what other people have written about the plays. This can range from when the play is registered for copyright to diarised performances of the plays to notes scrawled in the margins of books owned by the Elizabethan literati. Usually, the printing date of a play is used as the absolute terminus ad quem. But sometimes there is infallible earlier evidence, such as the fact that Henry VIII was said to have been performed only two or three times before the performance that burned down the Globe Theatre in 1613—so it must have been finished by then, completed probably not long before. Incidentally, Henry VIII was known only as All Is True until the publication of the First Folio in 1623.

So there are shades of evidence, and the shades of evidence quite possibly reflect what is a very chronologically complicated process, that of the writing of Shakespeare's plays. Yet it's common to represent this merely by saying that, for example, play N was written between dates P and Q. My graphing idea was really no better than that, but I was thinking towards better pastures: for example, each type of evidence could be noted along the date range line. The believability and quality of the evidence could be noted by the weight of the line; and in books each piece of evidence could be footnoted, or, ideally, online each piece of evidence could be hyperlinked, or set to expand when hovered over. SVG would be ideal for this sort of thing, though it'd be a fairly big task.

Elif Outside Try, For, and While


In Python, you can put an else statement after try-except, for, and while. After a try-except, it'll be executed if there was no exception; after a for or a while it'll be executed if there was no break out of the loop. For example, this allows the following fairly common idiom:

for whatever in something:
   if meetsCondition(whatever):
      result = whatever
      break
else:
   print "Couldn't find something that meets the condition"
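
The for form is the one shown here, but the try and while forms mentioned above behave analogously; a couple of toy sketches of my own:

```python
# Toy sketches of the other two else forms: try/else and while/else.

def parse_int(s):
    # try/else: the else clause runs only if no exception was raised
    try:
        n = int(s)
    except ValueError:
        return None
    else:
        return n

def smallest_divisor(n):
    # while/else: the else clause runs only if the loop never hit a break
    d = 2
    while d * d <= n:
        if n % d == 0:
            break
        d += 1
    else:
        return n  # no divisor found up to sqrt(n): n is prime
    return d
```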

Unfortunately, it's not possible to use elif instead. I think that this is an oversight, and it'd be a lot easier for beginners to learn the language if try-except, for, and while just acted as though they were if variants. It would allow the example above to be extended to something like the following:

for whatever in something:
   if meetsCondition(whatever):
      result = whatever
      break
elif options.verbose:
   print "Couldn't find something that meets the condition"

Instead of doing:

else:
   if options.verbose:
      print "Couldn't find something that meets the condition"

That nested form is ugly and inconsistent. The change would only require a very trivial addition to the grammar, and it complements the try-except reforms proposed in PEP 341 nicely. See also the conversation that led to these musings.

⚜ ⚜ ⚜

Incidentally, Kevin Reid (for whom the section-dividing fleurs-de-lis above are intended, though not many fonts seem to contain them) pointed out that the else syntax above is not all that intuitive anyway, and that in the case of a for loop could easily be misconstrued as only executing when there were no iterations in the loop.

If this were taking place in a non-Python language, a language where the global variable space and the user variable space were separate, I'd drop various facts about the loop into the global variable space. For example, "broken" could be set to true if the loop was broken out of; "iterations" could contain the number of iterations that the loop had; and there could be lists of exceptions caught, and so on.
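
Even without separate variable spaces, those loop facts can be tracked explicitly in today's Python; a sketch, where "broken" and "iterations" are the hypothetical facts described above, maintained by hand:

```python
# A sketch of tracking loop facts by hand: "broken" records whether
# the loop was broken out of, "iterations" counts the passes made.
# (These names are the hypothetical ones discussed above.)

def find_first(pred, items):
    broken = False
    iterations = 0
    result = None
    for item in items:
        iterations += 1
        if pred(item):
            result = item
            broken = True
            break
    return result, broken, iterations
```

The caller can then inspect all three facts after the loop, rather than relying on the somewhat unintuitive else clause.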

URI Design


I'm not very good at choosing URIs for documents on my domains, which is to say that I'm not very good at choosing URIs which I won't later change my mind about. This is because websites and workflows are fluid and the organisation of a website is an inherent part of the data: I update a document's location just as I would update its content. But because I also believe that Cool URIs Don't Change, this creates the classic URI design problem.

Even though the web has been around for well over ten years, we're still in the very early stages of learning about URI design. In, apparently, 1997 or 1998, the W3C began using one of the now most popular kinds of URI design amongst the thinking population of the web, that of datespaces. The idea is that since you shouldn't move URIs, the only information that you should put in the URIs is stuff which doesn't change. And what doesn't change about a document or thing published on the Web? The date on which it was first published, of course. So paths like /YYYY/shortname are quite common at the W3C.
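
The datespace pattern is simple enough to mint mechanically; a sketch (the example shortname is invented):

```python
# A sketch of minting a W3C-style datespace path, /YYYY/shortname.
# The shortname passed in is an invented example.
import datetime

def datespace_path(shortname, year=None):
    # Default to the current year, the date of first publication
    if year is None:
        year = datetime.date.today().year
    return "/%d/%s" % (year, shortname)
```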

But even the W3C isn't consistent in this practice, and the system has its detractors, especially amongst people who can't remember dates very well. The arguments for and against the system get more complicated, but again the main point to note is that there's no great resolution. It's still more art than science at the moment.

I bought a new domain a couple of years ago and set up a site thinking that after having studied URIs and site design for a few intense years I was ready to make things work without messing up too much. Now, two years later, I'm wondering if there's any system which even comes close to working for URI design. When I first set up the website, I identified several different schemes whereby people coin new URIs:

I'd previously been doing some a) and b), but had decided by then that b) with some e) was a better idea. I should've known that establishing URIs based on ideas from the FHS was not a good idea; the FHS is quite horrifically designed and out of date, and the calls for its reformation have gone so far as to actually spur people to create deviant Linux distributions that no longer use it, such as GoboLinux. The reasons why GoboLinux uses its own non-FHS hierarchy have been explained fairly exhaustively by its creator, and make a fair overview of some of the ways in which the FHS doesn't pass muster.

The problem with approach b), that of short and snappy names, is that it's exceedingly difficult to come up with a name that is "good enough", let alone perfect, when it comes to URI design. TimBL's original Cool URIs article linked above mentions quite a few of the problems involved with this, but the problems extend to some highly specialised ones which will be different for each particular site that's being designed. For example, when I wanted to set up a weblog on my new site, I wanted to avoid /weblog/ and /blog/ and any variations thereupon, and instead opt for something very neutral. I couldn't think of any abbreviation of the weblog's name that would work, so eventually I plumped for /notes/, except that it still didn't really describe the weblog properly (the notes changed from notes to full-blown essays), and moreover I had already been using that directory for something else, so I had to make it dual use, which was pretty confusing and effectively quashed the old use I had for it.

So what, you may ask, about redirecting things if you're going to move them? The problem with that is that you're stealing paths from yourself, and you're also having to maintain the redirects which, if you've only got .htaccess server configuration files to play with, can get to be very inefficient if you have a huge number of redirects. Apache will read the .htaccess file for every single request that it gets. Sometimes I get around this problem by writing a CGI that has a bunch of redirects coded into it, but that's only really possible if I've moved an entire directory. Moreover, I expose as much of the internal workings of my site as possible, so the CGI script would be visible, which is fine by me but when I move something I want to retire the old URIs and make sure that people are using the new URIs. Otherwise I wouldn't've moved it in the first place. So I generally end up using robots.txt to filter out the old directory (again, if it's a directory I'm moving), which makes me worry that search engines will screw up those pages' rankings.
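
Such a redirect CGI can be tiny; a sketch in Python, where all the paths are hypothetical examples and the mapping lives in one place rather than in .htaccess:

```python
# A sketch of a redirect CGI: one script holds the old-to-new path
# mapping, keeping .htaccess small. All paths here are hypothetical.
import os

REDIRECTS = {
    "/old/essay": "/essays/uri-design",
    "/old/code": "/code/",
}

def redirect_for(path):
    """Return the new path for an old one, or None if unknown."""
    return REDIRECTS.get(path)

def main():
    # CGI entry point: emit a 301 if we know the path, else a 404
    target = redirect_for(os.environ.get("PATH_INFO", ""))
    if target:
        print("Status: 301 Moved Permanently")
        print("Location: " + target)
        print()
    else:
        print("Status: 404 Not Found")
        print("Content-Type: text/plain")
        print()
```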

Another way around this would be to have a completely different server configuration not using the filesystem for its backend, perhaps using a database filesystem, and perhaps even built on top of Apache. But the problem with experimental systems like that is that they are, well, experimental. If you're using a database you have to worry about your data getting corrupt, and migrating to other database systems in future, and so on. It makes your data less easy to access than if it's just floating around in your filesystem. I'm not sure what the perfect experimental filesystem for serving files via HTTP would be, but I've often thought about it; and I think that something that did revision control internally is a must, as well as perhaps identifying each file by just its hash, then enabling you to link that with various paths using a simple (but huge) path to hash to file mapping. On top of this I've imagined many URI systems, such as only allowing /[A-Za-z0-9]+ URIs (no more "/" segments!), and then if there are duplicate filenames, you just provide a disambiguation page instead of the actual file you're looking for. But already you can see the flaw in the system—namely that it's perplexing, hard to manage, and doesn't help people find what they're looking for. It does almost entirely remove the URI design question, but at a high cost.
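
The hash-mapped idea can be sketched in a few lines; this is a toy in-memory version, not a real filesystem backend:

```python
# A toy sketch of the imagined store: content is kept under its hash,
# and a simple table maps each published path to a hash. A real system
# would persist both and add revision control on top.
import hashlib

store = {}   # hash -> content
table = {}   # path -> hash

def publish(path, content):
    h = hashlib.sha1(content.encode("utf-8")).hexdigest()
    store[h] = content
    table[path] = h
    return h

def fetch(path):
    return store[table[path]]
```

Moving a document is then just re-pointing a path at a hash; the content itself never moves, and old paths can be left in place or retired independently.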

URI design may be expensive, but it has benefits too. Shorter and clearer URIs are easier to memorise, and so your pages become easier to recall in the future, and easier to refer to other people. Moreover, when you use short and clear URIs, people can even start to form impressions about how your site is structured from the URIs alone—that's what I do with the best of sites, at any rate—and many will know that you're a good designer. The ramifications can be very subtle, but cumulatively URI design is a very important thing. First impressions count: recent research (via BBC News) has shown that people evaluate the quality of new web pages in under 50 milliseconds.

I somewhat envy people who are able to design their websites according to some scheme and then stick to it; especially people who have very large base URIs for their sites, such as people who are using academic accounts or have homepages on other people's servers. One of the first sites to grab my attention in this way was that of Sampo Syreeni. Other people to have achieved zen-like levels of site design quality include Ian Hickson, and, of course, Dan Connolly. Some of my friends are also rather uncannily good at it: Aaron Swartz especially seems to invest not all that much time in URI design but due to enormous amounts of experience is really good at it and rarely if ever changes a URI. But most of all, Morbus Iff manages to excel himself when it comes to categorisation.

If nothing else, Morbus is a heavily repressed and frustrated librarian who, having never actually been a librarian as far as I know, works off his needs on his huge collections of movies, books, files, games, comics, magazines, and so on. He's a collector, and so he has to categorise by necessity; and he's the kind of guy who has to get things done right without being a perfectionist about it. He's a kind of pragmatic perfectionist. The best place to observe his tendencies is probably his lists directory. Note for a start how each of the entries in even the directory has its own neat little label. Note also how the right hand side of each of the labels currently (2006-01) makes a kind of wave motion. With any other person I would say that this is probably random chance; with Morbus it is almost certainly deliberate, and even if it's not it's very indicative of the level of attention to detail that he invests into such things.

His directory consists of albums, bookmarks, ebooks, videos, and quests. There are thousands of entries in each category, sometimes managed by hand, and sometimes partially automated. If you look around the site, you'll find scripts for doing many of the kinds of tasks that are needed to produce these kinds of lists and keep things categorised correctly.

But even Morbus doesn't seem entirely sure about URI design: for example, I would bet that instead of www.disobey.com he would prefer to use just disobey.com now. Instead of putting his weblog at /dnn/ where it's been for several years, he's now moved it to the / page, the front page. He previously eschewed datestamped directories (like I have done on and off), but now he's using them, though not all that often. The list goes on and on: and each of these things, I can be sure, has a very distinct set of reasons behind them because this is Morbus and Morbus Thinks About These Things; but all the same, URI design is what it is. It's impossible to get right all of the time.

I chose Morbus as an example because he's as avid about categorisation and URI design as me, but one criticism that may be levelled is that it's not worth fussing over, and that off-the-cuff ideas are often the best. Whilst it's true that off-the-cuff designing can often be the best approach, I think it's unfair to say that URI design isn't valuable, and furthermore I think it's unfair to say that it isn't interesting in its own right. It's an art and a science, and it's only slowly becoming more science than art, but all the same it's a distinct and flourishing hobby-out-of-necessity in some circles. By just thinking about it as a thing that we have to do, that's when error can creep in; and that's when links break. It might not be a particularly interesting hobby, but there are plenty of hobbies that I don't find interesting and yet still recognise them as hobbies. It's time to start recognising URI design as a hobby, indeed as a discipline, in and of itself.

In a sense, it already has been recognised. For to design a URI is to decide upon a classification scheme for a published resource, and the history and the art and the science of classification is long and involved. For as long as there have been books, people have been wondering how to order them on their shelves. By size? By title? By date? By colour? By topic? By author? The system that's most popular in libraries is, of course, large-by-topic and small-by-author. The Dewey Decimal system, developed by Melvil Dewey in 1876, is one of the most well known of the by-topic classification systems, but you don't have to look far to find other obvious ones. Dewey classified works; Roget classified words. John Wilkins even proposed an analytical language, a language whose words were ordered according to a grand classification scheme, later essayed upon so lucidly and humorously by Jorge Luis Borges. Wilkins was a bit of a dreamer, or, as Borges put it, he was one who "abounded in happy curiosities: theology, cryptography, music, the fabrication of transparent beehives, the course of an invisible planet, the possibility of a trip to the moon, the possibility and principles of a world language". It's not surprising that we should find the construction of a world language at the tail of the list since it has been proven again and again (and this is the whole point of Borges's essay) that there is no such thing as a universal classification scheme. You can't even come close. All you can do is to create local classification schemes and hope they'll be suitable enough for some particular use that they've been put to. One of the reasons that I admire the sites of Sampo, Ian, and Dan, for example, is that they've successfully created such a scheme and employed it and stuck to it.

Wilkins was constructing a language; we need merely to construct a website. But for both languages and websites, there is one thing in common: they are generally creative endeavours. The things that words describe are fixed, but words themselves can take any forms. The files on a website generally have already been written before the URIs are chosen, but any URI can be chosen for them. Moreover, the relationships between words, and the relationships between files, are flexible, and variant across time and context and many other things.

It's the flexibility of the associations that is one of the biggest pains of URI design. For example, I like to arrange my works so that they're clustered. In other words, I like to make sure that I don't have directories with thousands and thousands of files in them; it makes things harder to find. Nor do I like directories that only have one or two files in them; it isolates them and makes the URIs unnecessarily long. I mainly prefer large amounts of files to small amounts because at least the URIs are shorter, but that's a point for me to elucidate in a moment. My problem with wanting to cluster files is that I don't know how much I'm going to write about a particular subject in future. So I might start writing about Shakespeare, and I make a file called /notes/shaks in which I write about him. Then I decide that the page is becoming too long, and it's worth having a short URI for all the things, so I make a /shaks/ directory and start putting files in there. Then I decide that I'm interested in his life and time, so I create a /shaks/bio/ directory to hold more files. Then I decide that I'm only interested in his early history, so I create /shaks/bio/early/ and have to move early.html into separate files in that directory. Then I decide that all of that could do with a shorter URI and move it all to /shaksbio/. A half-real and half-contrived example, but you can see how the process just goes on and on.

So why cluster at all? If it's easier to have one big directory containing several thousands of files, why not do that? After all, the URIs would be short and there would be no specific disadvantage, right? In Ye Olde Daies, some filesystems couldn't even handle over a few thousand small files in a single directory, but hopefully things have moved on a bit from there. The biggest problem now is that some things just naturally need to be grouped. Sometimes it's an absolute requirement, such as when I'm distributing some code and I need a directory to make the project.tar.gz file from. Sure I could make a manifest file, i.e. a list of all the files that will go in the distribution tarball, but that's a pain, and it's difficult to maintain.

To dip back into the realm of experimentation and most optimal solutions, perhaps if it were possible to mount a manifest file as a virtual directory, the one-big-directory approach would not be so bad. But then why not go the other way? Viz, having lots of directories on the filesystem but making Apache recursively check through all these directories when a single filename is requested? (My problem with the latter has mainly been that it's then difficult to spot duplicates; and structure is still important.)

Note that when the flexible and ever-changing associations between things isn't present, it's a lot easier to come up with a stable hierarchy. This is obvious. For example, taxonomies in biology: once the scientific classification of organisms was discovered, it wasn't long before the system was reasonably concrete. But even the classification of organisms has had and continues to have problems. When Linnaeus started his classification of organisms it was so as to better identify; it wasn't until Darwin that we realised that the hierarchies were founded on the principle of common descent. And the hierarchies can still be really complex: for example, it's not known exactly how many species of citrus fruits there are. It's not even known roughly. It seems that Walter T. Swingle, a lumper, says there may be as few as 16 species; Tyozaburo Tanaka, splitter, says possibly as many as 145. That's a pretty major discrepancy!

Recently, I've found that the biggest thing that can help in URI design is not rushing the process. This means that a lot of my URI design is conducted well in advance of publication, and I test out the URI design for a long time beforehand, by using a temporary directory prefix in front of the projected path that I want to use. So if I come up with a path such as /hello/ that I want to use for a project, I'll put it under /temp/hello/ (say) until such a time as I feel it's ready to be moved to /hello/ itself. During the time that it's in /temp/, though, I won't be able to publicly publish the URI and this is a quite significant drawback. And even this system is far from foolproof; it just ensures that I don't make silly quick mistakes. It also makes me worry a lot about URI design, and fret over the paths to choose; I have many directories that are waiting to be published where the only thing that now needs deciding is where they should be published.

I've even been keeping a text file about each particular URI design issue, and there are several sections in it. It's interesting to see the extent to which the design of the URIs is really the design of the site, so on that front it's excellent to document it in that one place. Designing a site by its URIs is like designing a bookstore by the titles of the books that it's going to sell: kinda fun! And, actually, useful. Quite a few of the issues that I've put in this URI design file have gone on to be resolved because I've carefully documented them therein and been able to refer to all of my thoughts on the subject over time and integrate them together and find the best solution. But one big irony is that I haven't published this URI design file yet because—surprise, surprise—I can't come up with a decent URI for it yet.

Note that even the URIs for this weblog, miscoranda, don't fulfill all of my requirements for a good URI. In brief the requirements are:

These requirements often go against one another: it must be brief but palpable, memorable but persistent, applicable but aesthetically pleasing. There just aren't enough synonyms in the English language sometimes to be able to find a word for your document that you haven't already used and that looks good and is reasonably unique and so on... English, even though it's basically two or three languages smushed together (and more), doesn't have enough capacity to allow good URI design. And this is not to mention the fact that if you're using a more limited language you have even more of a problem.

As I was saying, a good case study is miscoranda, which is using URIs that are just integer based, such that each post has a number and each time I make a new post the number increases by one. This means that the URIs are very short—the path for this post will be /159—but what does /159 mean to anyone? Even I generally have no idea which posts were at which number, and I certainly don't expect my readers to. On another one of my weblogs, I have been using shortnames instead, i.e. brief and normally unique keywords contrived on the spot based upon the post's title, and though I thought I'd have a problem generating them and that I'd run afoul of my usual URI design issues, I've actually been fairly happy with them. I haven't had to move a single one so far. But that's a weblog, and a weblog is a relatively controlled environment of sequential posts; whereas a website can encompass any number of things and projects.
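
Contriving a shortname can be partly mechanised; a sketch, where the stop-word list is my own guess at the heuristic rather than anything I actually use:

```python
# A sketch of deriving a shortname from a post title: take the first
# word that isn't a stop word. The stop-word list is an invented guess.
import re

STOP = frozenset(["a", "an", "and", "in", "of", "on", "the", "to"])

def shortname(title):
    for word in re.findall(r"[a-z0-9]+", title.lower()):
        if word not in STOP:
            return word
    return None
```

Of course, a mechanical shortname still needs a human check for uniqueness and palpability, which is why contriving them on the spot works about as well.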

So, to conclude, URI design deserves a lot more credit as a discipline, even as a hobby, than it currently gets, and it's building on top of centuries of research with hierarchies and taxonomies and other classification schemes, but also has many new problems of its own. It's also a very individual thing, meaning that there is not just URI design in general but there are also many URI designs, a bit like snow and snowflakes. Each time you do something, you have to design a URI anew with a new set of principles, and though there are some requirements (as listed) that hold true for pretty much all URIs, each new project brings its own particular constraints and opportunities. This means that only experience can really help, and so time is an important part of the equation. Experimental solutions might gradually come into effect more and more in the future as one fundamental controller of URI design, that of the shape of our filesystems, changes over the years; but this is a slow process and in the meantime we need to make URIs that we'll be happy with both now and when any future advances occur.

This isn't talked about anywhere near enough, so if anybody has specific ideas about URI design they should feel free to reply to this entry to talk about it on the www-talk mailing list, or wherever it's most appropriate.

Good Advice on Strategies


Sean B. Palmer: "I fall into the trap of pegging a particular idea of requiring a certain level of QA and then not getting around to doing it because the extra work is offputting, even though releasing it at a lower QA and then bumping it up later would be much more valuable than keeping it to myself. I've not yet developed any reliable strategies for convincing myself to do that."

Dan Connolly: "Don't look for reliable strategies. Don't constrain yourself further. Just relax and release early and often."

Semantic Web Interest Group IRC Chat Logs for 2006-01-25

Another fallacy that I often succumb to along those lines is that Technology Will Help; i.e. that all I need is a better CMS or a better publishing mechanism and I'll be much more likely to write. It might be true and the perfect editor/CMS/website/whatever might be just around the corner, but in the meantime... DanC is lucky enough to have someone operate his weblog for him.

Having said that, I'm quite content to roll my own software, and actually there has been a steady improvement in how comfortable I've been with blog publishing software since I've been through numerous iterations of deciding what I want.

Validation at a Glance


I've been experimenting with various means of validating my site easily, and the other day I hit upon a good idea for at-a-glance validation: returning images based on the validation status of a referer. When you load an HTML document with images in it, most browsers will use the URI of the HTML document as the referer when fetching the images. So it follows that it would be easy to write a service that validates the containing HTML document of an image and delivers either a smiley face or a frowning face depending upon the status.

So I did just that: Validate With Logos. The documentation explains a bit more about how it works and how you can use it on your own site (it's just a small bash script and associated images). It's a bit slow because it uses the W3C's Validation Service: if you ask for your results in XML, the nsgmls output, it'll add an HTTP header giving the validation status. Long before XML-RPC webservices were hip and in vogue, the Validator was providing a faster and more robust solution.
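The general shape of such a service can be sketched as follows. This is a hypothetical reconstruction, not the actual Validate With Logos script; smile.png and frown.png are made-up filenames, though X-W3C-Validator-Status is the real header that the Validator sends.

```shell
#!/bin/bash
# Hypothetical sketch of a referer-validation CGI (not the real
# Validate With Logos script; the image filenames are made up).

# Extract "Valid" or "Invalid" from the Validator's raw response headers.
parse_status() {
    grep -i '^x-w3c-validator-status:' | tr -d '\r' | cut -d ' ' -f 2
}

# The CGI itself would then do something like:
#   status=$(curl -sI "http://validator.w3.org/check?uri=$HTTP_REFERER;output=xml" | parse_status)
#   echo "Content-Type: image/png"; echo
#   if [ "$status" = Valid ]; then cat smile.png; else cat frown.png; fi
```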

I have two test files for this service, validtest and invalidtest. One problem with the approach that I noticed from the tests is that when I load validtest it waits and fills in the image with a little smiling face, but when I go to invalidtest it puts a smiling face in there too (from the cache) before re-getting it and realising that it should be a frown. If I refresh them back and forth then they both do this—displaying the opposite status before the real one. I suspect that I could mess with some HTTP caching headers to change this behaviour.

Though the service itself is trivial, one of the things I most enjoyed in the writing of it was being pedantic with bash. I went through quite a few iterations of the twenty or so line bash script trying to make it as clear and as readable as it could possibly be. One of the biggest problems that I faced was getting the GET query string to be formatted clearly. I had been doing it like this:

   uri = ${URI//;/%3B};\
   doctype = ${DOCTYPE// /+};\
   output = xml\

And then joining it up in the URI with ${QUERY// /} (bash's neat replacement syntax for variables). The thing about this, though, is that I didn't like the trailing reverse soliduses, \, sprinkled all over the place. It hardly aids readability. So I asked the Swhackers if there was a way to remove all line breaks from a string in bash using only builtins and on a single line. I'd already managed to work out that this would do the trick:

VARIABLE=$(echo $(echo "$VARIABLE"))

But it's not really more readable. The nicest approach would have been if bash supported line breaks in its substitutions. Now, technically it does, since if you do:

VARIABLE=${VARIABLE//
/}

The line breaks disappear, but hard-coding a line break into the substitution is certainly not optimal. It doesn't understand either the quoted or unquoted "\n" character escape syntax, and nothing I tried would convince it to recognize a line break that wasn't hardcoded. I noticed that bash is actually quite inconsistent in its regexp syntax for these style substitutions, in fact: character classes and negative character classes work, and character ranges work, but negative character ranges don't. The documentation also leads me to believe that POSIX named character classes (e.g. [:lower:]) should work, too, but they don't. Anyway, eventually I just went with a call to tr:

   uri = ${URI//;/%3B};
   doctype = ${DOCTYPE// /+};
   output = xml

$(tr -d " \n" <<<$QUERY)

Which is a little more verbose than I would have liked and makes a call to an external program, but it's readable, even clear, and it doesn't have those annoying reverse soliduses all over the place. If you have a better idea, though, please let me know!
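Put together with some sample values (made up here purely for illustration), the whole technique comes out like this:

```shell
#!/bin/bash
# Self-contained demonstration of the query-string formatting above;
# the URI and doctype values are invented for illustration.
URI="http://example.org/a;b"
DOCTYPE="HTML 4.01 Strict"

QUERY="
   uri = ${URI//;/%3B};
   doctype = ${DOCTYPE// /+};
   output = xml
"

# tr strips the spaces and line breaks, leaving a compact query string.
RESULT=$(tr -d " \n" <<< "$QUERY")
echo "$RESULT"
# → uri=http://example.org/a%3Bb;doctype=HTML+4.01+Strict;output=xml
```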

Terje Bless is hitting up Jim Ley to produce a client side Javascript version of the validation by logo service, and if he gets round to it that's sure to be good. I am a little concerned, however, that the idea doesn't scale up all that well if you consider images embedded in pages that get millions of hits per day. It might be possible to cache replies from the validator and only ask for new validation results if the page has been modified since the last time of validation. But either way, I think that the overall concept is solid and useful.

User-Agent Abuse


According to RFC 2616, the User-Agent header is a statistical datapoint and capability preference, allowing the receiving site to serve pages based on what the client is known to be able to receive: "This [header] is for statistical purposes, the tracing of protocol violations, and automated recognition of user agents for the sake of tailoring responses to avoid particular user agent limitations." So if the limitations of your user agent change, you can modify the User-Agent field that you send appropriately.

With this in mind, I often set my User-Agent header to "Mozilla/5.0 (Something)" when I'm using wget, curl, or urllib in Python, but I'm often told that this is a bad thing, even an abuse of the header. That's absurd; the abuse is usually on the server side, not the client. I fake the User-Agent because many sites don't allow download via curl or wget—two that spring immediately to mind are google.com and f2o.org. These sites have a legitimate practical reason to do so: presumably a high percentage of the hits they receive from these user agents are crawlers and bots. With Google especially, this is going to cost them a lot of money, so blocking is prudent.
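For reference, here's what that faking looks like in practice. The -A and -U flags are curl's and wget's actual user-agent options; the Python version is sketched with the urllib.request module, and example.org is just a placeholder URL.

```python
import urllib.request

# Equivalent command-line incantations:
#   curl -A "Mozilla/5.0 (Something)" http://example.org/
#   wget -U "Mozilla/5.0 (Something)" http://example.org/

# Build a request that masquerades as a browser; example.org is a
# placeholder, and the UA string is the one mentioned above.
req = urllib.request.Request(
    "http://example.org/",
    headers={"User-Agent": "Mozilla/5.0 (Something)"},
)
# urllib.request.urlopen(req) would then fetch the page with the
# faked User-Agent header instead of urllib's default.
```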

But bots should adhere to robots.txt, and I'll bet that a significant portion of the requests that sites banning curl and wget receive from those clients are legitimate. Their filtering is, therefore, a technical solution to a societal problem. It's a bit like banning Firefox on a framed site because Firefox can display the content unframed. So whilst I realise that banning the clients server side is something that pragmatically just has to be done, a hack to save a lot of money and bandwidth, it's an abuse of the User-Agent header, and it's taking place on the server. Getting around it by faking the User-Agent header client side is abuse by neither morals nor specification, as long as the client is being used legitimately.

(Tip of the hat to John Cowan.)

GRDDL for XHTML Schemata Associations


For validation and as an editor hint in emacs's nxml-mode, I use a little RELAX NG Compact schema called xhtml.rnc which allows a subset of XHTML 1.0 Strict. So any documents that I write conforming to it should also hopefully be valid XHTML 1.0 Strict. But how do I make the association between an instance document and this schema formal?

The obvious choice would be to use the schema as a value of the profile attribute, which is designed either as a globally unique name for the dispatch of arbitrary facilities, or as a namespace for @rel and @rev. My use of it here would be for the former purpose:

<html xmlns="http://www.w3.org/1999/xhtml">
<head profile="http://inamidst.com/proj/quality/xhtml.rnc">

Sadly, though, this conflicts with languages which provide facilities under the latter purpose; in other words, if I use the profile attribute for my own purposes, I won't be able to use it when some other language comes along that needs it. One such language that may well be very popular in the future, and is already specified, is GRDDL. But GRDDL is special in that it is itself a generalised mechanism for allowing arbitrary extra structure to be added to HTML, in such a way as to be clearly authorised by the creator of the document.

So it would be possible to use GRDDL to provide this schema hint, as long as we used the GRDDL mechanism properly. Here's what I envisage:

<html xmlns="http://www.w3.org/1999/xhtml">
<head profile="http://www.w3.org/2003/g/data-view">
<link rel="transformation" href="http://example.org/xhtml/transform" />
<link rel="schema" href="http://example.org/xhtml/schema" />

This is okay because HTML 4.01 says that "Authors may wish to define additional link types not described in this specification. If they do so, they should use a profile to cite the conventions used to define the link types." That means it's up to GRDDL to define what rel="schema" means; and GRDDL doesn't seem to mind, as long as you obey its rel="transformation" way of doing things, so the definition for rel="schema" comes from the output of the transformation. The transform URI here could be used as a globally unique value itself, if you know the rel="schema" convention that it formalises.

For more information on this topic, see the #swig chat that I had with Dan Connolly as to whether it was valid to use rel attribute values not defined by GRDDL for your own use even when using the GRDDL profile.

Antikythera Mechanism in Python


I've ported the Antikythera Mechanism to Python: antikythera.py. The Antikythera Mechanism is an ancient Greek astronomical calculator made from a gaggle of gears, so it's a kind of digital ratioing machine. I just modelled the gears in Python and then bound it all together per some schematics of the mechanism. Here's an example of how to use the script:

$ ./antikythera.py 20
Sun: 20.0°
Moon: 267.36842°
4 Year Dial: -5.0°
Synodic Month: -247.36842°
Lunar Year: -20.61404°

The input argument is the number of degrees clockwise through which to turn the drive wheel. The outputs are the number of degrees through which the respective output gears have been moved. The latter three go anticlockwise because you look at them from the other side of the mechanism, behind the base plate.
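What the model boils down to is a net gear ratio per dial. As a rough illustration (a hypothetical sketch, not the internals of antikythera.py; the ratio constants here are inferred from the sample output above), turning the drive wheel just multiplies the input angle through each train:

```python
# Hypothetical sketch of the dial trains as net gear ratios; these
# constants are inferred from the sample output above, not taken from
# antikythera.py itself. Negative ratios mark the dials viewed from
# the back of the mechanism, hence their anticlockwise motion.
RATIOS = {
    "Sun": 1.0,
    "Moon": 254 / 19,            # 254 sidereal months in 19 years
    "4 Year Dial": -1 / 4,
    "Synodic Month": -235 / 19,  # Metonic cycle: 235 months in 19 years
    "Lunar Year": -235 / 228,    # 235/19 months spread over 12-month years
}

def turn(degrees):
    """Turn the drive wheel and report each dial's movement in degrees."""
    return {dial: round(degrees * ratio, 5) for dial, ratio in RATIOS.items()}
```

With these ratios, turn(20) reproduces the figures shown above.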

Pluvo


For all of you who've been waiting for me to release this, and all of you who haven't, my Pluvo Programming Language project is now go. From the homepage: "Pluvo is a nascent experimental scripting language with an easy to use syntax, built in test facilities, and modern datatypes. High level and data structured, it makes things easier for the programmer by incorporating idioms from a wide range of languages in a consistent manner."

I had been working hard to get it to a level of maturity where it was at least a curiosity and had the major structural functions working so that people could get a feel for them. It's also, hopefully, at a level where people can actually have a go at adding bits themselves, not that I expect that to happen particularly. This means that I haven't really discussed much of the feature set with anyone, which has been very difficult to avoid!

So there you go: feel free to download it, poke at it, get it running and so on, but don't expect too much from it. It's mainly the concept of the thing and the ideas implemented in it that are the fun part. It might even prove to be something that I continue to the point of actually maintaining various of my scripts in it; especially, perhaps, CGIs, which I think could turn out to be quite nice written in Pluvo.

Created by @sbp