Saturday, January 31, 2009

Humor: This Statement is False

I was reading some C code the other day, and it said:
/* no comment */

Thursday, January 29, 2009

Funny: IronPort Video

I worked at IronPort for two and a half years. I really enjoyed it. Later, they were bought by Cisco. Here's a video from Cisco for an IronPort product: Weird Things Happen in Office. It's sort of a Rube Goldberg setup.

Monday, January 26, 2009

A REST-RPC Hybrid for Distributed Computing

It's everything the Web was never meant to do...and so much more!

While I was doing some consulting with Warren DeLano of PyMOL, we envisioned a REST-RPC hybrid for use in distributed computing.

Imagine an RPC system built on top of REST. (I know that REST enthusiasts tend to really dislike the RPC style, but bear with me.) That means two things. It means you're calling remote functions, but it also means you should take advantage of HTTP as much as possible. That means you should use URLs, proxies, links, the various HTTP verbs, etc.

Imagine every RPC function takes one JSON value as input and provides one JSON value as output. Such a system could be used with relative ease in a wide range of programming languages, even C. I'm thinking of a P2P system with many servers that talk to each other using RESTful Web APIs, but do so in an XML-RPC sort of way. However, instead of using XML, they use JSON.
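To make the "one JSON value in, one JSON value out" idea concrete, here is a minimal sketch in Python. The names REGISTRY, rpc_method, word_count, and dispatch are all made up for illustration; a real system would sit behind an HTTP server rather than being called directly.

```python
import json

# Hypothetical registry: maps a method name to a plain Python function.
REGISTRY = {}


def rpc_method(func):
    """Register func so that it can be called by name."""
    REGISTRY[func.__name__] = func
    return func


@rpc_method
def word_count(value):
    """Example method: one JSON value in (a string), one JSON value out."""
    return {"words": len(value.split())}


def dispatch(method_name, request_body):
    """Decode one JSON value, call the method, and encode one JSON value."""
    argument = json.loads(request_body)
    result = REGISTRY[method_name](argument)
    return json.dumps(result)


print(dispatch("word_count", '"a REST-RPC hybrid"'))  # prints {"words": 3}
```

Since both sides only ever see JSON text, the caller doesn't need to know what language the method is written in.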

Here's what I think URLs should look like: http://server:port/path/to/object/;method?arg1=foo&arg2=bar. This would convert to a JSON list of args. There could be an alternate syntax for the arguments, such as ?__argument__=some_encoded_json if the JSON structure is deep.
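Here is a rough sketch of how a server might take such a URL apart, using only the Python standard library. The function name decode_call and the exact return shape are my own assumptions; the URL layout and the __argument__ parameter come from the description above.

```python
import json
from urllib.parse import parse_qsl, urlsplit


def decode_call(url):
    """Split a call URL into (object path, method name, arguments)."""
    parts = urlsplit(url)
    # Everything after the last ";" in the path is the method name.
    object_path, _, method = parts.path.rpartition(";")
    pairs = parse_qsl(parts.query)
    for name, value in pairs:
        if name == "__argument__":
            # Deep structures travel as a single URL-encoded JSON value.
            return object_path, method, json.loads(value)
    # Otherwise the query values, in order, become the list of args.
    return object_path, method, [value for _, value in pairs]


print(decode_call("http://server:8080/path/to/object/;method?arg1=foo&arg2=bar"))
```

Note that the simple form only gives you strings. Anything richer (numbers, nested lists, dicts) would have to go through the __argument__ form.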

The method may choose to support GET or POST in the usual RESTful way. GET requests should not have side effects.

To support asynchronous operation, you can make an RPC and pass it a callback URL that the remote server can call later via RPC. That requires some way of registering a temporary callback URL. You might even pass a list of callbacks in order to create a pipeline. This is like a stack of continuations.
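The callback list can be sketched without any networking at all. In the toy below, the servers dict maps made-up URLs to local functions that stand in for remote methods, and run_with_callbacks is a hypothetical name. A real system would make an HTTP request at each step, and each server would forward the remaining callbacks itself.

```python
def run_with_callbacks(servers, url, value, callbacks):
    """Call url with value, then pass each result to the next callback URL.

    servers maps a URL to a local function standing in for the remote
    method (an assumption for this sketch; no HTTP is involved).
    """
    pending = list(callbacks)
    while True:
        value = servers[url](value)
        if not pending:
            return value
        # The next callback URL receives the previous result as its input.
        url = pending.pop(0)


servers = {
    "http://a:8000/;double": lambda x: x * 2,
    "http://b:8000/;increment": lambda x: x + 1,
}
print(run_with_callbacks(servers, "http://a:8000/;double", 5,
                         ["http://b:8000/;increment"]))  # prints 11
```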

Various things like HTTP auth may be used. Various HTTP status codes should be used in ways that make sense. Internally, bindings for languages like Python should translate HTTP error status codes to exceptions. If a server returns an error status code, a traceback object in JSON makes sense.
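Here is roughly what I have in mind for the Python bindings. RemoteError, decode_response, and the {"traceback": [...]} error shape are all assumptions made for the sake of the example, not part of any existing library.

```python
import json


class RemoteError(Exception):
    """Raised locally when the remote server answers with an error status."""

    def __init__(self, status, traceback_lines):
        super().__init__("remote call failed with HTTP %d" % status)
        self.status = status
        self.traceback_lines = traceback_lines


def decode_response(status, body):
    """Return the JSON result, or raise RemoteError for 4xx/5xx statuses."""
    payload = json.loads(body)
    if status >= 400:
        # Assumed error shape: {"traceback": ["line", ...]}.
        lines = payload.get("traceback", []) if isinstance(payload, dict) else []
        raise RemoteError(status, lines)
    return payload


try:
    decode_response(500, '{"traceback": ["ValueError: bad input"]}')
except RemoteError as error:
    print(error.status, error.traceback_lines)
```

The caller gets an ordinary exception, and the remote traceback is still there if you want to log it.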

It makes sense to do something useful for actual UNIX pipes. For instance, if an app is started in a pipeline, it can output URLs to STDOUT so that the two processes connected via a UNIX pipe can connect using the protocol.

This platform would allow you to play to the strengths, and work around the weaknesses, of any platform such as C, Python, IronPython, MPI in C, etc. There might be a server written in C that can use SIMD or the GPU. There might be another server that can output stuff to the screen within an OpenGL context.

It needs some way to stream data. Perhaps the protocol is used to do all the setup, and then it gives you a URL where you can stream stuff in any format. After all, it is a web server. You could probably even stream JSON objects, one per line.
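Streaming JSON objects one per line is easy to consume. Here's a small sketch; iter_json_lines is a made-up name, and the StringIO object stands in for whatever file-like object an HTTP client would hand you for the streaming URL.

```python
import io
import json


def iter_json_lines(stream):
    """Yield one decoded JSON value per non-blank line of stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# io.StringIO stands in for a streaming HTTP response body.
fake_stream = io.StringIO('{"n": 1}\n\n{"n": 2}\n')
for value in iter_json_lines(fake_stream):
    print(value)
```

Because it's a generator, the consumer can start working on the first object before the last one has been sent.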

I'd like to thank Warren DeLano for giving me an opportunity to blog about our conversations.

Sunday, January 25, 2009

Logic: Occam's Razor

I just read the summary of Occam's Razor on Wikipedia, and it turns out that most people, including me, don't understand what he was really trying to say. Specifically, it does not mean "All other things being equal, the simplest solution is the best." Here's the quote:
Ockham's razor (sometimes spelled Occam's razor) is a principle attributed to the 14th-century English logician and Franciscan friar, William of Ockham. The principle states that the explanation of any phenomenon should make as few assumptions as possible, eliminating those that make no difference in the observable predictions of the explanatory hypothesis or theory. The principle is often expressed in Latin as the lex parsimoniae ("law of parsimony" or "law of succinctness"): "entia non sunt multiplicanda praeter necessitatem", roughly translated as "entities must not be multiplied beyond necessity". An alternative version "Pluralitas non est ponenda sine necessitate" translates "plurality should not be posited without necessity". [1]

This is often paraphrased as "All other things being equal, the simplest solution is the best." In other words, when multiple competing hypotheses are equal in other respects, the principle recommends selecting the hypothesis that introduces the fewest assumptions and postulates the fewest entities. It is in this sense that Occam's razor is usually understood. This is, however, incorrect. Occam's razor is not concerned with the simplicity or complexity of a good explanation as such; it only demands that the explanation be free of elements that have nothing to do with the phenomenon (and the explanation).

Originally a tenet of the reductionist philosophy of nominalism, it is more often taken today as an heuristic maxim (rule of thumb) that advises economy, parsimony, or simplicity, often or especially in scientific theories. Here the same caveat applies to confounding topicality with mere simplicity. (A superficially simple phenomenon may have a complex mechanism behind it. A simple explanation would be simplistic if it failed to capture all the essential and relevant parts.)

Saturday, January 17, 2009

Linux: Fun with Big Files

Recently, I was playing with a 150G compressed XML file containing a Wikipedia dump. Trying to delete the file gave me a fun glimpse into some Linux behavior that I normally wouldn't notice.

I had a job running that was parsing the file. I hit Control-c to kill the job. Then I deleted the file. "rm" returned immediately. I thought to myself, wow, that was fast. Hmm, I'm guessing that all it had to do was unlink the file. I would have figured it would have taken longer to mark all the data blocks as free.

I ran "df -h" to see if there was now 150G more free space on my drive. There was no new free space. Hmm, that's weird. I futzed around for a bit. I started cycling through my tabs in screen. I discovered that I had only killed the job that was tailing one of the files, not the actual job itself.

This reminded me that Linux uses reference counting for files. Even if you can't get to a file through the filesystem, a file might still exist because a program has an open file handle for it. That's how "tempfile.TemporaryFile" works.
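You can watch this happen from Python. The sketch below assumes a POSIX system such as Linux (Windows generally won't let you delete an open file), and read_after_unlink is just a name I made up for the demonstration.

```python
import os
import tempfile


def read_after_unlink(data):
    """Write data to a file, delete its name, then read it via the handle."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w+b") as handle:
        handle.write(data)
        handle.flush()
        os.unlink(path)                     # The name is gone...
        name_exists = os.path.exists(path)  # ...so this is False...
        handle.seek(0)
        return name_exists, handle.read()   # ...but the data is still there.


print(read_after_unlink(b"still here"))
```

The disk space only comes back once the last open handle is closed, which is why "df -h" didn't budge while the parsing job was still running.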

I killed the job. I ran "df -h". I now saw a bunch of free space. For some reason, even though I had hit Control-c, the job hadn't returned, and I hadn't been given a new shell prompt. Hitting Control-c again didn't help. In fact, I couldn't even hit Control-z to "kill -9 %1" the job. Normally, that always works. Hmm, that's weird.

I switched to another tab in screen. I ran "ps aux". I didn't see the job. I switched back to my other tab. The shell was still frozen. Hmm, that's really weird.

I typed "df -h" over and over again. I could see free disk space slowly returning. After several minutes, I finally got a new shell prompt. I could now see 150G of new free disk space.

Here's what I think happened. When I hit Control-c, the program exited. The kernel removed the process from the process table. While doing this, it closed the open file handle to the 150G file. Next, it had to start freeing the file's data blocks. 150G is a lot of blocks to free. Hence, even though there was no entry in the process table (hence the program was not visible to "ps aux"), the process was still stuck in kernel mode freeing up blocks.

Linux is fun ;)

Thursday, January 15, 2009

Personal: Looking for Work

Hey guys, I'm looking for work.

I'd prefer to work part-time from home since I'm already working part-time from home on another startup. I'm better at building startups from scratch than I am at rescuing crufty code. I'm better at engineering scalable systems than I am at whipping out throw-away prototypes in a hurry.

I have clean code, a friendly demeanor, and great references. Here's my resume.

By the way, sorry for the advertisement ;)

Wednesday, January 14, 2009

Computer Science: The Autoturing Test

Can a personality construct in a virtual world apply a Turing test to itself?

In Neuromancer, William Gibson plays around with the idea of personality constructs. Dixie is a hacker who died, but his "personality" was recorded to a ROM. Within the matrix, you can interact with Dixie, and in fact, Dixie won't know he's a personality construct until you tell him.

Another thing that happens in the book is that Case flatlines. When he flatlines, time slows to a crawl, and he proceeds to "live" within the matrix at a more fundamental level.

My question is, is there some test that Case and Dixie can apply to themselves that will help each of them to figure out who is the real human?

Of course, this question is kind of meaningless at this point. It assumes that we'll someday be able to create personality constructs, but that they won't be the same as "the real thing."

Nonetheless, the deeper question remains. Is there a test that a human and an AI can apply to themselves that would lead the human to classify himself as a human and the AI to classify itself as an AI? Can the AI create the test itself?

Virtualization: VirtualBox

I've been using VMware Fusion, but I decided to give VirtualBox a try. It's from Sun. To summarize:
  • It seems faster than VMware Fusion
  • It's free and mostly open source
  • It's just a bit rougher around the edges
What do I mean it's mostly open source? There are two versions. According to their docs:
The VirtualBox Open Source Edition (OSE) is the one that has been released under the GPL and comes with complete source code. It is functionally equivalent to the full VirtualBox package, except for a few features that primarily target enterprise customers. This gives us a chance to generate revenue to fund further development of VirtualBox.

Please note that the Open Source Edition does not include an installer or setup utilities, as it is mainly aimed at developers and Linux distributors
What this means in practice is that it's not easy to use the open source version since there are no precompiled binaries and no installer. Hence, you're stuck with the free, but not open source version. The two things that I actually care about that are missing from the open source version are USB support and a gigabit ethernet controller. Oh well. That's still better than what I had to pay for VMware Fusion.

As for speed, I haven't actually timed it, but the BIOS stage of booting is crazy fast, and installing Ubuntu didn't seem to take forever like it did under VMware Fusion. Of course, this could be a figment of my imagination. I can't remember if I had the same amount of RAM when I installed Ubuntu under VMware Fusion either, so take my comments with a grain of salt. I will say that sound seems smoother.

Speaking of sound, by default it's turned off. That was easy to fix.

By default it uses NAT, and the host computer cannot connect to the guest computer. Since I like to log in over ssh, that was a no go. I figured out how to switch to "Host Interface Networking", and I was happy again. In general, this is one area where VMware Fusion seemed to just work.

Just like VMware Fusion, VirtualBox has custom kernel mods for Linux. Installing them was easy. Once I did, the mouse was perfectly integrated between the host and guest computers. Furthermore, full screen mode now uses the same resolution as my Mac. Sweet!

To be fair, VMware Fusion does the same thing. Of course, this only works for Linux and Windows. There are no kernel mods available (that I know of) for other operating systems like FreeBSD.

One more feature that I haven't bothered trying out is:
Shared folders. Like many other virtualization solutions, for easy data exchange between hosts and guests, VirtualBox allows for declaring certain host directories as "shared folders", which can then be accessed from within virtual machines.
Anyway, it's good stuff. I'm guessing that VMware Fusion is probably better if you need to run a Windows client (because of all the "Fusion" functionality), but if you just need to run a Linux client, VirtualBox is free and good.

Tuesday, January 13, 2009

Python: Parsing Wikipedia Dumps Using SAX

Wikipedia provides a massive dump containing all edits on all articles. It's about 150GB and takes about a week to download. The file is http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-meta-history.xml.bz.

Clearly, to parse such a large file, you can't use a DOM API. You must use something like SAX. There is a Python library to parse this file and shove it into a database, but I actually don't want it in a database. Here's some code to parse the data, or at least the parts I care about:

Updated! Fixed the fact that the characters method must apply its own buffering. Fixed an encoding issue.
#!/usr/bin/env python

"""Parse the enwiki-latest-pages-meta-history.xml file."""

from __future__ import with_statement

from contextlib import closing
from StringIO import StringIO
from optparse import OptionParser
import sys
from xml.sax import make_parser
from xml.sax.handler import ContentHandler

from blueplate.parsing.tsv import create_default_writer

__docformat__ = "restructuredtext"


class WPXMLHandler(ContentHandler):

    """Parse the enwiki-latest-pages-meta-history.xml file.

    This parser looks for just the things we're interested in. It maintains a
    tag stack because the XML format actually does have some depth and context
    does actually matter.

    """

    def __init__(self, page_handler):
        """Do some setup.

        page_handler
            This is a callback. It will be called with a page in the form
            of a dict such as::

                {'id': u'8',
                 'revisions': [{'timestamp': u'2001-01-20T15:01:12Z',
                                'user': u'ip:pD950754B.dip.t-dialin.net'},
                               {'timestamp': u'2002-02-25T15:43:11Z',
                                'user': u'ip:Conversion script'},
                               {'timestamp': u'2006-09-08T04:16:46Z',
                                'user': u'username:Rory096'},
                               {'timestamp': u'2007-05-24T14:41:48Z',
                                'user': u'username:Ngaiklin'},
                               {'timestamp': u'2007-05-25T17:12:09Z',
                                'user': u'username:Gurch'}],
                 'title': u'AppliedEthics'}

        """
        self._tag_stack = []
        self._page_handler = page_handler

    def _try_calling(self, method_name, *args):
        """Try calling the method with the given method_name.

        If it doesn't exist, just return.

        Note, I don't want to accept **kwargs because:

        a) I don't need them yet.
        b) They're really expensive, and this function is going to get called
           a lot.

        Let's not think of it as premature optimization, let's think of it as
        avoiding premature flexibility ;)

        """
        try:
            f = getattr(self, method_name)
        except AttributeError:
            pass
        else:
            return f(*args)

    def startElement(self, name, attr):
        """Dispatch to methods like _start_tagname."""
        self._tag_stack.append(name)
        self._try_calling('_start_' + name, attr)
        self._setup_characters()

    def _start_page(self, attr):
        self._page = dict(revisions=[])

    def _start_revision(self, attr):
        self._page['revisions'].append({})

    def endElement(self, name):
        """Dispatch to methods like _end_tagname."""
        self._teardown_characters()
        self._try_calling('_end_' + name)
        self._tag_stack.pop()

    def _end_page(self):
        self._page_handler(self._page)

    def _setup_characters(self):
        """Setup the callbacks to receive character data.

        The Parser will call the "characters" method to report each chunk of
        character data. SAX parsers may return all contiguous character data
        in a single chunk, or they may split it into several chunks. Hence,
        this class has to take care of some buffering.

        """
        method_name = '_characters_' + '_'.join(self._tag_stack)
        if hasattr(self, method_name):
            self._characters_buf = StringIO()
        else:
            self._characters_buf = None

    def characters(self, s):
        """Buffer the given characters."""
        if self._characters_buf is not None:
            self._characters_buf.write(s)

    def _teardown_characters(self):
        """Now that we have the entire string, put it where it needs to go.

        Dispatch to methods like _characters_some_stack_of_tags. Drop strings
        that are just whitespace.

        """
        if self._characters_buf is None:
            return
        s = self._characters_buf.getvalue()
        if s.strip() == '':
            return
        method_name = '_characters_' + '_'.join(self._tag_stack)
        self._try_calling(method_name, s)

    def _characters_mediawiki_page_title(self, s):
        self._page['title'] = s

    def _characters_mediawiki_page_id(self, s):
        self._page['id'] = s

    def _characters_mediawiki_page_revision_timestamp(self, s):
        self._page['revisions'][-1]['timestamp'] = s

    def _characters_mediawiki_page_revision_contributor_username(self, s):
        self._page['revisions'][-1]['user'] = 'username:' + s

    def _characters_mediawiki_page_revision_contributor_ip(self, s):
        self._page['revisions'][-1]['user'] = 'ip:' + s

def parsewpxml(file, page_handler):
    """Call WPXMLHandler.

    file
        This is the name of the file to parse.

    page_handler
        See WPXMLHandler.__init__.

    """
    parser = make_parser()
    wpxmlhandler = WPXMLHandler(page_handler)
    parser.setContentHandler(wpxmlhandler)
    parser.parse(file)


def main(argv=None,  # Defaults to sys.argv.
         input=sys.stdin, _open=open):

    """Run the application.

    The arguments are really there for dependency injection.

    """

    def page_handler(page):
        """Write the right bits to the right files."""
        try:
            atoms_writer.writerow((page['id'], page['title']))
            for rev in page['revisions']:
                if not 'user' in rev:
                    continue
                triplets_writer.writerow(
                    (rev['user'], rev['timestamp'], page['id']))
        except Exception, e:
            print >> sys.stderr, "%s: %s\n%s" % (parser.get_prog_name(),
                                                 e, page)

    global parser
    parser = OptionParser()
    parser.add_option('--atoms', dest='atoms',
                      help="store atom ids and names in this file",
                      metavar='FILE.tsv')
    parser.add_option('--user-timestamp-atom-triplets',
                      dest='user_timestamp_atom_triplets',
                      help="store (user, timestamp, atom) triplets in this file",
                      metavar='FILE.tsv')
    (options, args) = parser.parse_args(args=argv)
    if args:
        parser.error("No arguments expected")
    for required in ('atoms', 'user_timestamp_atom_triplets'):
        if not getattr(options, required):
            parser.error('The %s parameter is required' % required)

    LINE_BUFFERED = 1
    with closing(_open(options.atoms, 'w', LINE_BUFFERED)) as atoms_file:
        with closing(_open(options.user_timestamp_atom_triplets,
                           'w', LINE_BUFFERED)) as triplets_file:
            atoms_writer = create_default_writer(atoms_file)
            triplets_writer = create_default_writer(triplets_file)
            parsewpxml(input, page_handler)


if __name__ == '__main__':
    main()
I created enwiki-latest-pages-meta-history.test.xml as a short snippet of the XML just so I could do some testing:
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/xml/export-0.3.xsd" version="0.3" xml:lang="en">
<siteinfo>
<sitename>Wikipedia</sitename>
<base>http://en.wikipedia.org/wiki/Main_Page</base>
<generator>MediaWiki 1.13alpha</generator>
<case>first-letter</case>
<namespaces>
<namespace key="-2">Media</namespace>
<namespace key="-1">Special</namespace>
<namespace key="0" />
<namespace key="1">Talk</namespace>
<namespace key="2">User</namespace>
<namespace key="3">User talk</namespace>
<namespace key="4">Wikipedia</namespace>
<namespace key="5">Wikipedia talk</namespace>
<namespace key="6">Image</namespace>
<namespace key="7">Image talk</namespace>
<namespace key="8">MediaWiki</namespace>
<namespace key="9">MediaWiki talk</namespace>
<namespace key="10">Template</namespace>
<namespace key="11">Template talk</namespace>
<namespace key="12">Help</namespace>
<namespace key="13">Help talk</namespace>
<namespace key="14">Category</namespace>
<namespace key="15">Category talk</namespace>
<namespace key="100">Portal</namespace>
<namespace key="101">Portal talk</namespace>
</namespaces>
</siteinfo>
<page>
<title>AppliedEthics</title>
<id>8</id>
<revision>
<id>233189</id>
<timestamp>2001-01-20T15:01:12Z</timestamp>
<contributor>
<ip>pD950754B.dip.t-dialin.net</ip>
</contributor>
<minor />
<comment>*</comment>
<text xml:space="preserve">Something the Marketing Dept. will never fully understand.

</text>
</revision>
<revision>
<id>15898943</id>
<timestamp>2002-02-25T15:43:11Z</timestamp>
<contributor>
<ip>Conversion script</ip>
</contributor>
<minor />
<comment>Automated conversion</comment>
<text xml:space="preserve">#REDIRECT [[Applied ethics]]
</text>
</revision>
<revision>
<id>74466767</id>
<timestamp>2006-09-08T04:16:46Z</timestamp>
<contributor>
<username>Rory096</username>
<id>750223</id>
</contributor>
<comment>cat rd</comment>
<text xml:space="preserve">#REDIRECT [[Applied ethics]] {{R from CamelCase}}</text>
</revision>
<revision>
<id>133180238</id>
<timestamp>2007-05-24T14:41:48Z</timestamp>
<contributor>
<username>FunnyCharé</username>
<id>4477979</id>
</contributor>
<minor />
<comment>Robot: Automated text replacement (-\[\[(.*?[\:|\|])*?(.+?)\]\] +\g<2>)</comment>
<text xml:space="preserve">#REDIRECT Applied ethics {{R from CamelCase}}</text>
</revision>
<revision>
<id>133452279</id>
<timestamp>2007-05-25T17:12:09Z</timestamp>
<contributor>
<username>Gurch</username>
<id>241822</id>
</contributor>
<minor />
<comment>Revert edit(s) by [[Special:Contributions/FunnyCharé|FunnyCharé]] to last version by [[Special:Contributions/Rory096|Rory096]]</comment>
<text xml:space="preserve">#REDIRECT [[Applied ethics]] {{R from CamelCase}}</text>
</revision>
</page>
<page>
<title>AccessibleComputing</title>
<id>10</id>
<revision>
<id>233192</id>
<timestamp>2001-01-21T02:12:21Z</timestamp>
<contributor>
<username>RoseParks</username>
<id>99</id>
</contributor>
<comment>*</comment>
<text xml:space="preserve">This subject covers

* AssistiveTechnology

* AccessibleSoftware

* AccessibleWeb

* LegalIssuesInAccessibleComputing

</text>
</revision>
<revision>
<id>862220</id>
<timestamp>2002-02-25T15:43:11Z</timestamp>
<contributor>
<ip>Conversion script</ip>
</contributor>
<minor />
<comment>Automated conversion</comment>
<text xml:space="preserve">#REDIRECT [[Accessible Computing]]
</text>
</revision>
<revision>
<id>15898945</id>
<timestamp>2003-04-25T22:18:38Z</timestamp>
<contributor>
<username>Ams80</username>
<id>7543</id>
</contributor>
<minor />
<comment>Fixing redirect</comment>
<text xml:space="preserve">#REDIRECT [[Accessible_computing]]</text>
</revision>
<revision>
<id>56681914</id>
<timestamp>2006-06-03T16:55:41Z</timestamp>
<contributor>
<username>Nzd</username>
<id>516514</id>
</contributor>
<minor />
<comment>fix double redirect</comment>
<text xml:space="preserve">#REDIRECT [[Computer accessibility]]</text>
</revision>
<revision>
<id>74466685</id>
<timestamp>2006-09-08T04:16:04Z</timestamp>
<contributor>
<username>Rory096</username>
<id>750223</id>
</contributor>
<comment>cat rd</comment>
<text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>
</revision>
<revision>
<id>133180268</id>
<timestamp>2007-05-24T14:41:58Z</timestamp>
<contributor>
<username>FunnyCharé</username>
<id>4477979</id>
</contributor>
<minor />
<comment>Robot: Automated text replacement (-\[\[(.*?[\:|\|])*?(.+?)\]\] +\g<2>)</comment>
<text xml:space="preserve">#REDIRECT Computer accessibility {{R from CamelCase}}</text>
</revision>
<revision>
<id>133452289</id>
<timestamp>2007-05-25T17:12:12Z</timestamp>
<contributor>
<username>Gurch</username>
<id>241822</id>
</contributor>
<minor />
<comment>Revert edit(s) by [[Special:Contributions/FunnyCharé|FunnyCharé]] to last version by [[Special:Contributions/Rory096|Rory096]]</comment>
<text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>
</revision>
<!-- For some reason, I encountered an edit with no IP or username. -->
<revision>
<id>9339391</id>
<timestamp>2005-01-13T17:22:17Z</timestamp>
<contributor>
<ip />
</contributor>
<text xml:space="preserve">blah, blah, blah</text>
</revision>
</page>
</mediawiki>
Here's my test code:
"""Test the parsewpxml module."""

from StringIO import StringIO # cStringIO won't work here.
import os

from nose.tools import assert_true, assert_equal

from projects.wp import parsewpxml

XML_FILE = os.path.join(os.path.dirname(__file__),
'enwiki-latest-pages-meta-history.test.xml')

__docformat__ = "restructuredtext"


def test_xml_file_exists():
    assert_true(os.path.exists(XML_FILE))


def test_parsewpxml():

    def page_handler(page):
        page_list.append(page)

    expected = \
        [{'id': u'8',
          'revisions': [{'timestamp': u'2001-01-20T15:01:12Z',
                         'user': u'ip:pD950754B.dip.t-dialin.net'},
                        {'timestamp': u'2002-02-25T15:43:11Z',
                         'user': u'ip:Conversion script'},
                        {'timestamp': u'2006-09-08T04:16:46Z',
                         'user': u'username:Rory096'},
                        {'timestamp': u'2007-05-24T14:41:48Z',
                         'user': u'username:FunnyChar\xe9'},
                        {'timestamp': u'2007-05-25T17:12:09Z',
                         'user': u'username:Gurch'}],
          'title': u'AppliedEthics'},
         {'id': u'10',
          'revisions': [{'timestamp': u'2001-01-21T02:12:21Z',
                         'user': u'username:RoseParks'},
                        {'timestamp': u'2002-02-25T15:43:11Z',
                         'user': u'ip:Conversion script'},
                        {'timestamp': u'2003-04-25T22:18:38Z',
                         'user': u'username:Ams80'},
                        {'timestamp': u'2006-06-03T16:55:41Z',
                         'user': u'username:Nzd'},
                        {'timestamp': u'2006-09-08T04:16:04Z',
                         'user': u'username:Rory096'},
                        {'timestamp': u'2007-05-24T14:41:58Z',
                         'user': u'username:FunnyChar\xe9'},
                        {'timestamp': u'2007-05-25T17:12:12Z',
                         'user': u'username:Gurch'},
                        {'timestamp': u'2005-01-13T17:22:17Z'}],
          'title': u'AccessibleComputing'}]

    page_list = []
    parsewpxml.parsewpxml(XML_FILE, page_handler)
    assert_equal(page_list, expected)


def test_main():

    """Testing the main method involves a fair bit of dependency injection."""

    class UnclosableStringIO(StringIO):

        """This is a StringIO that ignores the close method."""

        def close(self):
            pass

    def _open(name, *args):
        """Return StringIO() buffers instead of real open file handles."""
        if name == 'atoms.tsv':
            return atoms_file
        elif name == 'triplets.tsv':
            return triplets_file
        else:
            raise ValueError

    atoms_file = UnclosableStringIO()
    triplets_file = UnclosableStringIO()
    parsewpxml.main(
        argv=['--atoms=atoms.tsv',
              '--user-timestamp-atom-triplet=triplets.tsv'],
        input=open(XML_FILE),
        _open=_open)

    expected_atoms = """\
8\tAppliedEthics
10\tAccessibleComputing
"""

    expected_triplets = """\
ip:pD950754B.dip.t-dialin.net\t2001-01-20T15:01:12Z\t8
ip:Conversion script\t2002-02-25T15:43:11Z\t8
username:Rory096\t2006-09-08T04:16:46Z\t8
username:FunnyChar\xc3\xa9\t2007-05-24T14:41:48Z\t8
username:Gurch\t2007-05-25T17:12:09Z\t8
username:RoseParks\t2001-01-21T02:12:21Z\t10
ip:Conversion script\t2002-02-25T15:43:11Z\t10
username:Ams80\t2003-04-25T22:18:38Z\t10
username:Nzd\t2006-06-03T16:55:41Z\t10
username:Rory096\t2006-09-08T04:16:04Z\t10
username:FunnyChar\xc3\xa9\t2007-05-24T14:41:58Z\t10
username:Gurch\t2007-05-25T17:12:12Z\t10
"""

    assert_equal(expected_atoms, atoms_file.getvalue())
    assert_equal(expected_triplets, triplets_file.getvalue())
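
The listings above target Python 2. For anyone on Python 3, the buffering idea (collect every chunk passed to "characters" and only use the text once the element ends) can be sketched in isolation like this. TitleCollector is a made-up, stripped-down handler that only gathers page titles; it is not a port of the full parser.

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Collect the text of every <title> element, buffering split chunks."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buf = None

    def startElement(self, name, attrs):
        # Only buffer text for the elements we care about.
        self._buf = [] if name == "title" else None

    def characters(self, content):
        # The parser may deliver one element's text in several chunks.
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == "title" and self._buf is not None:
            self.titles.append("".join(self._buf))
        self._buf = None


SAMPLE = (b"<mediawiki><page><title>AppliedEthics</title><id>8</id></page>"
          b"<page><title>AccessibleComputing</title></page></mediawiki>")
handler = TitleCollector()
xml.sax.parseString(SAMPLE, handler)
print(handler.titles)  # prints ['AppliedEthics', 'AccessibleComputing']
```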

Saturday, January 10, 2009

Graphics: Photosynth

This is probably old hat to most people, but if you haven't seen the Photosynth demo at TED, it's an absolute must-see. It's definitely in the top ten coolest things I saw in 2007. Give it 60 seconds, and you'll see what I mean.

Sunday, January 04, 2009

Vim: jVi

jVi is a plugin for NetBeans that provides Vim-like key bindings. The good news is that it's close enough to be comfortable instead of frustrating. It's better than most Vi emulation modes (including the one in Komodo Edit) and it's way better than the key bindings provided by NetBeans (of course, that's a matter of opinion). The bad news is that certain key features like rectangular select (Cntl-v) and rewrapping block comments (gq}) don't work. So far, those are my two biggest complaints.

First of all, installing the plugin was painless. I downloaded it using my browser, unzipped it, and installed it via the Tools :: Plugins menu item in NetBeans. Easy peasy.

Next, I went down the list of complaints I had about the Vim key bindings in Komodo Edit and tried each of them in jVi. Many things were fixed. Some still didn't work. Here is a list of my discoveries:

Using ":e filename" to open a file doesn't work.

Using Cntl-o to go back to where you were previously works, but using Cntl-i to go forward doesn't work because NetBeans intercepts it.

By default, NetBeans knows that Python indents things by 4 spaces. However, jVi doesn't know this, so by default, it wants to indent things by a tab. I'm surprised that it doesn't make use of the NetBeans settings.

"cw tab tab tab" inserts three things into the undo list instead of one. Of course, this is just pedantic.

Strangely, using % to jump between ( and ) inside comments doesn't work, but it does work if you're not in a comment. No biggie.

To fix the indentation settings, I instinctively typed "set sts=4 sw=4 et ai", which means "set the soft tab stops to 4, shift width to 4, expand tabs, and auto indent." jVi said that "sts" and "ai" are not implemented. It doesn't matter because it does the right thing anyway.

The line in column 80 still works, which puts it ahead of Vim ;)

Things like code autocomplete and tips still work.

You can rewrap paragraphs using "gq". However, this doesn't work for block comments because the "#" at the beginning of each line gets messed up. Thankfully, I tend to have more docstrings than block comments.

Cntl-n (autocomplete the symbol being typed) works, which was a bit of a surprise.

Anyway, as I said earlier, it's not perfect, but it's close enough to make me happier than the default key bindings in NetBeans.

Saturday, January 03, 2009

IDE: NetBeans

After trying out Komodo Edit, I decided to give NetBeans a whirl. Here's the summary: NetBeans is a pleasant-to-use, reasonably well-polished IDE that mysteriously seems to be missing certain key features that even Komodo Edit has. If I were to put my finger on it, I'd say that NetBeans is better at being an IDE (doing things such as code completion, code tips, etc.), but has a worse editor (for instance, it lacks a rectangle selection mode and it has no option to rewrap a multi-line comment block).
From the Web Site
Here are some high-level bits from the web site along with some of my own comments:
In addition to full support of all Java platforms (Java SE, Java EE, Java ME, and JavaFX), the NetBeans IDE 6.5 is the ideal tool for software development with PHP, Ajax and JavaScript, Groovy and Grails, Ruby and Ruby on Rails, and C/C++.

Discover the joys of Python programming with the NetBeans IDE for Python Early Access. Enjoy great editor features such as code completion, semantic highlighting, and more. The EA release also includes a community developed Python debugger and offers a choice of the Python and Jython runtimes.
Python is only supported in the early access release. I expect Python support to improve over time.
The NetBeans editor for Python supports Smart Indent, Outdent, and Pair matching, additional to syntactic and semantic highlighting, code folding, instant rename refactoring, mark occurrences, finding undefined names, and Quick Fixes. Code completion is available for local function and variable names as well as Python keywords. The editor also assists you by inserting and fixing import statements.
All that stuff seems to work. I opened a file. It gave me a PyLint-like warning that said, "The first argument to a method should be self or cls." I was using klass. I right clicked on klass and said rename. It renamed all the occurrences. Easy.
With the NetBeans IDE for PHP, you get the best of both worlds: the productivity of an IDE (code completion, real-time error checking, debugging and more) with the speed and simplicity of your favorite text editor in a less than 30mb download.
The IDE stuff works well. However, it definitely can't touch the speed of my favorite text editor ;) The fact that the download was only 25MB (109MB uncompressed) was indeed quite impressive in comparison to Eclipse.
The PHP Editor in NetBeans IDE 6.5 supports all standard features such as code completion, syntax highlighting, mark occurrences, refactoring, code templates, documentation pop-up, code navigation, editor warnings and task list.
The documentation pop-ups are amazing. The documentation for JavaScript even includes browser compatibility notes, and the documentation for HTML is straight from the DTD. Furthermore, the code completion isn't pre-baked as it is in Komodo Edit. If you register a new JavaScript library, it can do code completion on that too.
The NetBeans IDE has the JavaScript tools you need: an intelligent JavaScript editor, CSS/HTML code completion, the ability to debug JavaScript in Firefox and IE, and bundled popular JavaScript libraries. Your favorite JavaScript framework will get you 80% of the way, NetBeans IDE will help you with that last 20%.
Yep, all that stuff seems to work. It was even able to do code completion on CSS content when I was in a PHP file. You probably shouldn't put CSS blocks in PHP files in general, but the fact that it could still parse it and do code completion is impressive.
The Good Parts
First of all, let me say that NetBeans is stable. It hasn't crashed on me yet. It's also pleasantly attractive. I like the rounded corners, fonts, and icons. I don't feel overwhelmed by NetBeans like I do by Eclipse. Also, the training videos were very well done.
The Bad Parts
If I have a multi-line Python comment, there's no way to rewrap the lines. In Emacs, this is M-q. In Vim, it's gq}. I generally consider that a must-have editor feature.
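
To make the missing feature concrete, here's a minimal sketch of what "rewrap a comment block" means, using Python's standard textwrap module. The function name rewrap_comment and the sample text are made up for illustration. It also takes shortcuts a real editor wouldn't: it drops any leading indentation and merges blank comment lines into one paragraph.

```python
import textwrap


def rewrap_comment(block, width=79, prefix="# "):
    """Rewrap a block of '#' comment lines to fit within `width` columns."""
    # Strip the comment marker from each line and join the words back together.
    words = " ".join(
        line.lstrip().lstrip("#").strip() for line in block.splitlines()
    )
    # Let textwrap refill the text, putting the marker back on every line.
    return textwrap.fill(
        words, width=width, initial_indent=prefix, subsequent_indent=prefix
    )


ragged = (
    "# This comment was edited a few times\n"
    "# and now\n"
    "# its lines are very uneven, which is when I reach for M-q or gq}.\n"
)
print(rewrap_comment(ragged, width=40))
```

Every output line starts with "# " and stays within the requested width, which is roughly what M-q and gq} give you in one keystroke.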

It doesn't seem to support rectangular selections or column editing (i.e. Ctrl-v in Vim). I always say that that's one of the features that separates the really good editors from the mediocre ones (Vim, Emacs, Nedit, and Komodo Edit all have it). It's such a useful feature. When I searched for "column" in the documentation, one of the search results was how to add a column to a database. That really underscores my point that NetBeans is a great IDE, but not necessarily the best editor.

There are no Vim or Emacs key bindings. That's sort of a bummer. To be fair, no one ever gets Vim key bindings perfectly right anyway. Nonetheless, the IDE is decidedly mouse heavy. Many common editor tasks that deserve a key binding don't have one.

Opening up files is a bit painful. I really missed tab completion like in Vim and Emacs. What's worse is that Apple-o isn't the key binding for opening up a file. In general there are a lot of places where NetBeans follows the standard OS X conventions such as Apple-x to cut, but there are some strange places where it deviates from those conventions.

It doesn't support editing a file remotely using scp.

I don't see a way to tell it to run "make test". Komodo Edit let me run "make test" and then click on file names if there were errors in the output.
Other Commentary
I created a new project from existing sources. On the downside, this created a new folder in my project. On the upside, that folder was only 8k worth of data. I definitely don't feel like I have to make everyone on my team switch to NetBeans before I can start using it. It says that it can synchronize with Eclipse projects, but I don't actually need that feature. During the project creation wizard, it expected my source files and my test files to live in different trees. That was sort of weird. I just told it to use the same directory for both.

It says it can do JavaScript debugging in Firefox and IE, but I didn't get a chance to try that out.

It supports code snippets, if you're into that sort of thing.

For PHP, it can do code completion of symbols in other files. It also shows you the documentation that you wrote. I tested it, and it worked for Python too.

For PHP, it will warn you of uninitialized variables.

It does have version control support built in. The graphical diff utility was very nice. Committing code changes was painless.

As you would expect, it does provide a line at column 80. As I mentioned before, that was a pain for me in Vim.

The code folding support works very well. Functions are automatically recognized as something you can fold. Better yet, it understands HTML well enough to fold blocks of HTML.

When I opened a Python file, there was a widget that showed me an outline of the file, including all the classes and methods. It worked pretty well in PHP too.

It said that "from __future__ import with_statement" was an unused import. I think that's a sign that its Python support is still pretty young.
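
Here's why that warning is a false positive: a __future__ import is an instruction to the compiler, so it does its job even though the imported name is never referenced. The sketch below uses only the standard __future__ module and the compile() builtin. The source string is made up for illustration and is compiled but never run.

```python
import __future__

feature = __future__.with_statement

# Release in which "from __future__ import with_statement" became available.
print(feature.getOptionalRelease())
# Release in which the with statement was switched on for everyone.
print(feature.getMandatoryRelease())

# The import changes how the module is compiled, so it matters even though
# the name "with_statement" is never used afterwards.
source = (
    "from __future__ import with_statement\n"
    "with open('some_file') as f:\n"
    "    pass\n"
)
code = compile(source, "<example>", "exec")
print(code.co_filename)
```

On Python 2.5, the with block would not compile without that first line. On later versions the import is accepted and does nothing.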

There's a margin to the left of the code that shows me what lines I've changed. It's like the diff is built into the editing experience.

In general, the code completion is pretty good. However, it got confused when I typed "from a import b; b.something". I think it didn't understand that I was importing an entire submodule.
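
For reference, this is the pattern that seemed to trip it up. The sketch uses the standard library's json package as a stand-in for my own package "a"; that choice of package is just an assumption for the example.

```python
import inspect

# "decoder" is a whole submodule of the json package, not a function or class.
from json import decoder

print(inspect.ismodule(decoder))  # the imported name is a module object
print(decoder.__name__)           # its fully qualified name

# Attribute access on the submodule is what the completion has to resolve.
print(decoder.JSONDecoder().decode('{"answer": 42}'))
```

To complete "decoder." correctly, the IDE has to notice that the imported name is itself a module and then look inside it.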

The code completion for "self." was pretty helpful. It even showed me my own docs.

If I type "from a import ", it shows me a list of things I can import. That worked pretty well. However, when I picked something, it included the argument parameters like "from a import b(c)", which was sort of a weird bug.

When I'm calling a function, it tells me what arguments that function accepts. That works for builtin functions or functions in the current file, but it doesn't seem to work for functions in my other modules.

It recognizes JavaScript syntax errors, even if I'm in an HTML file. It knows that 10p is not a valid value in CSS for the border field, however 10px is.

When typing HTML, it's very helpful about adding closing tags, indenting within tags, etc. If I put my mouse on a tag, it shows me the closing tag.

The "Find Within Project" feature (i.e. project-wide grep) was powerful and friendly.

There was a menu item called "Insert a Method". It just dumped a code snippet at my current cursor location without even bothering to indent it properly. However, the "Insert a Property" menu item was truly helpful. It knew the correct idiom for creating properties (the one where you use locals() and **).
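
In case you haven't seen the locals() and ** idiom, here's a minimal sketch of it. I can't promise this matches the snippet NetBeans inserts; the class Person and the attribute _name are made up for illustration.

```python
class Person(object):

    def __init__(self, name):
        self._name = name

    def name():
        doc = "The person's name."

        def fget(self):
            return self._name

        def fset(self, value):
            self._name = value

        def fdel(self):
            del self._name

        # locals() is {"doc": ..., "fget": ..., "fset": ..., "fdel": ...},
        # which lines up with property()'s keyword arguments.
        return locals()

    # Call the helper once and replace it with the property it describes.
    name = property(**name())


p = Person("Alice")
p.name = "Bob"
print(p.name)
```

The trick is that the helper's local variable names are chosen to match the keyword arguments of property(), so the getter, setter, deleter, and docstring stay grouped in one block.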

There are a ton of things in the Source menu that don't seem to work yet or at least don't work as I would expect them to.

The Python console is exactly as you might expect. I wonder when the IDEs are going to discover that IPython rules. Seriously, it doesn't matter that the shell is integrated into the IDE. If it isn't IPython, I'm not going to use it. By the way, I hit Ctrl-d in the shell, and it stopped responding ;)

If you open up a CSS file, you can use the CSS Builder and CSS Preview widgets. The CSS Builder is basically a point-and-click GUI for creating CSS. It's helpful, but not overly intelligent. For instance, I wanted the "margin" to "All" be 1px. It added four separate lines for margin-top, margin-bottom, etc. The CSS Preview widget is indeed helpful, although I'm not sure how well it will work as soon as you start getting multiple CSS files in the mix. Thankfully, Firebug helps out a lot for this problem.

When I closed the CSS file, the CSS Builder and CSS Preview widgets didn't go away as I would have expected them to. This probably suggests something deep about Eclipse's support for "Perspectives" which are sets of widgets useful for the task at hand. (There's a perspective for Python and there's a perspective for HTML and CSS.) Of course, perspectives are one of the things that make Eclipse feel overwhelming to me.

You can "undock" an editor window to put it into a new top-level window. I know Emacs fans are proud of this feature in Emacs.

It's possible to split the editor window to edit multiple files side-by-side. However, I must admit that I couldn't figure out how until I looked in the documentation.

I typed "getenv" and told NetBeans to add the import line. It did it correctly. It even added it to the correct block of imports, but that might have been by coincidence. Personally, I think this feature is overrated. Perhaps it's more critical in Java.

It told me that I was using an undefined variable when I used a global that was defined in another function. It doesn't know that in Python, you only have to use the global keyword if you want to rebind a global. It's not necessary if you merely want to read the value of a global.
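
Here's a small sketch of the rule, with made-up names (counter, setup, read_counter). One function creates the global and therefore needs the keyword; the other only reads it and needs no declaration.

```python
def setup():
    # Binding a module-level name from inside a function requires "global".
    global counter
    counter = 10


def read_counter():
    # Merely reading a global needs no declaration at all.
    return counter + 1


setup()
print(read_counter())  # prints 11
```

A checker that flags counter in read_counter as undefined is being stricter than the language itself.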

In PHP I defined a function called f and then tried to call it. It was not able to autocomplete the name of the function. However, it was able to autocomplete on PHP builtin functions and to give me their parameters.

When I'm typing a multi-line comment in PHP (using "//") and hit enter, it adds "//" at the beginning of the next line. However, it doesn't do that for "#" in Python.

If I remove a colon from the end of a for loop or the end of a def, it doesn't complain. That's a bit of a bummer since that's by far my most common syntax error.
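
Python itself catches this at compile time, before any code runs, so an editor could in principle surface it as you type. Here's a rough sketch using the compile() builtin. The helper name syntax_error_in and the sample snippets are made up, and this says nothing about how NetBeans actually parses Python.

```python
def syntax_error_in(source):
    """Return the SyntaxError raised when compiling `source`, or None."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError as error:
        return error
    return None


good = "for i in range(3):\n    print(i)\n"
bad_loop = "for i in range(3)\n    print(i)\n"
bad_def = "def f()\n    return 1\n"

print(syntax_error_in(good))                 # nothing wrong with this one
for snippet in (bad_loop, bad_def):
    error = syntax_error_in(snippet)
    print(error.lineno, error.msg)           # where the compiler gave up, and why
```

The exact message varies between Python versions, which is why the sketch prints it instead of matching on it.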

Installing the PHP plugin was so easy. There was a list of plugins. I picked one and installed it. To be fair, I think this is more complex in Eclipse because Eclipse lets you install plugins from all over the web. In contrast, there are only about 100 different plugins for NetBeans. I'm guessing there are far more for Eclipse. I liked the fact that I could filter the plugins by the term I was looking for (in this case PHP).
Conclusion
I think I'll continue using NetBeans for a while. It's frustrating that it lacks advanced editor features, but that's not as heinous as the stability issues that Komodo Edit seems to suffer from. That's too bad, because Komodo Edit does a lot of things really right.

Thursday, January 01, 2009

C++: Counting Function Calls

How many function calls are involved in executing this piece of C++ (from a Qt project):
/**
 * Given a QString, safely escape it properly for sh. For example, given
 * $`"\a\" return \$\`\"\a\\".
 */
QString
ConfIO::writeString(const QString s)
{
    QString ret;

    for (int i = 0; i < s.length(); i++)
    {
        QChar c = s[i];

        if (c == '$' || c == '`' || c == '"' || c == '\\')
            ret += '\\';
        ret += c;
    }

    return ret;
}
If you don't count any function calls made by .length(), etc., I've counted 21 so far!
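
If the C++ is hard to read, here's a minimal Python sketch of the same loop: it puts a backslash in front of $, `, " and \ and copies everything else through unchanged. The name escape_for_sh is made up, and this is only an illustration of the logic, not code from the Qt project.

```python
SPECIAL = '$`"\\'


def escape_for_sh(s):
    """Backslash-escape the four characters that the C++ loop checks for."""
    out = []
    for c in s:
        if c in SPECIAL:
            out.append("\\")
        out.append(c)
    return "".join(out)


print(escape_for_sh('echo "$HOME"'))  # prints: echo \"\$HOME\"
```

Each step in the Python version is visibly a call. In the C++ version the same work is done by overloaded operators and implicit conversions, which is where the surprising count comes from.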

Python: Builds of PyWebkitGtk and Webkit-Glib-Gtk

I saw this on python-announce, and all I can say is "What the heck?" I think this means you can write a Python application and have it compile down to an Ajax application or a desktop application, but I could be wrong:
webkit-glib-gtk provides gobject bindings to webkit's DOM model. pywebkitgtk provides python bindings to the gobject bindings of webkit's DOM model.

files are available for download at: https://sourceforge.net/project/showfiles.php?group_id=236659&package_id=290457&release_id=650548

separate pre-built .debs for AMD64 and i386 Debian are included, for pywebkitgtk and webkit-gtk with gobject bindings to the DOM model. if you have seen OLPC/SUGAR's "hulahop", or if you have used Gecko / XUL DOM bindings, or KDE's KHTMLPart DOM bindings, you will appreciate the value of webkit-glib-gtk. pywebkitgtk with glib/gobject bindings basically brings pywebkitgtk on a par with hulahop.

if you find the thought of pywebkitgtk with glib bindings, and/or hulahop to be "all too much", then do consider looking at pyjd (the other download from the same location, above). pyjd - aka pyjamas-desktop - is a level "above" pywebkitgtk-glib, and is on a par with pykde, pyqt4, pygtk2, python-wxWidgets and other desktop-based widget sets. (side-note: the advantage of pyjd is that if you write an app which conforms to the pyjamas UI widget set API, you can compile the same python app source code to javascript and run it directly in all major web browsers: see http://pyjs.org, which is a python-to-javascript compiler).

code-stability-wise, pywebkitgtk and webkit-glib-gtk should be considered "experimental" (not least because this is a release from a svn build!). that having been said, pyjamas-desktop is declared "production" because pywebkitgtk with DOM bindings, thanks to webkit-glib-gtk, provides absolutely everything that pyjamas-desktop needs (and if webkit-glib-gtk becomes a moving target, the DOM.py abstraction layer in pyjamas-desktop will take care of it. if it becomes a _severe_ moving target, pyjamas-desktop will drop webkit and provide a python-hulahop / XUL-Gecko port instead. or as well. whatevrrrr :).

gobject-interface-wise, the webkit gobject DOM bindings that have been added _can_ be considered to be "stable", as long as the underlying webkit library IDL files are "stable" (additions to Console.idl were made in the past couple of months, for example, and HTML5 is making advances as well). that having been said, _some_ functionality proved intransigent during the initial main development phase of the webkit gobject DOM bindings, such as RGBColour conversion of CSS Style Properties, and so were *temporarily* left out. given that pyjamas-desktop is considered "production", that should give a pretty clear indication of the importance of those rare bits of DOM model bindings features that were left out. SVG Canvas bindings, however, have NOT been included, as that would have added a further 120 gobjects to the list.

instructions for anyone brave enough to install webkit-glib-gtk from source, themselves, on macosx: http://github.com/lkcl/webkit/wikis/installing-webkit-glib-on-macosx there is an (experimental) Portfile in the macosx 10.4 glib tarball, as well.

please note that the MacOSX build is NOT a "native" webkit build: it is a GTK / X11 build (known as a "gtk port", in webkit developer terminology). the reason for providing the MacOSX webkit-glib-gtk build, along with a MacOSX port of pywebkitgtk is because the "native" webkit build - which includes ObjectiveC bindings and thus can automatically get python bindings - has very subtly different functionality. whilst the native ObjectiveC bindings are more fully compliant with the W3C standards, providing javascript-like functionality where absolutely necessary, the webkit-glib-gtk build's gobject bindings are going specifically for direct correspondence with the functionality provided by the webkit javascript bindings, falling back to alternatives where it is absolutely not possible to achieve that goal.

the actual differences, however, are extremely small, percentage-wise. out of around 300 objects, providing around 1,500 functions, and tens of thousands of properties, there are approximately 20 functions that are different, and only four properties that are different.

examples of the differences in the bindings APIs offered by ObjectiveC and webkit-glib-gtk Gobject bindings include:

* the provision of the function "toString", which is known as a javascriptism that is not in the W3C standard. _not_ providing this function, which is a de-facto standard, is considered to be inconvenient, especially as both Gecko's language bindings _and_ PyKDE's PyKHTMLPart bindings provide toString() functions. The ObjectiveC bindings, in sticking to the W3C standard, religiously, do not offer "toString". the reason for including toString in the webkit-glib-gtk bindings should be fairly obvious: it is unreasonable to expect developers who will be used to the de-facto existence of toString in javascript to find that it's ... disappeared for no good reason, thus forcing them to make unnecessary coding workarounds, duplicating the exact same functionality that *already* exists in the webkit library!

* hspace and vspace of HTMLAppletElement, and width and height of HTMLEmbedElement, are often (mistakenly) set to "NNNpx", "100%" and other values, in javascript, contrary to the W3C standards for HTMLAppletElement and HTMLEmbedElement, respectively. to make life easier for webkit applications (such as Safari, the iPhone browser and other important webkit applications), an exception was made to allow - and cater for or ignore, as appropriate, values in these non-standard formats. Whilst the ObjectiveC bindings stick to the W3C standards, and only allow hspace, vspace, width and height to be set to integer values, the webkit-glib-gtk bindings take advantage of the underlying webkit functions that perform the conversion, and are thus more tolerant - with the proviso of course that it's perfectly possible for users to shoot themselves in the foot by trying to set vspace="10em". such foot-shooting will be silently ignored - just as it is if you tried to do the same thing with javascript.

* XMLHTTPRequest.send accepts a DOMString on the webkit-glib-gtk bindings, whereas what should actually be passed in is a Webkit Document object. various attempts were made to create appropriate TextDocument and XMLDocument objects: unfortunately they failed miserably. fortunately, earlier versions of Webkit provided a version of XMLHTTPRequest.send which accepts a DOMString argument, and this version was reactivated for the webkit-glib-gtk bindings. the ObjectiveC and all other bindings successfully pass in a Webkit Document object. this issue will at some point need to be addressed, however it's pretty low priority: using a DOMString works just as well.

* Document.getSelection is considered to be a javascript-ism, and is not made available to the ObjectiveC bindings. the function has been added to the webkit-glib-gtk bindings just in case anyone feels like using it.

anyone wishing to use the glib/gobject DOM model directly, in c, is well advised to look at the example modified WebKitTools/GtkLauncher/main.c which can be found here: http://lkcl.net/pyjamas-desktop/main.c this modified example is not "gobject-perfect" - there are a couple of areas where an experienced gobject programmer will spot ref-count losses that have yet to be addressed, however the code does some really quite sophisticated messing-about of the DOM model, and provides genuinely useful code snippets. developers may be intrigued to know that some of the code-snippets, such as get_absolute_top(), are direct ports from pyjamas-desktop of the DOM.py getAbsoluteTop() function, which was in turn itself a direct port from the javascript code inside pyjamas DOM.py of the same function name. the technique, and the examples, will help other developers wishing to write applications, by first writing or sourcing an example written in javascript, and then following the same conversion techniques as can be seen by comparing DOM.py getAbsoluteTop() with the example main.c get_absolute_top().

anyone wishing to provide bindings to other languages, such as ruby, perl or java: the pygtk-codegen-2.0 application pretty much made mincemeat of webkit.defs (available on request, or look at code.google.com/p/pywebkitgtk issue #13 - i may update the patch soon enough) and absolutely _no_ funny business - overrides of _any_ kind - were required! the only "funny business" that's in pywebkitgtk overrides is to do with gtk, not the webkit gobject bindings. 300 objects, 1500 functions and tens of thousands of properties all get added with a vanilla .defs file. unbelievable. so this spells "good news" for the garbage-collecting languages (e.g. ruby, perl, possibly java): if your language-of-choice's gobject-auto-generator is as good as python-gobject's auto-generator, you should be up-and-running within literally a couple of hours. oh - but first: i would advise you to look at pywebkitgtk's "demobrowser.py" for guidance on how to create a webkit gtk app (using your language of choice) first, followed by looking at pyjamas-desktop's "pyjd.py" for further hints on how to bind to the DOM model functions [pyjd.py is based on demobrowser.py].

c++ is a different matter. webkitgtkmm will _not_ be gaining DOM bindings based on webkit.defs. after discussions with jonathon jongsma, we came to the conclusion that it would be far better to write a _separate_ set of bindings (gobjectmm) actually in webkit, due to subtle information being available that is lost by the time you get to webkit-gobject c-bindings. anyone anticipating to write or have webkitgtkmm "up-and-running", providing gtk / gobject bindings to webkit's DOM model, should expect to take between three and four weeks in writing a CodeGeneratorGobjectMM.pl, using the other WebKit CodeGenerators as guides.

that's all, for now. bugs should be reported to the respective bugtrackers of the appropriate projects - http://code.google.com/p/pyjamas, http://code.google.com/p/pywebkitgtk and http://bugs.webkit.org should do the trick.