The UNIX "sort" command is really quite amazing. It's fast and it can deal with a lot of data with very little memory. Throw in the "-u" flag to make the results unique, and you have quite a useful utility. In fact, you'd be surprised at how you can use it.
Suppose you have a bunch of pairs:
a b
b c
a c
a c
b d
...
You want to figure out which atoms (i.e. items) are related to which other atoms. This is easy to do with a dict of sets:
referrers[left].add(right)
referrers[right].add(left)
Notice, I used a set because I only want to know if two things are related, not how many times they are related.
My situation is strange. It's small enough so that I don't need to use a cluster. However, it's too big for such a dict to fit into memory. It's not too big for the data to fit in /tmp.
The question is, how do you get this sort of a hash to run from disk? Berkeley DB is one option. You could probably also use Lucene. Another option is to simply use sort.
If you open up a two-way pipe to the sort command, you can output all the pairs, and then later read them back in:
a b
a c
b c
b d
...
sort is telling me that a is related to b and c, b is related to c and d, etc. Notice, it also removed the duplicate pair a c, and took care of the temp file handling. Best of all, you can stream data to and from the sort command. When you're dealing with a lot of data, you want to stream things as much as possible.
Now that I've shown you the general idea, let me give you a couple more hints. First of all, to shell out to sort, I use:
from subprocess import Popen, PIPE
pipe = Popen(['sort', '-u'], bufsize=1, stdin=PIPE, stdout=PIPE)
I like to use the csv module when working with tab-separated data, so I create a reader and writer for pipe.stdout and pipe.stdin respectively. You may not need to in your situation.
When you're done writing to sort, you need to tell it you're done:
pipe.stdin.close() # Tell sort we're ready.
Now here's the next trick. I don't want the rest of the program to worry about the details of piping out to sort. The rest of the program should have a nice clean iterator to work with. Remember, I'm streaming, and the part of the code that's reading the data from the pipe is far away.
Hence, instead of passing it a reference to the pipe, I instead send it a reference to a generator. That way the generator can do all the munging necessary, and no one even needs to know that I'm using a pipe.
The last trick is that when I read:
a b
a c
I need to recognize that b and c both belong to a. Hence, I use a generator I wrote called
groupbysorted.
Putting it all together, the generator looks like:
def first((a, b)): return a
def second((a, b)): return b
def get_references():
"""This is a generator that munges the results from sort -u.
When the streaming is done, make sure sort exited cleanly.
"""
for (x, pairs) in groupbysorted(reader, keyfunc=first):
yield (x, map(second, pairs))
status = pipe.wait()
if status != 0:
raise RuntimeError("sort exited with status %s: %s" %
(status, pipe.stderr.read()))
Now, the outside world has a nice clean iterator to work with that will generate things like:
(a, [b, c])
(b, [c, d])
...
The pipe will get cleaned up as soon as the iterator is done.