Tuesday, August 26, 2008

Linux: Trac and Subversion on Ubuntu with Nginx and SSL

I just set up Trac and Subversion on Ubuntu. I decided to proxy tracd behind Nginx so that I could use SSL, and I used ssh to access svn. I got email notifications and commit hooks working for everything, and I used runit to run tracd. In all, it took me about four days. Here's a brain dump of my notes:
Set up Trac and Subversion:
Setup runit:
touch /etc/inittab # Latest Ubuntu uses "upstart" instead of the sysv init.
apt-get install runit
initctl start runsvdir
initctl status runsvdir
While still on oldserver, I took care of some Trac setup:
Setup permissions:
See: http://trac.edgewall.org/wiki/TracPermissions
trac-admin:
permission list
permission remove anonymous '*'
permission remove authenticated '*'
permission add authenticated BROWSER_VIEW CHANGESET_VIEW FILE_VIEW LOG_VIEW MILESTONE_VIEW REPORT_SQL_VIEW REPORT_VIEW ROADMAP_VIEW SEARCH_VIEW TICKET_CREATE TICKET_MODIFY TICKET_VIEW TIMELINE_VIEW WIKI_CREATE WIKI_MODIFY WIKI_VIEW
Note: The above matches the default, but with no anonymous access.
permission add jj TRAC_ADMIN
Went through the admin section in the GUI and set everything up.
Fixed inconsistent version field ("" vs. None):
sqlite3 db/trac.db:
update ticket set version = null;
apt-get install subversion-tools python-subversion
apt-get install python-pysqlite2
easy_install docutils:
/usr/bin/rst2newlatex.py
/usr/bin/rst2xml.py
/usr/bin/rstpep2html.py
/usr/bin/rst2s5.py
/usr/bin/rst2latex.py
/usr/bin/rst2pseudoxml.py
/usr/bin/rst2html.py
easy_install pygments:
/usr/bin/pygmentize
easy_install pytz
Setup users:
Used "adduser" to create users.
Grabbed their passwords from /etc/shadow on oldserver.
addgroup committers
Added the users to the committers group.
Setup svn:
mkdir -p /var/lib/svn
svnadmin create /var/lib/svn/example
Copied our svn repository db from oldserver to /var/lib/svn/example/db.
chgrp -R committers /var/lib/svn/example/db
Setup trac:
easy_install Trac:
/usr/bin/trac-admin
/usr/bin/tracd
+Genshi-0.5.1-py2.5-linux-i686.egg
mkdir -p /var/lib/trac
cd /var/lib/trac
trac-admin example initenv:
I pointed it at the svn repo path, but otherwise used the default
settings.
Copied stuff from our trac instance on oldserver to
/var/lib/trac/example/attachments and /var/lib/trac/example/db.
I chose not to keep our trac.ini since Trac has changed so much.
I chose not to keep our passwords file since they were too easy.
htpasswd -c /var/lib/trac/example/conf/users.htpasswd jj
Edited /var/lib/trac/example/conf/trac.ini.
adduser trac # Used a throwaway password.
vipw # Changed home to /var/lib/trac and set shell to /bin/false.
chown -R trac:trac /var/lib/trac # Per the instructions. Weird.
find /var/lib/trac/example/attachments -type d -exec chmod 755 '{}' \;
find /var/lib/trac/example/attachments -type f -exec chmod 644 '{}' \;
trac-admin /var/lib/trac/example resync
Setup trac under runit:
Setup logging:
mkdir -p /etc/sv/trac/log
mkdir -p /var/log/trac

cat > /etc/sv/trac/log/run << __END__
#!/bin/sh

exec 2>&1
exec chpst -u trac:trac svlogd -tt /var/log/trac
__END__

chmod +x /etc/sv/trac/log/run
chown -R trac:trac /var/log/trac
Setup trac:

cat > /etc/sv/trac/run << __END__
#!/bin/sh

exec 2>&1
exec chpst -u trac:trac tracd -s --hostname=localhost --port 9115 --basic-auth='*',/var/lib/trac/example/conf/users.htpasswd,'24 Hr. Diner' /var/lib/trac/example
__END__

chmod +x /etc/sv/trac/run
ln -s /etc/sv/trac /etc/service/
Setup Nginx to proxy to Trac and handle SSL:
cd /etc/nginx
openssl req -new -x509 -nodes -out development.example.com.crt \
    -keyout development.example.com.key
Edit sites-available/default.
/etc/init.d/nginx restart
Setup post-commit hook:
cd /var/lib/svn/example/hooks
wget http://trac.edgewall.org/browser/trunk/contrib/trac-post-commit-hook?format=txt \
    -O trac-post-commit-hook
chmod +x trac-post-commit-hook
cp post-commit.tmpl post-commit
chmod +x post-commit
Edited post-commit.
mkdir /var/lib/trac/example/.egg-cache
chown -R trac:committers \
    /var/lib/trac/example/.egg-cache \
    /var/lib/trac/example/db
chmod 775 /var/lib/trac/example/.egg-cache \
    /var/lib/trac/example/db
chmod 664 /var/lib/trac/example/db/trac.db
Setup trac notifications:
Edit /var/lib/trac/example/conf/trac.ini.
sv restart trac
Here's the most important part of Nginx's sites-available/default:
# Put Trac on HTTPS on port 9443.
server {
    listen 9443;
    server_name development.example.com;

    access_log /var/log/nginx/development.access.log;
    error_log /var/log/nginx/development.error.log;

    ssl on;
    ssl_certificate /etc/nginx/development.example.com.crt;
    ssl_certificate_key /etc/nginx/development.example.com.key;

    ssl_session_timeout 5m;

    ssl_protocols SSLv2 SSLv3 TLSv1;
    ssl_ciphers ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP;
    ssl_prefer_server_ciphers on;

    location / {
        root html;
        index index.html index.htm;
        proxy_pass http://127.0.0.1:9115;
    }
}
Here's the most important part of svn's post-commit hook:
REPOS="$1"
REV="$2"
MAILING_LIST="commits@example.com"
TRAC_ENV="/var/lib/trac/example"

/usr/share/subversion/hook-scripts/commit-email.pl "$REPOS" "$REV" \
    "$MAILING_LIST"
/usr/bin/python /var/lib/svn/example/hooks/trac-post-commit-hook \
    -p "$TRAC_ENV" -r "$REV"
Here are the changes I made to trac.ini:
===================================================================
--- var/lib/trac/example/conf/trac.ini (revision 464)
+++ var/lib/trac/example/conf/trac.ini (revision 475)
@@ -58,13 +58,13 @@
mime_encoding = base64
smtp_always_bcc =
smtp_always_cc =
-smtp_default_domain =
-smtp_enabled = false
-smtp_from = trac@localhost
+smtp_default_domain = example.com
+smtp_enabled = true
+smtp_from = trac@development.example.com
smtp_from_name =
smtp_password =
smtp_port = 25
-smtp_replyto = trac@localhost
+smtp_replyto = ops@example.com
smtp_server = localhost
smtp_subject_prefix = __default__
smtp_user =
@@ -152,7 +152,7 @@
authz_file =
authz_module_name =
auto_reload = False
-base_url =
+base_url = https://development.example.com:9443
check_auth_ip = true
database = sqlite:db/trac.db
default_charset = iso-8859-15
@@ -166,7 +166,7 @@
repository_type = svn
show_email_addresses = false
timeout = 20
-use_base_url_for_redirect = False
+use_base_url_for_redirect = True

[wiki]
ignore_missing_pages = false
Wow, that was painful!

Monday, August 25, 2008

Books: Basics of Compiler Design

I started reading Basics of Compiler Design. I think, perhaps, it might have helped if I had actually taken the course rather than simply trying to read the book.

Here's a simple rule of thumb:
Never use three pages of complicated mathematics to explain that which can be explained using either a simple picture or a short snippet of pseudo code.
The section on "Converting an NFA to a DFA" nearly brought me to tears. After a couple of hours, I finally understood it. However, even after I understood it, I knew I could do a better job teaching it. A little bit of Scheme written by the SICP guys would have been infinitely clearer.

I hate to be harsh, but it seemed like the author was just having a good time playing with TeX. I picked this book because it was short and didn't dive into code too much. What I found is that it uses math instead of code. I'd prefer code.

The worst part of reading this book by myself is that even if I make it to the end, I won't know if I truly mastered the material because I won't have a compiler to show for my work. After all, there's no one around to grade my written assignments, and the book doesn't actually take you all the way through writing a real compiler.

Thursday, August 21, 2008

Humor: I've Been Simpsonized!

Thanks to Dean Fraser (jericho at telusplanet dot net) at Springfield Punx for the artwork.

Books: The Art of UNIX Programming

I just finished reading The Art of UNIX Programming. In short, I liked it a lot.

Here are a few fun quotes:
Controlling complexity is the essence of computer programming -- Brian Kernighan [p. 14]
Software design and implementation should be a joyous art, a kind of high-level play...To do Unix philosophy right, you need to have (or recover) that attitude. [p. 27]
Microsoft actually admitted publicly that NT security is impossible in March 2003. [p. 69, Unfortunately, the URL he provided no longer works.]
One good test for whether an API is well designed is this one: if you try to write a description of it in purely human language (with no source-code extracts allowed), does it make sense? It is a very good idea to get into the habit of writing informal descriptions of your APIs before you code them. [p. 85, this is a good explanation for why I write docstrings before I write code.]
C++ is anticompact--the language's designer has admitted that he doesn't expect any one programmer to ever understand it all. [p. 89]
One thing Raymond does very well is document things that the rest of us implicitly assume. For instance, he described the various cultures revolving around UNIX. Now I know why I'm so mixed up! I sympathize with several different cultures such as:
  • Old-school UNIX hackers
  • The Open Source movement
  • The Free Software movement
  • BSD hackers
  • MIT Lisp hackers
  • The IETF
My copy of the book is from 2004, and as timeless as this book is, I still wish I could get a "post-modern" opinion on several topics. For instance:
  • Linux is so commonplace these days, what should we do now that everyone takes it for granted?
  • OS X has really won the hearts of a lot of developers. Is there any hope that the rest of the world will move closer to the Free Software ideal? (Please see my post A Hybrid World of Open and Closed Source Software.)
  • I'd love to get his take on Eclipse, TextMate, and modern-day Emacs and Vim.
  • I'd also love to get his opinions on Ruby and Rails.
In general, I think it's a fair critique that there weren't enough critiques of Unix. He mostly saved them until the last chapter. I would have enjoyed more critiques throughout. As much as I love Unix, one of my favorite books is The UNIX-HATERS Handbook.

Similarly, all of his discussion on Emacs vs. Vi seemed a bit biased. I know it's hard not to be biased on this topic, but I was a bit frustrated when he called all of Emacs' complexity "optional complexity" and all of Vi's complexity "accidental and ad-hoc complexity." Because of his statements, I even gave Emacs another shot. However, as usual, I was reminded that in theory Emacs is my favorite editor, but in practice I'm a Vim user.

Nonetheless, I do have high praise for this book. When I was totally burnt out and couldn't code for two months, I found this book refreshing and relaxing. I owe Raymond my thanks :)

Tuesday, August 19, 2008

Python: the csv module and mysqlimport

Here's one way to get Python's csv module and mysqlimport to play nicely with one another.

When exporting something with the csv module, use:
csv.writer(fileobj, dialect='excel-tab', lineterminator='\n')
When importing with mysqlimport, use:
mysqlimport \
    --user=USERNAME \
    --password \
    --columns=COLUMNS \
    --compress \
    --fields-optionally-enclosed-by='"' \
    --fields-terminated-by='\t' \
    --fields-escaped-by='' \
    --lines-terminated-by='\n' \
    --local \
    --lock-tables \
    --verbose \
    DATABASE INPUT.tsv
In particular, the "--fields-escaped-by=''" setting took me a while to figure out. With it, the csv module and mysqlimport agree that '"' is escaped as '""' rather than '\"'.
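For instance, here's a minimal sketch of the export side (the filename and row values are made up, and it assumes the writer settings shown above). A field containing a double quote gets quoted with the quote doubled, which is exactly what those mysqlimport flags expect:
import csv

# A sketch: INPUT.tsv and the rows are made-up examples.
fileobj = open('INPUT.tsv', 'wb')
writer = csv.writer(fileobj, dialect='excel-tab', lineterminator='\n')
writer.writerow(['1', 'plain value'])
writer.writerow(['2', 'a "quoted" value'])  # written as "a ""quoted"" value"
fileobj.close()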

Wednesday, August 13, 2008

Math: pi

As of today, I am roughly 33π×10⁷ seconds old.
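The trick is that a year is close to π×10⁷ seconds. Here's a quick sanity check (nothing but the standard calendar constants):
import math

seconds_per_year = 365.25 * 24 * 60 * 60       # about 3.156e7
print seconds_per_year / 1e7                   # about 3.156, i.e. roughly pi
print (33 * math.pi * 1e7) / seconds_per_year  # about 32.85 years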

Saturday, August 09, 2008

Linux: LinuxWorld, BeOS, Openmoko

I went to LinuxWorld Conference & Expo again this year like I always do. My mentor Leon Atkinson and I always go together. Here are a few notes.

There was a guy who had a booth for the New York Times. I asked him what it had to do with Linux. He said, "Nothing, but I've sold about 40 subscriptions in the last two days and made about $2000. Wanna buy a subscription?" I felt like I had been hit with a 5lb chunk of pink meat right in the face. There was another booth selling office chairs and another selling (I think) foot massages.

I didn't see Novell, HP, O'Reilly, Slashdot, GNOME, KDE, or a ton of other booths I expected to see. I talked with the lead editor at another "very large, but purposely unnamed" publisher, and he said that they wouldn't be back next year either.

There was a pretty cool spherical sculpture made of used computer parts. I was also pleased to see a bunch of guys putting together used computers and loading Linux on them for schools.

Other than that, I think LinuxWorld may be dead or dying. The editor of that publishing company said that this happens to conferences. They "run their course." Since Linux and FOSS were almost a religious experience for me when I was in college, I'm sorry to see LinuxWorld fizzle out.

I talked to the Haiku guys. I've been watching them. They're trying to rebuild BeOS. I knew that Palm bought Be's IP, so I asked them whatever happened to BeOS's source code. A very knowledgeable person gave me the whole rundown. The summary is that a company now owns it but can't release it for legal reasons. There's too much software in there that they can't get a clear copyright on, and they also have proprietary codecs that they're not allowed to release. He said that there was nothing to fear; Haiku is coming along nicely. They have some of the original BeOS developers, and they are staying true to the super-finely threaded nature of the original BeOS kernel. Unfortunately, it's not yet ready for production use, but they've come a long way.

I talked to a guy at the Openmoko booth. I told him that I'd be very interested in running Openmoko hardware, which is fully open, with Android, which I'm guessing will be relatively polished by the end of the year. He said that they had been talking to Google about it, but it's still up to Google to decide on a timetable. Unfortunately, Openmoko still isn't ready for everyday use. I'm waiting hopefully.

Wednesday, August 06, 2008

SICP: Truly Conquering SICP

This guy is my hero:
I’ve written 52 blog posts (not including this one) in the SICP category, spread over 10 months...Counting with the cloc tool (Count Lines Of Code), the total physical LOC count for the code I’ve written during this time: 7,300 LOC of Common Lisp, 4,100 LOC of Scheme.
Geez, and I was excited when I finished the videos. I feel so inadequate ;)

Python: sort | uniq -c via the subprocess module

Here is "sort | uniq -c" pieced together using the subprocess module:
from subprocess import Popen, PIPE

p1 = Popen(["sort"], stdin=PIPE, stdout=PIPE)
p1.stdin.write('FOO\nBAR\nBAR\n')
p1.stdin.close()
p2 = Popen(["uniq", "-c"], stdin=p1.stdout, stdout=PIPE)
for line in p2.stdout:
    print line.rstrip()
Note that I'm not bothering to check the exit status; see my previous post for how to do that.

Now, here's the question: why does the program freeze if I put the two Popen lines together? I don't understand why I can't set up the pipeline, then feed it data, then close stdin, and then read the result.

Tuesday, August 05, 2008

Python: Memory Conservation Tip: Temporary dbms

A dbm is an on-disk hash that maps strings to strings. The shelve module is a simple wrapper around the anydbm module that takes care of pickling the values. It's nice because it mimics the dict API so well. It's simple and useful. However, one thing that isn't so simple is trying to use a temporary file for the dbm.

The problem is that shelve uses anydbm, which uses whichdb. When you create a temporary file securely, the tempfile module hands you an open file handle; there's no secure way to get a temporary file that isn't opened yet. Since the file already exists, whichdb tries to figure out what format it uses. Since it doesn't contain anything yet, you get a big explosion.

The solution is to use a temporary directory. The next question is, how do you make sure that temporary directory gets cleaned up without reams of code? Well, just like with temporary files, you can delete the temporary directory even if your code still has an open file handle referencing a file in the temporary directory. Don't ya just love UNIX ;)

Here's some code:
import os
import shelve
import shutil
from tempfile import mkdtemp

tmpd = mkdtemp('', 'myprogram-')
filename = os.path.join(tmpd, 'mydbm')
dbm = shelve.open(filename, flag='n')
shutil.rmtree(tmpd)
# I can continue to use dbm for as long as I'd like.
On my system, the shelve module ends up using the dbm module, which creates two files. Furthermore, my tests end up exercising this code in four different places. Despite all of that, since tmpd is removed immediately, no matter how fast I type ls -l, I never even see the directory ;)

Monday, August 04, 2008

Python: Memory Conservation Tip: sort Tricks

The UNIX "sort" command is really quite amazing. It's fast and it can deal with a lot of data with very little memory. Throw in the "-u" flag to make the results unique, and you have quite a useful utility. In fact, you'd be surprised at how you can use it.

Suppose you have a bunch of pairs:
a b
b c
a c
a c
b d
...
You want to figure out which atoms (i.e. items) are related to which other atoms. This is easy to do with a dict of sets:
referrers[left].add(right)
referrers[right].add(left)
Notice, I used a set because I only want to know if two things are related, not how many times they are related.
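For reference, here's a minimal in-memory sketch of that structure using collections.defaultdict (the pairs are the made-up ones from above):
from collections import defaultdict

referrers = defaultdict(set)
pairs = [('a', 'b'), ('b', 'c'), ('a', 'c'), ('a', 'c'), ('b', 'd')]
for left, right in pairs:
    referrers[left].add(right)
    referrers[right].add(left)

print referrers['a']  # set(['b', 'c']); the duplicate ('a', 'c') collapses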

My situation is strange. It's small enough that I don't need a cluster, but it's too big for such a dict to fit into memory. The data itself, however, does fit in /tmp.

The question is, how do you get this sort of hash to run from disk? Berkeley DB is one option. You could probably also use Lucene. Another option is simply to use sort.

If you open up a two-way pipe to the sort command, you can output all the pairs, and then later read them back in:
a b
a c
b c
b d
...
sort is telling me that a is related to b and c, b is related to c and d, etc. Notice, it also removed the duplicate pair a c, and took care of the temp file handling. Best of all, you can stream data to and from the sort command. When you're dealing with a lot of data, you want to stream things as much as possible.

Now that I've shown you the general idea, let me give you a couple more hints. First of all, to shell out to sort, I use:
from subprocess import Popen, PIPE
pipe = Popen(['sort', '-u'], bufsize=1, stdin=PIPE, stdout=PIPE,
             stderr=PIPE)  # stderr is read by the error check below.
I like to use the csv module when working with tab-separated data, so I create a reader and writer for pipe.stdout and pipe.stdin respectively. You may not need to in your situation.

When you're done writing to sort, you need to tell it you're done:
pipe.stdin.close()  # Tell sort we're ready.
Now here's the next trick. I don't want the rest of the program to worry about the details of piping out to sort. The rest of the program should have a nice clean iterator to work with. Remember, I'm streaming, and the part of the code that's reading the data from the pipe is far away.

Hence, instead of passing it a reference to the pipe, I instead send it a reference to a generator. That way the generator can do all the munging necessary, and no one even needs to know that I'm using a pipe.

The last trick is that when I read:
a b
a c
I need to recognize that b and c both belong to a. Hence, I use a generator I wrote called groupbysorted.
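groupbysorted behaves essentially like itertools.groupby applied to presorted input. Here's a minimal sketch of the idea (not the exact code):
from itertools import groupby

def groupbysorted(iterable, keyfunc):
    # A sketch: for presorted input, this is just itertools.groupby,
    # yielding (key, iterator-of-items) for each run of equal keys.
    return groupby(iterable, keyfunc)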

Putting it all together, the generator looks like:
def first((a, b)): return a
def second((a, b)): return b

def get_references():
    """This is a generator that munges the results from sort -u.

    When the streaming is done, make sure sort exited cleanly.

    """
    for (x, pairs) in groupbysorted(reader, keyfunc=first):
        yield (x, map(second, pairs))
    status = pipe.wait()
    if status != 0:
        raise RuntimeError("sort exited with status %s: %s" %
                           (status, pipe.stderr.read()))
Now, the outside world has a nice clean iterator to work with that will generate things like:
(a, [b, c])
(b, [c, d])
...
The pipe will get cleaned up as soon as the iterator is done.

Python: Memory Conservation Tip: Nested Dicts

I'm working with a large amount of data, and I have a data structure that looks like:
pair_counts[(a, b)] = count
It turns out that in my situation, I can save memory by switching to:
pair_counts[a][b] = count
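Here's a sketch of the switch (the keys and counts are made up):
from collections import defaultdict

flat = {('a', 'b'): 3, ('a', 'c'): 1, ('b', 'c'): 2}  # made-up counts

nested = defaultdict(dict)
for (a, b), count in flat.iteritems():
    nested[a][b] = count

print nested['a']['b']  # 3; the nested form allocates no per-pair tuples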
Naturally, the normal rules of premature optimization apply: I wrote for readability, waited until I ran out of memory, did lots of profiling, and then optimized as little as possible.

In my small test case, this dropped my memory usage from 84 MB to 61 MB.

Saturday, August 02, 2008

Python: Memory Conservation Tip: intern()

I'm working with a lot of data, and running out of memory is a problem. When I read a line of data, I've often seen the same data before. Rather than have two pointers that point to two separate copies of "foo", I'd prefer to have two pointers that point to the same copy of "foo". This makes a lot of sense in Python since strings are immutable anyway.

I knew that this was called the flyweight design pattern, but I didn't know if it was already implemented somewhere in Python. (Strictly speaking, I thought it was called the "flywheel" design pattern, and my buddy Drew Perttula corrected me.)

My first attempt was to write code like:
>>> s1 = "foo"
>>> s2 = ''.join(['f', 'o', 'o'])
>>> s1 == s2
True
>>> s1 is s2
False
>>> identity_cache = {}
>>> s1 = identity_cache.setdefault(s1, s1)
>>> s2 = identity_cache.setdefault(s2, s2)
>>> s1 == 'foo'
True
>>> s1 == s2
True
>>> s1 is s2
True
This code looks up the word "foo" by value and returns the same instance every time. Notice, it works.

However, Monte Davidoff pointed out that this is what the intern builtin is for. From the docs:
Enter string in the table of ``interned'' strings and return the interned string - which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup - if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare. Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys. Changed in version 2.3: Interned strings are not immortal (like they used to be in Python 2.2 and before); you must keep a reference to the return value of intern() around to benefit from it.
Here it is in action:
>>> s1 = "foo"
>>> s2 = ''.join(['f', 'o', 'o'])
>>> s1 == s2
True
>>> s1 is s2
False
>>> s1 = intern(s1)
>>> s2 = intern(s2)
>>> s1 == 'foo'
True
>>> s1 == s2
True
>>> s1 is s2
True
Well, did it work? My program still functions, but I didn't get tremendous savings in memory. It turns out that I don't have enough dups, and that's not where I'm spending all my memory anyway. Oh well, at least I learned about the intern() function.