Saturday, May 31, 2008

UNIX: comm = diff - formatting

The comm utility reads file1 and file2, which should be sorted lexically, and produces three text columns as output: lines only in file1; lines only in file2; and lines in both files. It's sort of a twist on diff. It's nice because there's no formatting to get in the way.
$ cat > 1.txt << __END__
1
2
3
__END__
$ cat > 2.txt << __END__
2
3
4
__END__
$ comm 1.txt 2.txt
1
		2
		3
	4
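For the curious, here's a rough sketch of the same three-column idea in Python. It is not how comm itself is implemented; the names comm and format_comm are just mine for illustration, and it compares strings by code point rather than by locale collation, which is an assumption that happens to be fine for this input.

```python
def comm(left, right):
    """Yield (column, line) pairs: 1 = only in left, 2 = only in right, 3 = in both.

    Both lists are assumed to be sorted already, just as comm expects.
    """
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            yield 3, left[i]
            i += 1
            j += 1
        elif left[i] < right[j]:
            yield 1, left[i]
            i += 1
        else:
            yield 2, right[j]
            j += 1
    # Whatever remains on either side is unique to that side.
    for line in left[i:]:
        yield 1, line
    for line in right[j:]:
        yield 2, line


def format_comm(left, right):
    # Column N is indented with N - 1 tabs, mirroring comm's output layout.
    return "\n".join("\t" * (col - 1) + line for col, line in comm(left, right))


if __name__ == "__main__":
    print(format_comm(["1", "2", "3"], ["2", "3", "4"]))
```

The single pass over both lists is why the inputs need to be sorted: each comparison decides which side to advance.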
Thanks to Krishna Srinivasan for teaching me about a new old-school tool ;)

Friday, May 30, 2008

How to Protect Your Open Source Project from Poisonous People

I saw a fantastic talk yesterday: How to Protect Your Open Source Project from Poisonous People. If you help run an open source project, this is a must see.

Thursday, May 29, 2008

Thursday, May 22, 2008

Rant: UNIX vs. the Web

For all its strengths, developing for the Web has become a gigantic pain in the rear, especially when compared to the Unix style of development.

A few months ago, I joined a new startup. Almost all of it is backend processing that doesn't even use a database until almost all the work is done. Our only Web application is for a Web services API. Since I was mostly writing the code from scratch, my boss and I agreed that taking a Unix approach was best. Hence, we have a bunch of simple, standalone tools. Writing such tools is so refreshing. You know exactly what you need to do. They're only a few hundred lines long. You build a nice command line interface using optparse, you write some tests using nose, etc. It's all very straightforward and linear. Wanna know how to do a UNIX-style mashup? You use a pipe.
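To give a feel for what I mean, here's a minimal sketch of such a tool. It's a made-up example, not one of our actual tools: the run function and the --field option are hypothetical, and it assumes tab-separated input on stdin.

```python
import sys
from optparse import OptionParser


def build_parser():
    parser = OptionParser(usage="%prog [options] < input > output")
    parser.add_option("-f", "--field", type="int", default=1,
                      help="1-based tab-separated field to print")
    return parser


def run(argv, stdin, stdout):
    """Read lines from stdin and write the chosen field of each to stdout."""
    options, _args = build_parser().parse_args(argv)
    for line in stdin:
        fields = line.rstrip("\n").split("\t")
        # Skip lines that don't have the requested field.
        if 1 <= options.field <= len(fields):
            stdout.write(fields[options.field - 1] + "\n")


# As a script, this would end with: run(sys.argv[1:], sys.stdin, sys.stdout)
```

Because it only reads stdin and writes stdout, it drops into a pipeline next to sort and uniq without any glue, and passing the streams in as arguments makes it easy to test with nose.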

Recently, I went to work on a Web application again, and I realized just how much of a giant pain in the rear it is. Here are some of the things you need to think about. You need to be fluent in Python (or PHP, Ruby, etc.), HTML, CSS, SQL, and JavaScript. You have to have a Web server, a database, and possibly other auxiliary items like a load balancer. You have to think of the front end and the back end. Usually, you'll use a Web framework, a templating engine, and possibly an ORM. They're all probably young projects, so you'll have to be on the mailing list for each. Despite all the time you'll invest in these projects, you probably won't still be using them in five years.

Besides implementation concerns, the Web is simply complex these days. Have you read all the papers on session fixation attacks, cross-site scripting vulnerabilities, SQL injection attacks, and cross-site request forgery attacks? Do you remember the HTTP response codes for SEE_OTHER and TEMPORARY_REDIRECT? Do you know when you should use each? I see tons of books about various Web frameworks and libraries, but where is there a really good book on how to be good at plain old Web development?
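In case you don't remember them either, here's a small sketch using the constants from Python's standard http module. The redirect_status helper is a hypothetical name of mine, not part of any framework.

```python
from http import HTTPStatus


def redirect_status(repeat_same_method):
    """Pick a redirect status code.

    303 See Other: the client should fetch the target with GET,
    which is the usual choice after handling a POST.
    307 Temporary Redirect: the client should repeat the request
    against the target with the same method and body.
    """
    if repeat_same_method:
        return HTTPStatus.TEMPORARY_REDIRECT
    return HTTPStatus.SEE_OTHER


if __name__ == "__main__":
    for flag in (False, True):
        status = redirect_status(flag)
        print(int(status), status.phrase)
```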

How do you deal with logins? OpenID hasn't really taken off yet, and not everyone can depend on Facebook for authentication. That means you'll need account services. What if the user forgets his password? Did you know that if you URL encode something that's been base64 encoded and then send it in an email, it might not make it through Hotmail in all cases? However, if you make the link too long, users will get confused when their email client breaks it into two lines.
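One possible way to sidestep the URL encoding problem is to use the URL-safe base64 alphabet and drop the padding, so the token needs no percent escapes at all. This is only a sketch: make_token and parse_token are hypothetical names, and I'm not claiming it fixes every mail client.

```python
import base64
import os


def make_token(raw):
    """Encode bytes using only letters, digits, '-' and '_'."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def parse_token(token):
    """Restore the stripped '=' padding and decode back to bytes."""
    padding = "=" * (-len(token) % 4)
    return base64.urlsafe_b64decode(token + padding)


if __name__ == "__main__":
    token = make_token(os.urandom(16))  # 16 random bytes become a 22-character token
    print(token)
```

Keeping the random part short also helps with the line-wrapping problem, since the whole link stays on one line.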

The browser is now one of the most complicated pieces of software on a standard desktop. Everyone knows the DOM is a mess. How do you respond to events? There are at least three ways, and they're all painful for various reasons unless you're using a JavaScript framework. XHR, which is now a staple in the Web world, started life as a Microsoft hack. innerHTML is another such hack. Yet as useful and convenient as it is, it hasn't been blessed by the standards bodies. Seriously, who the heck thought of createElement, setAttribute, etc. in order to inject some HTML? It's very Java-ish and not very JavaScript-ish at all.

By the way, concerning standards bodies--you know, the ones who spent so much time creating XHTML?--I'll remind you that the Mozilla Web Author FAQ still says that "Serving valid HTML 4.01 as text/html ensures the widest browser and search engine support." I.e., use HTML, not XHTML. The WebKit guys say pretty much the same thing.

But the fun doesn't end there. Need to do a Web request against a foreign domain and JavaScript won't let you? There's a workaround for that too. You just use a script tag and JSONP.

The Web is a strange place where the HACKS become the standard by which you get stuff done.

Once you get past all those difficulties, you're stuck with something that's still not as snappy as a desktop app and a lot more difficult to code than an old-school (i.e. no JavaScript) Web app. Essentially, the tough thing about Web apps is that they're large and effectively connectionless. Unless you're using something like Seaside, you have to handle each new Web request completely from scratch.

Hello, who are you? Oh, you have a cookie? Let me see if I know anything about you. Oh, the memcache server says it has a session for you. Let me talk to the database to see if he can tell me more. Ok, here's a form. Get back to me when you're ready.

Hello, who are you? Oh, you have a cookie? Let me see if I know anything about you...

And let's not forget that you're simultaneously carrying on about a hundred such conversations at any given time. You have all the drawbacks of multithreaded coding, except you really can't count on anything being in memory because you're spread across several servers. It's the worst of both worlds.

The nice thing about UNIX tools is that once you get them working, you don't need to think much about them anymore. When was the last time you worried much about cut, uniq, or sort? On the other hand, with a Web app, plan on rewriting it five years from now. Oh, and it'll be even harder and more messed up by then.

Of course, all my complaints just don't matter because the Web has too many good properties. It's vendor and OS neutral. You can run millions of different applications, and the only thing you need to download is a browser. (Oh wait, you already have one? An ancient version of IE? No worries, we can support that too!)

Yet again, we are reminded that worse is better. Apparently, much, much worse is also much, much better.

Friday, May 16, 2008

Python: Google App Engine: Cookie Users Beware

By default, Google App Engine Web applications run on yourapp.appspot.com. That means that some other app, e.g. badguyz.appspot.com, can set a cookie for appspot.com, and your app will get that cookie from the user's Web browser on subsequent requests to your site.

This isn't some remarkable new exploit or anything. It's just something to keep in mind when running on subdomains like this. If you're worried about security, you should use your own domain name and cryptographically sign your cookies (here's some example source code).
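Here's a rough sketch of what signing a cookie can look like, using HMAC from the standard library. This is my own illustration and not the linked example code: SECRET_KEY, sign_cookie, and verify_cookie are hypothetical names, and the "value|mac" layout is just one assumed format.

```python
import hashlib
import hmac

# Hypothetical key; a real app needs a long random secret kept out of source control.
SECRET_KEY = b"replace-with-a-long-random-secret"


def _mac(value, key):
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def sign_cookie(value, key=SECRET_KEY):
    """Return the value with its HMAC-SHA256 signature appended."""
    return "%s|%s" % (value, _mac(value, key))


def verify_cookie(cookie, key=SECRET_KEY):
    """Return the original value if the signature checks out, else None."""
    value, sep, mac = cookie.rpartition("|")
    if not sep:
        return None
    expected = _mac(value, key)
    # compare_digest avoids leaking how many characters matched via timing.
    if hmac.compare_digest(mac.encode("utf-8"), expected.encode("ascii")):
        return value
    return None
```

A cookie planted by another appspot.com app fails verification because that app doesn't know your key. Note that signing only proves the cookie came from you; it doesn't hide the contents.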

Python: Debugging Google App Engine Apps Locally

Python has a wonderful interactive interpreter (i.e. shell). However, sometimes you need more setup before you can start coding. Previously, I wrote Python: Coding in the Debugger for Beginners.

Google App Engine works a bit like CGI in that output to STDOUT goes to the browser. This breaks my normal "import pdb; pdb.set_trace()" trick. However, it's not hard to put STDOUT, etc. back so that you can use pdb in the way you normally would:
import sys
# Restore the real stdin, stdout, and stderr that App Engine replaced.
for attr in ('stdin', 'stdout', 'stderr'):
    setattr(sys, attr, getattr(sys, '__%s__' % attr))
import pdb
pdb.set_trace()

Wednesday, May 14, 2008

The Bipolar Lisp Programmer

Have you ever wondered how it could be that Lisp is so powerful, and yet C is so much more successful and ubiquitous? How is it that so many brilliant coders know Lisp, and yet we so rarely hear from any of them other than Paul Graham? This is a great article that tries to explain it: the bipolar Lisp programmer.

Thursday, May 08, 2008

Joel on Software: Never Rewrite from Scratch

I was thinking of Joel on Software's famous post Things You Should Never Do, Part I where he says, "[Netscape] did it by making the single worst strategic mistake that any software company can make: They decided to rewrite the code from scratch."

Since Joel used to work at Microsoft, I was pondering what would have happened if the Microsoft NT developers had taken that advice and based NT on DOS. Perhaps it's illustrative to compare the quality of Windows ME vs. Windows 2000 and XP.