Friday, November 26, 2010

Personal: Wife in Labor

Wife in labor. Built Twilio app to time contractions. Call it to try it out: (866) 948-3615.

Friday, November 19, 2010

Software Engineering: Coping When You Must Repeat Yourself

These days, most software engineers are familiar with the acronym "DRY", which stands for "don't repeat yourself". In general, it's good advice: in a perfect world, no code would ever be duplicated, and every piece of "truth" would exist in only one place. However, the real world isn't quite so perfect, and things are far less DRY than you might imagine. The question is, how do you cope?

First let me show you some reasons why you can't always keep it DRY:

Often, the same truth is duplicated in the code, the tests, and the docs. Duplicating the same piece of truth in the code and in the tests helps each verify the other. Typically, the code is more general than the tests (the tests verify examples of running the more general code), but the duplication is there. When you update one (for instance, to change the API), you'll need to update the other. This is a case where not keeping it DRY pays off--if you have to update the tests, that's a reminder that you'll also have to update all the other consumers of your API.
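
For instance, here's a little sketch in Python (the function and the numbers are made up). The code states the general rule, and the test duplicates one specific consequence of it, so each can catch a mistake in the other:

def calculate_simple_interest(principal, rate, years):
    """Return the simple interest earned on principal at rate over years."""
    return principal * rate * years

def test_calculate_simple_interest():
    # This duplicates the same truth as the code, but as a concrete example.
    assert calculate_simple_interest(1000, 0.05, 2) == 100.0

if __name__ == '__main__':
    test_calculate_simple_interest()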

Similarly, the API docs often duplicate the truth that is built into the code. That's because it's helpful to explain the truth to the computer in one way (using very precise code) and explain the truth to the reader in another way (using friendly, high-level English). Every truthful comment duplicates what the code already says, but not every piece of code is easily and quickly readable by humans--this is especially true in, say, assembly.
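
For example (a made-up function), the docstring restates in friendly English exactly what the code says precisely:

def median(values):
    """Return the middle value of values.

    If there is an even number of values, return the average of the
    two middle values.
    """
    ordered = sorted(values)
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2.0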

Another area where truth is duplicated is in APIs. A function defines a name and a signature, and every caller duplicates them. They must agree on those things or the code won't work--if a caller uses a different name or a different signature, the code breaks. Essentially, programmers have decided that it's better to duplicate the name and the API than to duplicate the contents of the function. This points to a useful trick--sometimes a small amount of duplication saves a large amount of duplication. You'll also see this sometimes in comments when they say "see also..."
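
Here's a quick sketch of what I mean (the function and the figures are made up). Each call site duplicates the name and the argument list, and that small duplication saves duplicating the formula itself:

def monthly_payment(principal, annual_rate, months):
    """Return the monthly payment on an amortized loan."""
    monthly_rate = annual_rate / 12.0
    return principal * monthly_rate / (1 - (1 + monthly_rate) ** -months)

# Each caller duplicates the name and the API, not the body.
car_payment = monthly_payment(20000, 0.06, 60)
mortgage_payment = monthly_payment(300000, 0.05, 360)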

Another source of duplication concerns public vs. private. For instance, in C, the same API is duplicated in the .h file and the .c file. Sometimes, the same piece of code must be duplicated in different projects. For instance, one operating system might need to define the same C types as another operating system because there's no easy way for them to share the same header files.

At a higher level, one time I had to add the same function I wrote to two projects. One project was proprietary company code. The other was open source (I had permission, of course). For technical reasons, it was impractical for the company code to import or subclass the open source code, so I was stuck just duplicating it.

Often, you'll need to duplicate the same piece of truth in multiple languages. For instance, think of how many HTTP client libraries there are across all the different programming languages. It doesn't matter how good an HTTP client library is if it's not easily accessible from the programming language I'm currently coding in. Sometimes there will be multiple HTTP client libraries for the same language because they're implemented differently (for instance, synchronously vs. asynchronously).

I mentioned tests before. Often, tests duplicate some setup or teardown, or perhaps the same pattern of interacting with a function. Refactoring is sometimes appropriate, but not always. It is commonly held that this is one area where keeping it DRY is less important than keeping it simple and isolated. A perfectly DRY collection of unit tests that is difficult to comprehend and difficult to debug when something fails is less helpful than a set of simple, isolated unit tests that contain a small amount of duplication. If the duplication causes multiple tests to fail, you'll know to keep fixing the tests until they all pass.
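
For instance (the Account class is made up), I'd rather see each test repeat a line of setup than have to decode a cleverly factored fixture when one of them fails:

import unittest

class Account(object):
    """A tiny made-up class, just for illustration."""

    def __init__(self, balance):
        self.balance = balance

    def deposit(self, amount):
        self.balance += amount

    def withdraw(self, amount):
        self.balance -= amount

class AccountTest(unittest.TestCase):
    def test_deposit(self):
        account = Account(balance=0)    # A little duplicated setup...
        account.deposit(50)
        self.assertEqual(account.balance, 50)

    def test_withdraw(self):
        account = Account(balance=100)  # ...keeps each test simple and isolated.
        account.withdraw(25)
        self.assertEqual(account.balance, 75)

if __name__ == '__main__':
    unittest.main()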

The question remains, how do you cope when you can't keep it DRY?

Greppability is very important. (By grep, I mean any tool that can search for a string or regular expression; I don't necessarily mean the UNIX tool "grep".) In highly dynamic languages like Ruby (which have great facilities for metaprogramming, but no static typing or interfaces) and highly factored frameworks like Rails (which use lots of files and levels of indirection), even a brilliant IDE can pale in comparison to a simple "grep tour". If you refactor a class in Ruby, how will you remember to refactor all the mocks of that class? A user of your class might have a mock of it that still uses your old API; the tests might pass even though the code will assuredly crash. If you use grep, you can update all the callers of your class as well as all the mocks of it. Grep can also help you find instances of a string in non-code documentation, and it even occasionally works with binary files. My point is, don't underestimate the utility of grep. Rather, aim for greppability. A function named "f" is not greppable, but a function named "calculate_apr" is. (By the way, naming all your loop variables "iterator" does not improve greppability; it just wastes time.)

Another way of coping when things aren't DRY is to have cross-referencing comments. If you know that you must duplicate the same piece of truth in five places, add a comment next to each of those five places that refers to the other four. Don't be afraid to duplicate the comment exactly. Your comment can say something like, "If you change this, don't forget to update..."
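
For example (the setting and the file names are made up):

# settings.py
# If you change this limit, don't forget to update the same value in
# static/js/upload.js and in the "Limits" section of docs/api.txt.
MAX_UPLOAD_MB = 25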

Another thing that helps mitigate duplication is proximity. Docstrings belong in the code because a programmer who updates the code is more likely to also update the docstring sitting right next to it (although even proximity can't always help lazy programmers). If all the API documentation is in a separate file, that file will go stale very quickly.

Parallelization also helps--that is, laying out duplicated code in parallel so the repetition is easy to see and maintain. For instance, this code has a small amount of duplication:
some_a = 1
some_a.invoke_method()
register(some_a)
call_something_unrelated()
some_b = 2
some_b.invoke_method()
register(some_b)
Sometimes you can factor out this duplication. However, in less dynamic languages like C, that isn't always easy to do. In those cases, parallelization can really help:
some_a = 1
some_b = 2
some_a.invoke_method()
some_b.invoke_method()
register(some_a)
register(some_b)
call_something_unrelated()
Another old trick for coping with duplication is to have one source generate the other. Generating API documentation using javadoc is a good example of this. Sometimes you can use a program to generate code for multiple programming languages. There's another example of "generation" that I sometimes use in Python: string interpolation when creating docstrings. If there's a piece of documentation that should be duplicated in multiple places, string interpolation lets me write that piece of documentation only once.
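
Here's a sketch of that trick (the functions are made up). Since Python only treats a plain string literal as a docstring, the interpolation happens via __doc__:

_DATE_FORMAT_NOTE = "Dates are ISO 8601 strings, such as '2010-11-19'."

def parse_date(text):
    raise NotImplementedError

parse_date.__doc__ = """Parse a date string into a date object.

%s
""" % _DATE_FORMAT_NOTE

def format_date(date):
    raise NotImplementedError

format_date.__doc__ = """Format a date object as a string.

%s
""" % _DATE_FORMAT_NOTE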

Another source of duplication has to do with the plethora of tools programmers must use. There is the source code itself, a revision control system, a bug tracker, and a wiki, and oftentimes the same piece of truth needs to be duplicated in all of these places. This is one place where Trac really shines. Once you properly configure Trac, you can reference the bug number in each of your commits. Trac's commit hook will take that commit and add it as a comment on the original bug, with a reference to the source code in Trac's source code viewer. Hence, Trac (which is a bug tracking system, a wiki, and a source code viewer) and the revision control system work together to reduce duplication.
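
For example, with a typical commit hook configuration (the ticket number here is made up), a commit message might look like this, and Trac would add it as a comment on ticket #123:

Validate phone numbers before queuing outbound SMS messages.

Fixes #123.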

It's unfortunate that life isn't always as DRY as you'd like it to be. However, keeping a few tricks in mind can really help mitigate the problems caused by having to duplicate a piece of truth in more than one place. If you have other tricks, feel free to leave them in a comment below.

Thursday, November 18, 2010

JavaScript: Naughty Socket.IO Example

File this under the "things you probably shouldn't do, but are fun anyways" category. Socket.IO is a library for Node.JS that provides Comet using a plethora of different approaches (WebSocket, Flash socket, AJAX long polling, etc.). I hacked the Socket.IO chat demo so that it reads HTML from my terminal and just dumps it to the browser. Hence, I can control people's browsers from my terminal. Insecure? Yeah. Fun? Oh yeah!

Anyway, here's how I hacked the server.js file in Socket.IO's chat demo:
var fs = require('fs');  // Up at the top of server.js, with the other requires.

io.on('connection', function (client) {

  // Read from /dev/tty and send it to the browser.
  var stream = fs.createReadStream('/dev/tty', {encoding: 'ascii'});

  stream.on('error', function (exception) {
    client.send({announcement: 'Exception: ' + exception});
  });

  stream.on('data', function (data) {
    client.send({html: data});
  });
...
And here's how I hacked chat.html:
function message(obj) {
  var el = document.createElement('p');
  if ('html' in obj) el.innerHTML = obj.html;
...
Here's what it looks like in my terminal:
sudo ./server.js
18 Nov 10:14:04 - socket.io ready - accepting connections
18 Nov 10:14:06 - Initializing client with transport "websocket"
18 Nov 10:14:06 - Client 5832344139926136 connected

<i>I'm typing this in to control the page.</i>
<script>alert('Oh baby!');</script> This doesn't work with innerHTML, thankfully.
<ul><li>Node.JS</li><li>Socket.IO</li></ul>

Thursday, November 11, 2010

Jobs: Looking for People to Work With Me

I'm really enjoying myself here at Twilio. We're looking for a few more people, and I wonder if any of my readers would like to come work with me.

Twilio makes it easy for normal web developers to write voice- and SMS-enabled applications. If you don't know what I mean, try calling my app: (888) 877-7418. By the way, Jeff Lindsay, the SHDH house guy, is here too.

Here are the positions we're hiring for:
  • DevOps Engineer
  • Senior Software Engineer
  • Core Team
  • Software Engineering Leader, Organizer, Mentor
  • Customer Advocate
  • Developer Evangelist
  • Product Manager
We use a mix of Python, Java, PHP, and Ruby. We're in San Francisco. We just closed a second round of funding, but we also make a lot of money.

Here are the actual job postings. Contact Joanna Samuels for more information.

Tuesday, November 09, 2010

Books: Digital At Work: Snapshots From The First Thirty-Five Years

I just finished reading Digital At Work: Snapshots From The First Thirty-Five Years.
"Digital At Work" tells the story of the first thirty-five years of Digital Equipment Corporation [DEC] and illuminates the origins of its unique culture. First person accounts from past and present members of the Digital community, industry associates, board members, and friends - plus a wealth of photos from Digital's archives - trace the company's evolution from the 1950s to present.
In short, I really enjoyed it. By reading this book, I was able to vicariously experience the growth and history of one of the most significant companies in the history of computing, and it definitely left an emotional impact.

I think one of the most interesting things about Digital was its culture. Some people might call it chaos. Other people might call it a meritocracy. It was definitely in the MIT tradition. It wasn't uncommon to get into shouting matches over which approach to take. Good ideas were always more important than what little company hierarchy existed.

Here are a few random, interesting quotes I jotted down while reading the book. I left out the page numbers because you can just search for them in the PDF:
Fortune magazine’s report in the late 1950s that no money was to be made in computers suggested the word itself be avoided in Digital’s first business plan.

If you had to design a modern computer with the tools we had, you couldn’t do it. But to build the first computer was an eminently doable thing, partly because we could design something that we could build.

Many of Sketchpad’s capabilities were sophisticated even by the workstation standards of the 1980s. “If I had known how hard it was to do,” Sutherland said later, “I probably wouldn’t have done it.”

Six MIT students, including Alan Kotok and Peter Samson, bet Jack Dennis, who ran the PDP-1, that they could come up with their own assembler in a single weekend. They wrote and debugged it in 250 man-hours, and it was loaded onto the machine when Dennis came to work on Monday. This was the sort of job that might have taken the industry months to complete.

Success depended on extraordinary personal commitments, often creating high levels of personal stress. “The atmosphere has always been that of small groups of engineers with extremely high energy, working hard and aggressively for long, long hours--always on the edge of burnout,” says Jesse Lipcon. “That can be both positive and negative.”

“We didn’t have much experience,” says Cady, “but we were energetic, enthusiastic, and too dumb to know what we were doing couldn’t be done. So we did it anyway.”

And, of course, we disagreed with much of what the original committee had done. So in the best Digital tradition, while creating the impression that the specs were frozen and we were just fixing some bugs, we surreptitiously went around changing many things, simplifying the protocols as much as we could get away with.

Primarily, architecture is the ability to take complex problems and structure them down into smaller problems in a simple, tasteful, and elegant way.

“It worked out that there were about a million lines of code in each new version of VMS,” says Heffner. “The first version was about a million lines of code, and by the time we got to Version 5, there were 5 million lines of code. We’re talking about a really heavy-duty operating system here, with more functionality than the world at that time had ever known.”

We’d assign new kids to a senior person who would look after them, like an apprentice. Managing a good software engineer is like raising a kid--you want them to get into a little bit of trouble, but you don’t let them burn down the house.

[In Galway, Ireland] we were the only nonunion shop around, we paid well, and we did a lot of employee training so people could move up to higher-paying jobs very quickly. The hierarchy between workers and management was invisible.

This book is wonderful... As for people just entering the computer field, they will get a sense of how wonderfully uncomplicated things were, how exciting and liberating the challenges were, and how much actually got done.

Thursday, November 04, 2010

Linux: The Tiling Window Manager I Wish I Had

Every year or two, I switch to a tiling window manager such as xmonad or dwm. Inevitably, I switch back to GNOME after a couple weeks. Sometimes it's because the window manager doesn't fit in with the rest of my GNOME desktop (it used to be non-trivial to get xmonad to work with GNOME's panel). Sometimes it's because of bugs related to having a weird window manager (NetBeans used to freak out with xmonad, and Flash refused to go full-screen). Every time I try again, a bunch of things have improved. xmonad even had a project aimed at making it more accessible to GNOME users. Still, I think the biggest problem I have is that tiling window managers make some assumptions that just don't work out for me in practice.

I use more than just terminals. I still like to use things like GVim, the GIMP, Google Chrome, a graphical chat client, etc. In fact, I even get a real kick out of writing GUIs. Some tiling window managers assume you're going to live in a terminal (which is partially true), and they only give special attention to GUIs like the GIMP as an afterthought. The ramifications of that attitude tend to be frustrating in practice.

Ease of use is important. I'm as good at reading man pages as the next guy, but it's even better when I don't have to. Furthermore, every minute that I spend reading man pages is time I'm not a) getting my real work done or b) actually using the software. It's really helpful if I can switch to a new window manager without spending a day trying to memorize all the key bindings or struggling to get it to work on my Ubuntu GNOME desktop. Don't get me wrong--I love hot keys, but it's better when I can learn them as I'm using the software (Firefox is like that). Furthermore, graphical configuration utilities are helpful, but don't go overboard. Whenever possible, I shouldn't have to configure it to do the right thing; it should do the right thing by default.

I want to see my desktop background. I'm a little bit on the obsessive compulsive side. Hence, my desk is always as empty as possible. I feel uncomfortable when it's messy. In the same way, when my desktop has only a couple windows open and I can see the desktop background, I feel like things are calm and under control. If my entire desktop is covered with text, I feel like I'm out of control. There's no reason why by default my 27" monitor needs to be completely covered in order for me to edit a single file. (Note, I make heavy use of virtual desktops in order to keep each individual desktop sparse and well organized.)

I want window decorations. More specifically, since I have a 27" monitor, I can afford more than a single pixel to separate windows. If there are nice borders between windows, it looks cleaner and more under control. Having text from two windows be only a single pixel apart makes me feel uncomfortable. Nice raised, lowered, etc. borders have been an important tool for UI designers for several decades--use them. Furthermore, I want buttons. Buttons help you get started using software you're unfamiliar with. I know xmonad has done some work on this. If the buttons have mouse-over text containing the hot keys, that's even better because it means I can learn the software and become faster as I'm using it.

Windows should only expand when I ask them to. I don't need a confirm dialog to cover my entire 27" monitor. Most GUI windows know how big they should be, so it's best to respect that by default. This is really important with the GIMP. However, it's also important with things like GVim. By default, I want my terminal and GVim to be 80 columns wide. However, I want buttons on the bottom and sides of the window so that I can tell the window manager to maximize the window horizontally or vertically. For instance, I often want GVim to maximize vertically, but not horizontally. I want my terminal to be 80x24 unless I need to read some log file, in which case I'll press the buttons to maximize the window.

Overlapping windows are not the end of the world. A window that's too small is useless. In fact, when I use GNOME, I often stagger my windows. The windows should tile as long as there's room, but when there's not enough room, they should start to stagger. As long as I can see part of the window in order to recognize it and click on it to select it, that's fine. I also don't feel that it's strictly necessary to stick to a grid; think of how you pack a box with random items, and you'll know what I mean.

Small is not always better. My laptop has 4 gigs of RAM. Hence, it doesn't really matter if the window manager fits in 4 KB or 8 KB of RAM. It doesn't even matter to me if it takes 100 MB of RAM. Lynx is cool, but most of the time I use Google Chrome. Don't get me wrong; I like small, simple, efficient software as much as the next guy, but I also like software that's smart, friendly, and helpful. Let's face it: these days, my computer is much faster and has a lot more memory than I do, so let's optimize software for me, not the computer. Of course, let's not forget that large software can still be conceptually simple.

What I have described doesn't perfectly fit the model of a tiling window manager. What I have described is a normal window manager that has the personality of a tiling window manager. What I like most about tiling window managers is that they are a) innovative and b) helpful. I think those same characteristics would hold for a hybrid window manager that behaved somewhere between a normal window manager and a tiling window manager.

Wednesday, November 03, 2010

JavaScript: A Second Impression of NodeJS

When I first heard about NodeJS, my reaction was, "Why would I use JavaScript on the server when there are similar continuation-passing-style, asynchronous network servers such as Twisted and Tornado in Python already? Python is a nicer language. Furthermore, I prefer coroutine-based solutions such as gevent and Concurrence." However, after watching this video, Ryan Dahl, the author of NodeJS, has convinced me that NodeJS is worthy of attention.

First of all, NodeJS is crazy fast. Dahl showed one benchmark that had it beating out Nginx. (However, as he admitted, it was an unfair comparison since he was comparing NodeJS serving something out of memory with Nginx serving something from disk.) It's faster than Twisted and Jetty. That last one surprised me.

Dahl argued against green thread systems and coroutine-based systems due to the infrastructural overhead and magic involved. He argued that he doesn't like Eventlet because it's too magical both at an implementation level and also because he doesn't like multiple stacks. As I said, I'm not at all convinced by his arguments, but it reassures me that he was at least able to talk about such approaches. When I brought them up during Douglas Crockford's talk on concurrency, Crockford just gave me a blank, dismissive stare.

Dahl argued that by using callbacks for all blocking calls, it's really obvious which functions can block. As much as I dislike "continuation-passing-style", he makes a good point.

Dahl argued that NodeJS has an advantage over EventMachine (in Ruby) and Twisted (in Python) because JavaScript programmers are inherently comfortable with event-based programming. Dahl also argued that JavaScript is well suited to event-based programming because it has anonymous functions and closures. Ruby and Python have those things too, but Dahl further argued that in Ruby or Python it's very easy to accidentally call something that blocks, since it's hard to know whether a given call blocks or not. In contrast, NodeJS is built from the ground up so that pretty much everything is non-blocking.

NodeJS has built in support for doing DNS asynchronously, and it supports TLS (i.e. SSL). It also supports advanced HTTP features such as pipelining, chunked encoding, etc.

NodeJS has an internal thread pool for making things like file I/O non-blocking. That was a bit of a surprise, since I wasn't even sure it could do file I/O. The thread pool is only accessible from C since Dahl doesn't feel most programmers can be trusted with threads. He feels most programmers should only be trusted with the asynchronous JavaScript layer, which is harder to screw up.

Dahl still feels that it's important to put NodeJS behind a stable web server such as Nginx. He admits that NodeJS has lots of bugs and that it's not stable. He's not at all certain that it's free of security vulnerabilities.

In general, Dahl believes you'll only need one process running NodeJS per machine since it is so good at not blocking. However, it makes sense to use one process per core. It also makes sense to use multiple processes when you need to do heavy CPU crunching. At some point in the future, he hopes to add web workers a la HTML5.

NodeJS has a REPL.

Dealing with binary in NodeJS is non-optimal because dealing with binary in JavaScript sucks. NodeJS has a buffer class that sits outside V8. V8's memory management makes it impossible to expose pointers since memory may be moved around the heap by the garbage collector. Dahl would prefer to deal with binary in a string, but that's not currently possible. Furthermore, pushing a big string to a socket is currently slow.

Dahl works for Joyent.

Although I still feel Erlang has a real edge in the realm of asynchronous network servers, Erlang is difficult for most programmers to adapt to. I think NodeJS is interesting because it opens up asynchronous network programming to a much wider audience of programmers. It's also interesting because it allows you to use the same programming language and in some cases the same libraries on both the client and the server. I'm currently looking at NodeJS because I want to use socket.io for Comet aka "real time" programming. Only time will tell how NodeJS works out in practice.