Friday, June 29, 2007

Computer Science: Coping with Unknown Types

What do "void *" (a la C), polymorphism (a la C++ classes), interfaces (a la Java), generics (a la C++ templates), and duck typing (a la Python) all have to do with one another? They're all ways in which you can write code that works with types that you didn't envision when writing the code.

A "void *" in C is a pointer to something of unspecified type. You can't do very much with it unless you know what type the something is. However, you can still pass it around. You can store it in a list or tree. You can take it and later pass it back to a callback function. All of these things are useful, and, in fact, this functionality still exists in Java (albeit, it's a lot safer in Java). However, instead of casting to "void *", you cast to "Object".

Polymorphism in languages like C++ and Java lets you take an object and call methods on it without necessarily knowing exactly which subclass the object belongs to. Let's suppose there is a class named Fruit with a method named peel, and let's suppose there are two subclasses named Apple and Orange. If you have a list of apples and oranges, you can loop over that list and call peel on each fruit. Even better, if someone later creates a Lemon class, and slips a few lemons into that list, your code will still know how to peel them.
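
Here's a minimal Python sketch of that example (using the hypothetical Fruit, Apple, and Orange classes from above):

  class Fruit(object):
      def peel(self):
          raise NotImplementedError

  class Apple(Fruit):
      def peel(self):
          print "peeling an apple"

  class Orange(Fruit):
      def peel(self):
          print "peeling an orange"

  # The loop neither knows nor cares which subclass each item is,
  # so a future Lemon subclass would work here too.
  for fruit in [Apple(), Orange()]:
      fruit.peel()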

However, what if you don't want to subclass Fruit? What happens if you have an object that knows how to peel itself as well as a ton of other things? Do you need to subclass multiple classes? An interface in Java (or a typeclass in Haskell) lets you say that your object knows how to peel itself, without requiring any specific subclassing. Instead, it can implement some Peelable interface, and that's close enough. Hence, instead of peeling a list of fruit, your code can now peel a list of objects that each implement the Peelable interface. Those objects might not be related at all, and they're free to implement all sorts of interfaces aside from just the Peelable interface.

Generics (which C++ calls templates) let you write code and leave blanks in it that can be filled in later. Generics are an interesting subject, and the question really comes down to what kinds of things you can leave blank.

Generics in Java are actually pretty weak. It used to be that if you wanted a list, you had to have a list of Objects (remember the "void *" trick?). You didn't know exactly what was in the list. These days, with Java generics, you can tell the compiler that you're creating a list, and that the list can only contain Apples. The list class is generic, and you're "filling in the blank" with the type Apple. However, generics in Java are limited. For instance, because the type parameter is erased at compile time, you can't write generic code that says "Create a new [blank]". (If I'm wrong, please leave a comment!)

C++ templates are more powerful. You can do all the same things that you can do in Java, but you can also do things like create a template that says "Create a new [blank]". The differences have to do with how the compiler implements templates. When you tell the compiler that you want to "fill in the blank" with an Apple, i.e. "Create a new Apple", that's called instantiating the template. By the way, this is something that happens at compile time. Now, let's suppose you have a template for lists, and you want a "list of Apple" and a "list of Orange". One way the compiler can implement this is to take the code for list and fill in all the blanks with Apple, then take the code for list and fill in all the blanks with Orange. You'd end up with two slightly different versions of the same code in the compiled binary. I don't know how modern C++ compilers do it, and feel free to call me ignorant, but it really makes me suspicious when I see how big C++ binaries are compared to C binaries ;)

Generics in functional programming languages like Haskell are even more impressive. If a function takes a fruit and then peels it, Haskell can automatically figure out that the function will work with any object that can be peeled. The impressive thing is that it can in many cases automatically infer this interface at compile time! You don't even have to tell the compiler that you're trying to write generic code. (Note, I'm handwaving a little about when you do and don't need to use type classes.)

Duck typing (also known as latent typing) achieves the same goal, but does so using runtime checks. Hence, if you write a function that takes an object and peels it, you don't need to subclass anything or write an interface. However, at runtime, the interpreter will figure out if the object actually knows how to peel itself. On the one hand, you don't get as many compile-time safety checks, but on the other hand, it's really easy to understand. You can accept whatever objects you want, and call any methods you want, and if it doesn't actually work at runtime, you'll get a nice exception that you can respond to in a controlled manner. There's an old joke that says that C++ is like juggling chainsaws in full body armor, whereas Python is like juggling rubber chickens. Even better, you can do tricks like have the same object respond to any method. For instance, you can call any method on a proxy object, and it will just proxy that method call to the object it is acting as a proxy for. The same proxy object can proxy any object with any interface.
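
To make that concrete, here's a rough Python sketch (peel_all and LoggingProxy are made-up names for illustration):

  def peel_all(things):
      # No base class or interface required: anything with a peel()
      # method works, and anything without one raises AttributeError.
      for thing in things:
          thing.peel()

  class LoggingProxy(object):
      def __init__(self, target):
          self._target = target

      def __getattr__(self, name):
          # Called for any attribute not found on the proxy itself, so the
          # same proxy class works for any object with any interface.
          print "proxying", name
          return getattr(self._target, name)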

Ok, that was a pretty basic overview of a bunch of related language features. As I said, they're all ways in which you can write code that works with types that you didn't envision when writing the code. Now, take a minute and think about that problem and the many different ways to solve it. If you wrote a new language, how might you solve it differently? Leave me a comment below!

Wednesday, June 27, 2007

10 Reasons Big Projects Suck

Have you ever noticed that big projects inevitably get a bad rap? Here are 10 reasons why:

  1. Let's assume for a moment that there's one bug for every 100 lines of code. If a big project has 10 times as much code as a small project, it has 10 times as many bugs. In reality, because big projects are harder to understand and intrinsically harder to change quickly, it probably has more than 10 times as many bugs.

  2. If a big project implements some feature A, there is bound to be some bug in it. That proves that the big project is buggy. Furthermore, inevitably, the feature isn't exactly what you need. That means it's inflexible.

  3. If, on the other hand, the smaller project doesn't implement feature A, it can't possibly have the same bug the big project has. Hence, it's not buggy. Furthermore, since you'll need to implement feature A yourself, you'll probably implement exactly what you need. That means it's more flexible.

  4. Furthermore, there are a lot of people who don't even want feature A. That proves that the big project is bloated.

  5. If a developer is a member of a big project, he is probably already using it in production, and he doesn't much care what some young, know-it-all kid says about his code. Ever wonder why Microsoft doesn't seem to care when people criticize it? They're too busy making money!

  6. However, if a developer is a member of a small project, he can afford to make fun of the big project. No one knows who he is, so they surely can't insult his work. He has security by obscurity!

  7. Furthermore, since so many people have worked on the large project, he can insult it vehemently without feeling morally responsible for insulting another person's hard work. It's like a shoplifter who shoplifts small items from large stores thinking the large store is too big to care.

  8. Let's suppose 1 out of every 10 projects succeeds. The other 9 will make claims that turn out to be false, but since they don't succeed, no one remembers. The 10th project, however, will make claims that turn out to be true. It has instant credibility. Hence, it is free to make claims, and many people won't even bother to verify or question those claims...at least until it becomes a big project and people start realizing that 9 out of 10 of its claims are actually false.

  9. If you only need to implement 1 feature, you can do so in code that is very simple and direct. Now, if you need to implement 10 features, there is bound to be some duplication. Hence, you can either a) live with the duplication, or b) refactor. If you live with the duplication, your project will be plagued with bugs that need to be fixed in multiple places. (Don't repeat yourself!) However, if you refactor the code, you'll end up with code that is (necessarily) more complex than when you only needed to implement 1 feature. Younger coders may not even be able to understand the code at all. Hence, they'll just call it stupid, bloated, and overly complicated.

  10. If a project is successful, it'll make it into production. Furthermore, people will need new features in the product. In implementing those new features, it may be necessary to refactor. When you refactor, you may need to decide whether to a) keep the existing API, b) rewrite the API, or c) create a compatibility layer. If you keep the existing API, you'll have to somehow "tack on" the additional functionality within that API. This may result in a hideous, unintuitive API. If you rewrite the API, you'll break everyone's code. If you provide a compatibility layer, you'll end up with twice as many APIs to support. Hence, implementing new features is the fastest way to end up with legacy cruft!

If you know more reasons why big projects suck, post them below! :-D

Monday, June 25, 2007

Random Comments from Google Developer Day

I went to Google Developer Day. Yeah, yeah, I know, that was weeks ago, and I'm only finally blogging about it now. Better late than never! Here are some random, sparse comments:

Keynote

There were about 1500-5000 developers worldwide attending this event. A ton of APIs were launched in 2006. He mentioned Yahoo Pipes. Google Mashup Editor is a mashup of mashups. I felt pretty overwhelmed pretty quickly. Gears is about offline access for Web apps. It supports all major browsers and all major platforms. It's pretty weird to see SQL in JavaScript. It's based on SQLite. There is a managed "sync" process. Google Reader will soon work offline. They're working closely with Adobe (e.g. Apollo). It was weird to hear the Adobe guy say, "Works on Linux". Sergey has a great sense of humor.

Gears Talk

You can configure a set of URLs for it to capture for use offline. This stuff is stored in a place separate from the normal browser cache. I saw a bit of code, "rs.getFieldByName('name')". Ugh! Why must people force JavaScript to look like Java? Don't they know that they could do "rs.name"? They implemented a worker pool so you can run JavaScript in the background. These are processes, so they don't have shared state. You can pass code as strings between the processes. They're adding full text search to SQLite.

Google Infrastructure Talk

Google was still at Stanford in '97. In their current server design, they've gone back to not using cases. They're still using low-end hardware. Note that GFS is not at the kernel level. They have 200+ clusters. MapReduce is not used for user search. It's more for heavy-duty tasks like indexing. BigTable is pretty amazing. It's a distributed, multi-dimensional, sparse map. They have fine-grained load balancing and fast recovery. They have distributed locks and a locking service. Their largest [BigTable?] is 3000TB on several thousand machines. I asked, and he said that open sourcing GFS "isn't unthinkable".

Google Web Toolkit

Ajax lets the server be stateless. The Java IDE they're using works with Google Web Toolkit even though the Java is being compiled down to JavaScript. Even setting breakpoints works, although to do this they're using a "hosted Web browser". A big benefit of GWT is that the IDE's refactoring support can be used on code that is getting compiled to JavaScript. All the compiling is done behind the scenes, so it feels more like editing a scripting language than editing a compiled language. In general, they prefer functionality over "bling". Hence, they prefer native UI elements rather than recreating all elements from scratch. GWT does the right thing with browser history. They use property files for I18N. They can catch errors in the property files at compile time. They do have nice advanced widgets. Font size changes are handled gracefully. The functional demos were pretty impressive. GWT takes care of managing the image cache really nicely. The compiler only puts in the JS libraries you actually need. Cute quote: "Even though it's open source, we decided to document it." If you're using GWT, you don't need to be an expert in browser quirks, you just need to know Java. GWT supports inline JavaScript.

Alex Martelli's Design Patterns in Python Talk

This is the third time I've seen this talk, and this time I was able to understand everything he said ;)

Theorizing the Data: Avoiding the Capital Mistake

This was a great talk about statistical approaches to linguistics. Probability and statistics papers were really big at the ACM in 2006. Everyone is fighting the spam problem. The speaker emphasized that more data results in better results, which is why he went to Google. Lots of data results in good machine learning, which results in more usable language translations. In trying to do automated translations, nothing matters more than statistics. Getting hints from linguists wasn't all that helpful when they tried it. It would appear that humans may learn language by having a statistical understanding of patterns; after all, there are too many rules with too many exceptions.

Tuesday, June 19, 2007

Computer Science: Smart Code Reloading

How do you reload code at a per-module level? How do you deal with the data that the module might contain?

Reloading code on the fly is something that the original Lisp machines were famous for. Erlang/OTP is famous for this too. In my own project, Aquarium, which is a Python Web application framework, I used to do this trick as well.

In Python, reloading code is relatively easy (with a bunch of caveats having to do with import "graphs" and inheritance hierarchies). However, what do you do with the data? When you reload the module, the old data in that module is lost.

I've always wondered how the Lisp guys did it. How did they cope with changes in the data format? If you have a list of tuples of length 3, what happens if the new code expects a list of tuples of length 4?

In Rails land, they have database migration scripts. Hence, you specify the entire schema as an iterative set of changes to the database, starting from an empty database. You can also back out a migration if things don't work out.

I'm going to make a hypothesis. I suspect Erlang/OTP already does it this way using Mnesia, their in-process, distributed database. First of all, you don't keep any state at the module level. In true functional style, data is on the "stack" (although how the language is implemented is something else). Data that needs to survive a module reload is stored in an in-process "database". Note, I'm using the term "database" loosely, and I'm definitely not talking about SQL. To change the data format of the data stored in the "database", you write a migration. Hence, when you reload a module, you get new code, and you migrate the old data.
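
I haven't verified that this is how Erlang/OTP actually does it, but here's a rough Python sketch of the idea (the store dict, migrate, and reload_and_migrate are all made up for illustration):

  # Data that must survive a reload lives outside the module being reloaded.
  store = {"version": 1, "rows": [(1, 2, 3)]}

  def migrate(store):
      # Hypothetical migration: tuples of length 3 become tuples of length 4.
      if store["version"] == 1:
          store["rows"] = [row + (None,) for row in store["rows"]]
          store["version"] = 2

  def reload_and_migrate(module, store):
      reload(module)   # Python 2's builtin reload()
      migrate(store)   # then bring the old data up to the new format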

(Thanks go to Alex Jacobson and Mike Cheponis for many stimulating discussions.)

Monday, June 18, 2007

Linux: Ubuntu 7.04 on a Compaq Presario C500

I got Ubuntu 7.04 working on a new Compaq Presario C500 laptop. It's running really well, and it only cost me $479 :-D

The Ubuntu installer automatically resized sda1 (the primary partition) to 41595MB. I'm super impressed that it knows how to resize an NTFS partition! Hence, dual-booting Ubuntu and Windows Vista was really easy. By the way, I left sda2 alone. It's 5946MB, and it contains the Compaq restore image.

By the way, does anyone else feel that Compaq computers running Windows are simply an ad delivery mechanism? The default 512MB of RAM is scarcely usable in Vista. Fortunately, it's just fine under Ubuntu.

I set up wireless using ndiswrapper.

I kept hitting the touchpad with my thumb, which was messing me up when I was typing. Hence, I did:
  • apt-get install gsynaptics
  • Set "SHMConfig" to "true" in the touchpad section of /etc/X11/xorg.conf.
  • I restarted X.
  • I ran the touchpad preferences utility at System :: Preferences :: Touchpad. I disabled tapping.
  • The next time I logged in, the sensitivity was turned down all the way. Hence, I had to use the touchpad preferences utility to turn it up again.
To get the screen to work at the right resolution, I ran "apt-get install 915resolution" and rebooted.

I've noticed that suspend does not work, but hibernate does.

Anyway, I'm happy :-D

Thursday, June 14, 2007

Ruby: I'm on DZone Again!

Wahoo! I ended up on DZone again! This time, it was for Ruby: A Python Programmer's Perspective. I'm pretty excited that I've made it onto DZone using multiple different languages! :-D

Wednesday, June 13, 2007

Ruby: A Python Programmer's Perspective

As a "language lawyer", it's fun to learn new languages and see how they differ in subtle ways. Here are some of the many ways Ruby is different from Python, etc. Most of these aren't necessarily good or bad, they're just different. Looking at the differences, it's fun to try to peek into the design decisions behind the languages. If you've noticed more interesting differences, post them below as comments!

In Ruby, classes, modules, and constants must begin with an uppercase letter. Actually, this reminds me of Haskell.

Ruby uses "end" instead of indentation. That's fine unless you're a Python programmer like me who keeps forgetting to type "end" ;)

Ruby doesn't have true keyword arguments like Python. Instead, if you pass ":symbol => value" pairs to a function, they get put into a single hash. Python can act like Ruby using the "**kwargs" syntax, but Ruby cannot act like Python; it cannot explicitly declare which keyword arguments are acceptable in the function signature.
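
For example, here's the Python side of that comparison (connect is just a toy function):

  def connect(host, port=80, timeout=30):
      # The signature itself says which keyword arguments are allowed.
      print host, port, timeout

  connect("example.com", timeout=5)    # fine
  # connect("example.com", retries=3)  # TypeError: unexpected keyword argument

  def connect_ruby_style(host, **options):
      # Closer to Ruby: all the extra arguments land in one dict.
      print host, options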

Ruby is like Perl in that the last value of a function is its implicit return value. In Python, if there isn't an explicit return statement, the return value is implicitly None.

Ruby does not make a distinction between expressions and statements like Python does. Hence, you can do:

  a = if 5 > 6
        7
      else
        puts "hi"
      end
This is like Scheme and ML.

Ruby is much more clever than Python at figuring out how to translate end of lines into statements. For instance, the following works in Ruby, but not in Python.

  a = 2 +
      2
Python would require parentheses.

I'm still trying to figure out the proper style for when you should use parenthesis in function calls and when you should leave them out in Ruby. The distinction is idiomatic.

Ruby's string interpolation syntax is '"foo #{2 + 2}"'. Python uses '"foo %s" % (2 + 2,)'.

The syntax for declaring a class method (what Java calls a static method) is strange, since Python uses "self" for something very different:

  def self.my_class_method
    puts "hi"
  end
I must admit that "@a" is easier to type than Python's "self.a" without any significant loss in readability.

Single quoted strings in Ruby are like single quoted strings in Perl or like raw strings in Python. They get less interpretation.

Instance variables are never directly accessible outside the class in Ruby, unlike in Python or even Java.

In Python, you may use a publicly accessible member on day one and change it to a property on day two behind everyone else's back. In Ruby, you use an attribute on day one. Fortunately, the syntax is very convenient, "attr_accessor :name". This is much more succinct than explicit getters and setters in Java.
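
Here's the Python trick I'm referring to, sketched with a made-up Person class:

  # Day one: a plain, public attribute.
  class Person(object):
      def __init__(self, name):
          self.name = name

  # Day two: same external interface, but now it's a property.
  class FancyPerson(object):
      def __init__(self, name):
          self._name = name

      @property
      def name(self):
          return self._name.title()   # callers still just write person.name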

Ruby has protected and private members, which Python and Perl purposely chose to leave out. A private member in Ruby is even more private than a private member in Java. If you have two instances of the same class, in Java one instance can access the other instance's private members, but that's not true in Ruby.

Ruby uses modules as mixins instead of using multiple inheritance to support mixins.

Ruby embraces what Python calls "monkey patching".

Python programmers generally try to avoid using eval, but I don't think that's the case in Ruby.

Ruby uses "items << item" to append to a list. Python uses "items.append(item)". PHP uses "items[] = item". This is one place where every language does it differently.

Ruby has "%w { foo bar bat }", which is like Perl's "q( foo bar bat )". Python doesn't have a construct for this, but you can use "'foo bar bat'.split()".

Ruby makes heavy use of symbols, like Scheme and Erlang (which calls them atoms). Python doesn't have symbols, so strings are used instead.

Ruby uses "elsif", whereas Python uses "elif".

Ruby doesn't need a colon in control structures, but it does require an end of line or a semicolon. Hence, to do a one liner in Python, it's:

  if 11 > 10: print "yep"
whereas in Ruby it's:
  if 11 > 10; puts "yep"; end
As everyone knows, Ruby supports blocks. Personally, the use of "|a, b|" to denote arguments to the block seems really strange to me. Who uses pipes for arguments? They don't even pair like parentheses do!

In Python, there's a much stronger emphasis on passing functions. I'm sure that it's possible in Ruby, but it's more natural to pass a block instead.

Ruby has a very different syntax for exception handling than most languages:

  begin
    a = some_func
  rescue FuncFailed
    puts "I'm hosed!"
  end
When you unmarshal an object in Ruby, all the class definitions have to be loaded already. Python will import them for you, assuming they can be imported.

Ruby allows function names like "empty!" and "empty?" which is clearly a matter of taste, but I like it. This is probably inspired by Scheme.

For some reason, it seems like using "help(String)" in Ruby is pretty slow, whereas using "help(str)" in Python is pretty fast. I wonder if Ruby doesn't have the docstrings attached to the object at runtime like Python does. In Python, this stuff is always loaded unless you use "-OO" for optimization.

The rest of these comments are inspired by: http://books.rubyveil.com/books/ThingsNewcomersShouldKnow

What Ruby calls "'foo'[0]", Python calls "ord('foo'[0])". What Python calls "'foo'[0]", Ruby calls "'foo'[0,1]" or "'foo'[0].chr".

In Ruby, a failed lookup in a hash returns a default value, which is usually nil. You can set a different default if you want. Python will raise an exception if you try to access a key that doesn't exist. However, in Python 2.5, you can now set a default value for the dict. Now I know where they got that idea from ;)
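
I believe the Python 2.5 feature I'm referring to is collections.defaultdict; for example:

  from collections import defaultdict   # new in Python 2.5

  counts = defaultdict(int)   # missing keys spring into existence as int() == 0
  counts["spam"] += 1
  print counts["eggs"]        # prints 0 instead of raising KeyError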

In Ruby, you say "(hash[key] ||= []) << value", whereas in Python, you say "hash.setdefault(key, []).append(value)."

In Python, it's "len(obj)" (i.e. len is a generic function). In Ruby, it's "obj.length" (i.e. polymorphism is used). This difference seems to happen a lot.
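
Under the hood, Python's len() just dispatches to the object's __len__ method, so the gap is smaller than it looks (Playlist is a made-up class):

  class Playlist(object):
      def __init__(self, songs):
          self.songs = songs

      def __len__(self):
          return len(self.songs)   # len() calls this under the hood

  print len(Playlist(["a", "b"]))  # prints 2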

In Ruby, strings are mutable. Hence, "s.upcase" returns an uppercased copy of s, whereas "s.upcase!" actually modifies s. Python strings are immutable.

Ruby doesn't have tuples (i.e. immutable arrays).

Ruby doesn't have as strong a notion of immutable objects as Python does. For instance, you may use mutable objects as hash keys in Ruby. Python forbids this. If you do change the value of a key in Ruby, you may want to let the hash recalculate the hash values for all the keys via "my_hash.rehash".
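
Here's what I mean on the Python side (a throwaway example):

  d = {}
  d[(1, 2)] = "fine"       # tuples are immutable, so they can be dict keys
  try:
      d[[1, 2]] = "nope"   # lists are mutable, hence unhashable
  except TypeError, e:
      print e              # complains that lists are unhashable (exact wording varies)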

Ruby will let you assign a new value to a variable defined outside your scope:

  i = 0
  (0..2).each do |i|
    puts "inside block: i = #{i}"
  end
  puts "outside block: i = #{i}"  # -> 'outside block: i = 2'
This was not previously possible in Python without using a workaround. However, Python is gaining this feature with the use of a keyword similar to the "global" keyword.
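
If I recall correctly, the keyword in question is the proposed "nonlocal" statement slated for Python 3; a hypothetical sketch:

  # Hypothetical Python 3 code using the proposed "nonlocal" keyword.
  def make_counter():
      i = 0
      def bump():
          nonlocal i   # rebind the enclosing function's i instead of shadowing it
          i += 1
          return i
      return bump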

Coming at it from a different angle, in Java, you use explicit variable declarations to assign scope. This is true too in Perl when you use "my". In Python, an assignment automatically sets scope. Hence, you can shadow variables in Java, Python, and Perl. Ruby tries really hard to avoid shadowing. Hence, whoever assigns to the variable first sets the scope. Shadowing can still happen in rare cases (see here), but it's a lot less likely.

Ruby assignments are expressions. Hence, you can do:

  while line = gets
    puts line
  end
Python purposely left this out because it's too easy to confuse "=" and "==". Hence, in Python you would write:
  for line in sys.stdin:
      print line
Ruby has two sets of logical operators. They have different precedences. Hence, "a = b && c" means "a = (b && c)", whereas "a = b and c" means "(a = b) and c". I'm going to agree with Guido on this one and say this is just too confusing.

Ruby has an === operator and case statements. This feature is a lot closer to the match feature in ML languages than anything in Python:

  case my_var
  when MyClass
    puts "my_var is an instance of MyClass"
  when /foo/
    puts "my_var matches the regex"
  end
This is really neat. Notice that, thankfully, Ruby doesn't require "break" like all the C inspired languages.

In Ruby, only false and nil are considered false in a Boolean expression. In particular, 0 (zero), "" or '' (empty string), [] (empty array), and {} (empty hash) are all considered true.

In Python, 0, "", '', [], and {} are all considered False. In general, Python has a very rich notion of "truthiness", and you can define the "truthiness" of your own objects.
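
For example (ShoppingCart is made up; in Python 2 the hook is __nonzero__, which becomes __bool__ in Python 3):

  class ShoppingCart(object):
      def __init__(self):
          self.items = []

      def __nonzero__(self):         # __bool__ in Python 3
          return len(self.items) > 0

  cart = ShoppingCart()
  if not cart:
      print "the cart is empty"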

In Ruby, if s is a string, you may write "s += 'a'" or "s << 'a'". The first creates a new object. The second modifies s. If you modify s, that may "surprise" other pieces of code that also have a reference to s. Python strings are simply immutable so this can't happen.

Ok, that's all for now! In my opinion, they're both great languages. If you know about more fun differences, post them below!

Suggested Reading: The Futures of Ruby Threading

This is a great little article that touches on threading in Ruby, Python, Java, and Erlang:

The Futures of Ruby Threading