Wednesday, October 29, 2008

Ruby: An Interesting Block Pattern

Ruby has blocks, which enable all sorts of interesting idioms. I'm going to show one that will be familiar to Rails enthusiasts, but was new to me.

I was reading some code in a book, and it had the following:
def if_found(obj)
if obj
yield
else
render :text => "Not found.", :status => "404 Not Found"
false
end
end
Here's how you call it:
if_found(obj) do
# We have a valid obj. Render something with it.
end
The code in the block will only execute if the obj was found. If it wasn't found, the response will already have been taken care of.

I've been in the same situation in Python (using Pylons), and I coded something like:
def handle_not_found(obj):
if not obj:
return render_404_page()
return None
Here's how you call it:
response = handle_not_found(obj)
if response:
return response
# Otherwise, continue normally.
Pylons likes to return responses, whereas render in Ruby works as a side effect whose return value isn't important. However, that's not my point.

My point is that the Python code uses "if response:" whereas the Ruby code uses "if_found(obj) do". Python uses an explicit if statement, whereas Ruby hides the actual if statement in a block. Similarly, Rubyists tend to write "my_list.each do |i|..." (even though Ruby has a for statement), whereas Pythonistas use "for i in my_list".

Ok, now that I've totally made a mountain out of a molehill, please note that I'm not saying either is better than the other. I'm just saying it's interesting to note the difference.

Tuesday, October 28, 2008

Python: Some Notes on lxml

I wrote a webcrawler that uses lxml, XPath, and Beautiful Soup to easily pull data from a set of poorly formatted Web pages. In summary, it works, and I'm quite happy :)

The script needs to pull data from hundreds of Web pages, but not millions, so I opted to use threads. The script actually takes the list of things to look for as a set of XPath expressions on the command line, which makes it super flexible. Let me give you some hints for the parts that I found difficult.

First of all, here's how to install it. If you're using Ubuntu, then:
apt-get install libxslt1-dev libxml2-dev
# I also have python-dev, build-essentials, etc. installed.
easy_install lxml
easy_install BeautifulSoup
If you're using MacPorts, do
port install py25-lxml
easy_install BeautifulSoup
The FAQ states that if you use MacPorts, you may encounter difficulties because you will have multiple versions of libxml and libxslt installed. For instance, the following may segfault:
python -c "import webbrowser; from lxml import etree, html"
Whereas the following shouldn't:
env DYLD_LIBRARY_PATH=/opt/local/lib \
python -c "import webbrowser; from lxml import etree, html"
You also have to be careful of thread safety issues. I was sharing an effectively read-only instance of the etree.XPath class between multiple threads, but that ended up causing bus errors. Ah, the joys of extensions written in C! It's a good reminder that the safest way to do multithreaded programming is to have each thread live in its own process ;)

lxml permits access to regular expressions from within XPath expressions. That's super useful. I had a hard time getting it working though. I forgot to pass in the right XML namespace in one part of the code. For some reason, I wasn't getting an error message. (As a general rule, I love it when software fails fast and complains loudly when I do something stupid.) Furthermore, my knowledge of XSLT was weak enough that I had a really hard time figuring out how to combine the XPath expression with the regex. Anyway, here's how to create an etree.XPath instance containing a regex:
from lxml import etree
XPATH_NAMESPACES = dict(re='http://exslt.org/regular-expressions')
xpath = etree.XPath("re:replace(//title/text(), 'From', '', 'To')",
namespaces=XPATH_NAMESPACES)
match = xpath(tree)
Anyway, lxml is frickin' awesome, and so is BeautifulSoup. Together, I can take really, really crappy HTML, and access it seemlessly.

Friday, October 17, 2008

Python: Permission denied: '/var/www/.python-eggs'

I have a Pylons app, and I got the following exception in my logs:
The following error occurred while trying to extract file(s) to the Python egg
cache:

[Errno 13] Permission denied: '/var/www/.python-eggs'

The Python egg cache directory is currently set to:

/var/www/.python-eggs

Perhaps your account does not have write access to this directory? You can
change the cache directory by setting the PYTHON_EGG_CACHE environment
variable to point to an accessible directory.
The problem is that the app was running as www-data (which was the user created for nginx and Apache). www-data's home directory is /var/www, but it doesn't have write access to it. (I'm afraid of allowing write access so that it can unpack eggs into that directory because that directory is the web root. In general, you should be careful of what you put in the web root.)

There are a few ways to address this problem. One is to make sure to always use --always-unzip when installing eggs. Another is to create a place for www-data to store its eggs by either changing its home directory or by setting the environmental variable PYTHON_EGG_CACHE.

I decided the simplest thing to do was to simply create a new user with a proper home directory.
adduser myapp  # Used a throwaway password.
vipw # Set the shell to /bin/false.
Once I did that, I updated the app to run as the myapp user and made sure it had access to all the directories it needed.

Trac requires its own user. I figure it's reasonable for my app to have its own user too.

Wednesday, October 15, 2008

Web: Flock

I've been using Flock for a few months now, and I finally noticed that it's not open source. Time for me to switch to Firefox 3 ;)