Friday, March 23, 2012

PyCon: Parsing Sentences with the OTHER Natural Language Tool: Link Grammar

See the website.

When it comes to NLTK (the Natural Language Toolkit), some assembly is definitely required. If you're not a linguist, it's not so easy.

Link Grammar is a theory of parsing sentences. It is also a specific implementation written in C. It's from CMU. It started in 1991. The latest release was in 2004.

Link Grammar does not use "explicit constituents" (e.g. noun phrases).

It puts an emphasis on lexicality.

Sometimes, specific words have a large and important meaning in the language. For instance, consider the word "on" in "I went to work on Friday."

pylinkgrammar is a Python wrapper for Link Grammar. (Make sure you use the version of pylinkgrammar on BitBucket.)

Often, there are multiple valid linkages for a specific sentence.

It can produce a sentence tree. It can even generate Postscript containing the syntax tree. (The demo was impressive.)
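
A rough sketch of what using pylinkgrammar looks like (the Parser class and parse_sent method here are from memory and may not match the BitBucket version exactly):
from pylinkgrammar.linkgrammar import Parser

p = Parser()
linkages = p.parse_sent('The quick brown fox jumped over the lazy dog.')
print len(linkages)        # number of valid linkages found
print linkages[0].diagram  # ASCII diagram of the first linkage (attribute name may differ)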

A link grammar is a set of rules defining how words can be linked together to form sentences.

Words have connectors to the left and/or the right.

General rules:
  • Connectivity: All words must be connected (with some exceptions).
  • Exclusion: No two links can connect the same pair of words.
  • Planarity: If links are drawn above words, no two links can cross.
  • Ordering: When satisfying multiple connectors on one side of a disjunct, the ordering of the words must match the ordering of the connectors.
The software does some post-processing in order to handle a lot of special cases.

The system has knowledge of 60,000 words and 170 (?) link types.

It's amazing how it can recognize the invalidity of certain sentences that are only very subtly invalid.

It can guess the purpose of unknown words.

It's pretty robust against garbage thrown into the sentence.

It's used as the grammar checker in AbiWord.

Link Grammar is a powerful theory for parsing sentences. It has a comprehensive implementation. It's easy to use with Python.

The distinction between syntax and semantics is pretty blurry.

PyCon: Building A Python-Based Search Engine

See the website. Here's the video.

Unfortunately, I showed up late for this talk and didn't take notes. However, it was one of the best talks! If you've ever wanted to build a search engine like Lucene from scratch, I highly recommend this talk!

PyCon: Parsing Horrible Things with Python

See the website.

He's trying to parse MediaWiki text. MediaWiki is based on lots of regex replacements. It doesn't have a proper parser.

He's doing this for the Mozilla wiki.

He tried Pyparsing. (Looking at it, I think I like PLY better, syntactically at least.) He had problems with debugging. Pyparsing is a recursive descent parser.

He tried PLY. He really likes it. It is LALR or LR(1). PLY has stood the test of time, and it has great debugging output.

However, it turns out that MediaWiki's syntax is a bit too sloppy for PLY.

LALR or LR(1) just doesn't work for MediaWiki.

Next, he tried Pijnu. It supports PEGs (parsing expression grammars). He got it to parse MediaWiki. However, it has no tests, it's not written Pythonically, it's way too slow, and it eats up a ton of RAM!

He wrote his own parser called Parsimonious. His goals were to make it fast, short, frugal on RAM usage, minimalistic, understandable, idiomatic, well tested, and readable. He wanted to separate parsing from doing something with the parse tree. Hence, Parsimonious's goal is to produce an AST. Parsimonious is very readable.

If you optimize a bit of code, write a test to test that the optimization technique keeps working.

In Parsimonious, you put all the parsing rules in one triple quoted string.

It has nice syntax.
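
For flavor, here's roughly what a small Parsimonious grammar looks like (a sketch based on the project's README; the rule syntax may have changed since the talk):
from parsimonious.grammar import Grammar

grammar = Grammar(r"""
    greeting = "Hello, " name "!"
    name     = ~"[A-Za-z]+"
    """)
tree = grammar.parse("Hello, PyCon!")  # returns a tree of Node objects
print tree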

See github.com/erikrose/mediawiki-parser/blob/master/parsers.rst for his notes on a ton of different parsers.

Proper error handling is a master's thesis in itself.

There are Python bindings for Antlr. However, he didn't want a separate code generation step, and it's slow.

You can do interesting things with sys._getframe and dis.dis.

PyCon: Keynote: Guido van Rossum

Take the survey at svy.mk/pycon2012survey.

Videos are already being uploaded to pyvideo.org.

Guido is wearing a "Python is for girls" T-shirt.

He showed a Python logo made out of foam on a latte. Yuko Honda really did it.

He had some comments about trolls.

Troll, n:
  • A leading question whose main purpose is to provoke an argument that cannot be won.
  • A person who asks such questions.
Guido said that Python, Ruby, and Perl are exactly the same language from 10,000 feet. They should be our friends. Lisp is not all that different from Python, although you might not notice it. Ruby is not kicking Python's butt in any particular case.

They added unicode literals back to Python 3.3 just to make compatibility with Python 2.7 easier.

"I should have used a real presentation tool instead of Google Docs." (Doh! He's a Googler!)

More and more projects are being ported to Python 3.

People are actually capable of using one code base for Python 2.7 and Python 3.3. There's a helpful library called "Six".

The Python 3 migration is painful now, but it's going to be worth it.

If PyPy is faster than CPython, why not abandon CPython? No one is using PyPy in production. PyPy is still hard for people writing C extensions. CPython is still a great glue system (for Fortran, C, and random binary code). "PyPy still has a long way to go. Speed is not everything." PyPy and CPython will live together in slightly different contexts for years. CPython will remain the reference implementation.

People still criticize dynamic typing, saying it's not safe. "The type checking in compilers is actually really weak." For instance, most of the time, a type checker can't tell you if an int represents inches or centimeters. The only way to create good code is to test, audit, hire smart people, etc. Test! A type checking compiler can only find a narrow set of errors; no one has conquered the problem of writing perfect software.

A large program written in a statically typed language will often have a dynamically typed, interpreted subsystem, done via string manipulation, or something else.

Dynamic typing is not inherently inferior to static typing.

Guido hopes people are "doing interesting things with static analysis."

Guido pronounces GIL as "Jill".

"OS level threads are meant for parallel IO, not for parallel computation." Use separate processes for each core. One team was using Python on a machine with 64,000 cores. "You have to be David Beazley to come up with the counterexample."

Guido has complaints about each of the async libraries, but he doesn't have a solution that he recommends. It's not a problem he has to deal with. gevent is interesting. He can't add stackless to Python because it would make other implementations much more complicated or impossible. He doesn't want to put an event loop in the stdlib because not everyone needs it. Some things are better developed outside of Python. He's not a fan of callback-style approaches; he would prefer the gevent approach. He likes synchronous looking code that under the hood does async, without using threads. All these things deserve more attention.

We might want to make Python 4 the new language to solve the concurrent programming problem. He wants to see lots of alternatives.

Introducing a new language to replace JavaScript is such a hard, political problem.

He was told running Python on Android is "strategically not important." He was asked not to work on it anymore. Other people are going to have to do it.

"I don't think Python is a functional programming language. map, filter, lambda are not enough to make a functional programming language. Python is very pragmatic. It lets you mess around with state, fiddle with the OS, etc. Those things are really hard in a beautiful language like Haskell. When you have to do messy IO in Haskell, all the messiness becomes even messier.

Learn a functional language, get excited about the ideas, and let it impact your Python code. You don't have to do everything with a functional style.

"From data to data" really fits the functional paradigm. You can use that in Python.

He likes functional programming as a challenge and as an idea. However, 30 years from now, we won't all be writing functional programs all the time.

You're much better off not being in the stdlib. If you have a broken API, you can never fix it. Python has a very slow release schedule. It's much better to be third-party. Don't feel like you're not successful unless you're in the stdlib. It's better to own your own release cycle. Things in the stdlib should be things that enable everything else, like a context manager.

Python has reference counting and a garbage collector that resolves cycles, etc. PyPy is the place to play with better memory allocators. "I'm very happy with CPython's memory allocator."

Let's set our expectations low about adding new features to the language since we just finished Python 3. If you really have a desperate need for an extension to the language, think about using a preprocessor [my term] of some sort. Import hooks can let you do the transformation automatically.

PyCon: Python for Data Lovers: Explore It, Analyze It, Map It

See the website.

I missed the beginning of this talk, and since I'm not a data lover, I'm afraid my notes may not do it justice.

There is lots of interesting "open data."

There is a lot of data that is released by cities.

She's a geographer and obviously a real data lover. She gets excited about all this data.

csvkit is an amazing set of utilities for working with CSV. It replaces the csv module.

"Social network analysis is focused on uncovering the patterning of people's interactions."

They used QGIS.

She relies heavily on Google Refine.

PySAL is really great for spatial analysis.

She recommended "Social Network Analysis for Startups" from O'Reilly. Her advisor wrote it.

PyCon: Storm: the Hadoop of Realtime Stream Processing

See the website.

"Storm: Keeping it Real(time)."

The speaker is from dotCloud, which is a platform for scaling web apps.

They're in the MEGA-DATA zone.

They were using RRD.

Storm is a real-time computation framework.

It can do distributed RPC and stream processing.

It focuses on continuous computation, such as counting all the things passing by on a stream.

Storm does for real-time what Hadoop does for batch processing.

It is a high-volume, distributed, horizontally scalable, continuous system.

Even if the control layer goes down, computation can keep going.

Its strategy for handling failures is to die and recover quickly.

It is fault tolerant, but not fault proof.

Data is processed at least once. With more work and massaging, they have support for "exactly once".

Storm does not handle persistence.

If failures happen, it resubmits stuff through the system.

It doesn't process batches reliably.

It complements Hadoop, but does not attempt to replace Hadoop.

It does not protect against human error.

He suggested that one day we'll use a mix of batching and streaming to get the benefits of both.

Storm has three core elements:
  1. Spouts inject data into the system. This could be data from a queue or from the Twitter firehose.
  2. Streams are unbounded sequences of Storm tuples. These are like named tuples. All tuples in the same stream must have the same "shape".
  3. Bolts take input streams and transform them into output streams. Zero or more inputs produce zero or more outputs. Most of the computation happens here.
Spouts and bolts can both produce multiple output streams.

A topology is a set of spouts and bolts connected by streams.

This is a higher-level abstraction than message passing.

All of this is done over 0mq. It uses ZooKeeper for discovery. Storm simplifies all of this.

They're handling 10k-100k requests per second at their company.

Storm is JVM based. It's a 50/50 mix of Java and Clojure. It has a multilingual API. Script bolts can be written using a thin shell that shells out to Python.
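
A Python bolt written against Storm's multilang storm.py helper looks roughly like this (a sketch based on the storm-starter examples; details may vary by version):
import storm

class SplitSentenceBolt(storm.BasicBolt):
    def process(self, tup):
        # Each incoming tuple carries one sentence; emit one tuple per word.
        for word in tup.values[0].split(" "):
            storm.emit([word])

SplitSentenceBolt().run()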

"Umbrella" protects you from the Storm. It lets you use Storm pretty much Java free.

He's using nested classes for declarative programming.

He deployed it on dotCloud.

A storm topology can even have cycles.

PyCon: Pragmatic Unicode, or, How do I stop the pain?

See the website.

See the slides.

This was one of the best talks. The room was packed. This is the best unicode talk I've ever been to!

Computers deal with bytes: files, networks, everything. We assign meaning to bytes using convention. The first such convention was ASCII.

The world needs more than 256 symbols. Character codes were a system that mapped single bytes to characters. However, this still limited us to 256 symbols.

EBCDIC, APO, BBG, OMG, WTF!

Then, they tried two bytes.

Finally, they came up with Unicode.

Unicode assigns characters to code points. There are 1.1 million code points. Only 110k are assigned at this point. All major writing systems have been covered.

"Klingon is not in Unicode. I can explain later."

Unicode has many funny symbols, like a snowman and a pile of poo.

"U+2602 UMBRELLA" is a Unicode character.

Encodings map unicode code points to bytes.

UTF-16, UTF-32, UCS-2, UCS-4, and UTF-8 are all encodings.

UTF-8 is the king of encodings. It uses a variable number of bytes per character; hence it's a variable length encoding. ASCII characters are still only one byte in UTF-8.

No Unicode code point needs more than 4 UTF-8 bytes.

Python 2 and Python 3 are radically different.
In Python 2
A str is a sequence of bytes, like 'foo'. A unicode object is a sequence of code points, like u'foo'.

bytes != code points!

unicode.encode() returns bytes. bytes.decode() returns a unicode object.
my_unicode = u"Hi \u2119"
my_utf8 = my_unicode.encode('utf-8')
my_unicode = my_utf8.decode('utf-8')
Many encodings only support a subset of unicode. For instance, .encode('ascii') will fail for characters out of the range(128).

Random byte streams cannot successfully be decoded as UTF-8. This is a feature that tells you when you're doing something wrong (i.e. decoding something that isn't actually UTF-8).

If some characters can't be represented in the target encoding, you can handle the errors in multiple ways. For instance, you can replace the offending characters with "?" by using "my_unicode.encode('ascii', 'replace')". There are other error handlers available as well. See the second argument to the encode method.
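
A quick illustration of those error handlers (Python 2 syntax):
text = u"Hi \u2119"
text.encode('ascii', 'replace')            # 'Hi ?'
text.encode('ascii', 'ignore')             # 'Hi '
text.encode('ascii', 'xmlcharrefreplace')  # 'Hi &#8473;'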

Python 2 tries to implicitly do conversions when you mix bytes and unicode. This is based on sys.getdefaultencoding().

This helpfulness is very helpful when everything is ASCII. If it isn't, it's PAINFUL!

You have both bytes and unicode, and you need to keep them straight.
In Python 3
The biggest change in Python 3, and the one that causes the most pain, is unicode.

A str is a sequence of code points (i.e. Unicode), such as "Hi \u2119".

A bytes object is a sequence of bytes, such as b"foo".

Python 3 does not provide any automatic conversion between bytes and (unicode) strs.

Mixing bytes and (unicode) strs is always painful in Python 3. You are forced to keep them straight. The pain is much more immediate.

The data you get from files depends on how you open it. For instance, if you use "open('f', 'rb')", you'll get bytes because of the 'b'. You can also pass an encoding argument.
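
For example, in Python 3 (the filename here is made up):
with open('notes.txt', 'rb') as f:
    data = f.read()        # bytes

with open('notes.txt', encoding='utf-8') as f:
    text = f.read()        # str (code points), decoded for you at the edge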

See: locale.getpreferredencoding()

stdin and stdout are preopened file handles. That can complicate things.
Relieving the Pain
Think of your program as a Unicode sandwich. It should use bytes on the outside and unicode objects on the inside. It should encode and decode at the edges. Beware that libraries might be doing the encoding and decoding for you.

Know what you have. Is it bytes or unicode? If it's bytes, what's the encoding?

Encoding is out-of-band. You must be told what encoding the bytes have. You cannot infer. You must be told, or you have to guess (which may not always work).

Data is dirty. Sometimes you get data and you're wrong about what encoding it's in.

Test unicode. Get some exotic text. Upside down text is awesome.

There are a lot more details such as BOMs (byte order marks).

Cherokee was recently given a writing system.

Japanese people have a large set of emoticons, and they were added to Unicode.

U+1F47D is an alien emoticon.

He showed a bunch of cool Unicode glyphs.

It's better to not have an implicit encoding. It's better to always be explicit.

Don't mess with the default system encoding!

A BOM shouldn't appear in UTF-8 at all since it is unnecessary. However, Python has an encoding that basically says, "this is UTF-8, but ignore the BOMs".

The unicodedata module has a remarkable amount of information.

The implicit encoding and decoding present in Python 2 doesn't even exist in Python 3.

In Python 2, there's an "io" module that knows how to open files with the correct encoding.

When piping to a file, stdout defaults to UTF-8. When outputting to terminal, stdout uses the terminal encoding.

PyCon: How the PyPy JIT Works

See the website.

"If the implementation is hard to explain, it's a bad idea." (Except PyPy!)

The JIT is interpreter agnostic.

It's a tracing JIT. They compile only the code that's run repeatedly through the interpreter.

They have to remove all the indirection that's there because it's a dynamic language.

They try to optimize simple, idiomatic Python. That is not an easy task.

(The room is packed. I guess people were pretty excited about David Beazley's keynote.)

There's a metainterpreter. It traces through function calls, flattening the loop.

JIT compiler optimizations are different than compiler optimizations. You're limited by speed. You have to do the optimizations fast.

If objects are allocated in a loop and they don't escape the loop, they don't need to use the heap and they can remove boxing.

They do unrolling to take out the loop invariants.

They have a JIT viewer.

Generating assembly is surprisingly easy. They use a linear register allocator. The GC has to be informed of dynamic allocations.

They use guards that must be true to continue using the JITted code. I.e., did the code raise an exception?

They have data structures optimized for the JIT such as map dicts.

They can translate attribute access to an array in certain cases.

The JIT is generated from an RPython description of the interpreter.

The metainterpreter traces hot loops and functions.

They use optimizations that remove indirection.

They adapt to new runtime information with bridges.

They added stackless support to the JIT.

They want the JIT to help with STM.

They have Prolog and Scheme interpreters written on top of the PyPy infrastructure.

They don't do much with trying to take advantage of specific CPUs.

PyCon: Why PyPy by Example

See the website.

PyPy is a fast, open source Python VM.

It's a 9 year old project.

PyPy is not a silver bullet.

For speed comparisons, see speed.pypy.org.

PyPy is X times faster than CPython. If it's not faster than CPython, it's a bug.

Hard-core number crunching in a loop is much, much faster in PyPy.

(When I think about PyPy, V8, and all the various versions of Ruby, it makes me think that it's an amazing time for VMs!)

If you think of the history of software engineering, GC was hard to get right, but now it's mostly done. Now we talk about how to use multiple cores. It's a mess with locks, semaphores, events, etc. However, one day, using multiple cores will be something that is somewhat automatic like GC is.

He said nice things about transactional memory. It promises to give multicore usage. It has hard integration issues just like GC did. His solution is to run everything in transactional memory. I.e. let the decision about when to use transactional memory be pushed down into the underlying system.

He's using PyPy to implement STM (software transactional memory).

Concerning PyPy, "The code base might be hairy, but we are friendly."

They used TDD to write PyPy.

They spent 8 years building infrastructure.

ctypes works in PyPy.

PyPy operates from the level of bytecode down.

"Do you have to look like a Haskell programmer to work on STM?"

Haskell's STM is very manual. PyPy's is automatic.

"IRC is better than documentation."

I think this is the year of PyPy ;)

PyCon: Let's Talk About ????

David Beazley gave the keynote on the second day of PyCon. He decided to talk about PyPy.

PyPy made his code run 34x faster, without changing anything.

In theory, it's easier to add new features to Python using PyPy than CPython.

He's been tinkering with PyPy lately.

IPython Notebook is cool.

Is PyPy's implementation only for evil geniuses?

PyPy scares him because there is a lot of advanced computer science inside.

He doesn't know if you can mess around with PyPy.

It takes a few hours to build PyPy.

It needs more than 4G of RAM.

PyPy translates RPython to C. It generates 10.4 million lines of C code!

PyPy is implemented in RPython, which is a restricted subset of Python.

"RPython is [defined to be] everything that our translation toolchain can accept."

The PyPy docs are hard to read.

4513 .py files, 1.25 million non-blank lines of Python.

translate.py converts RPython code to C.

The PyPy version is faster than the C version of Fibonacci! Although, if you turn on C optimizations, they're similar.

RPython is a restricted subset of Python that they used to implement the Python interpreter.

RPython can talk to C code. It's similar to ctypes.

RPython has static typing via type inference.

RPython has to think of the whole program and do type inferencing.

The implementation will blow your mind. It has "snakes and the souls of Ph.D students on the inside."

PyPy doesn't parse your Python. It uses Python code objects.

PyPy has a Python bytecode interpreter.

PyPy translates itself to C using its own bytecode interpreter.

They have regular Python and RPython in the same modules. They have the same syntax, but different semantics. Sometimes, they add docstrings with "NOT_RPYTHON" in them to keep track of which is which.

Stuff that happens at import time is normal Python. Code reached by the entry function is RPython.

They have a foreign function interface and something that's like autoconf.

They use decorators a lot.

"I still don't know how PyPy works."

"I don't even know how CPython works."

He does know how to use the things that make CPython work (ANSI C, Makefiles, etc.).

PyPy has a different set of tools: RPython, translate.py, metaprogramming, FFI.

Ruby is 3600x slower than Python on message-passing with a CPU-bound thread. They had a more extreme case of the same problem Python 3.3 had.

Ruby has a GIL.

He felt completely beat up and out of his league looking at the PyPy source. (I feel better now.)

Can you tinker with PyPy? He still doesn't know. He recommends that you do it anyway.

PyCon: Welcome Message on the Second Day

There were 2300 people at PyCon.

180 people came to the PyCon 5K race. There were 5 people who finished in under 20 minutes.

Steve Holden is the current chairman of the Python Software Foundation. However, he's letting someone else take over. He kind of gave up on OSS before coming to Python, but has since changed his mind.

There was still a tremendous gender imbalance at PyCon, but there were a lot more women this year. There was at least one woman in every row when I looked around.

Yesterday, the keynote had dancing robots. You can control them with Python.

PyCon: Lightning Talks

Numba is a Python compiler for NumPy and SciPy. It replaces byte-code on the stack with simple type-inferencing. It translates to LLVM. The code then gets inserted into the NumPy runtime. They use LLVM-PY. They have a @numba.compile decorator. It's from Continuum Analytics.

IHasAMoney.com is a replacement for mint.com. He doesn't trust mint.com. IHasAMoney.com does not require the use of a mouse--it's for hackers. You can run it locally so that you don't have to give another web site your bank passwords.

Why do so many talks fall flat? Your talk should tell a story. People are story tellers. People care about people. Show puzzles, not solutions. Hacking is a skill, not a piece of knowledge.

He was measuring the Python 3 support for packages on PyPI. 54-58% of the top 50 projects on PyPI support Python 3. We planned on moving to Python 3 over the course of 5 years. We're at year 3. Update your Trove classifiers to say that your project supports Python 3.

He got Python working on an iBook. This is helpful for eBooks. He used Emscripten to compile CPython into JavaScript. This does not require jail breaking. See bit.ly/pyonbook (?).

PyCon 2014 and 2015 will be in Montreal. You'll need a passport.

bpython is an interactive shell. It only works on UNIXy systems. It looks gorgeous! It has syntax highlighting. It shows you all the callables on an object. It even shows you the docstrings, etc. It looks like Curses. You can jump to the source easily. It has rewind. It looks like a curses-based Java IDE (in a good way). pip or easy_install bpython.

Rpclib makes it easy to expose your service using multiple protocols. You can specify types for input and output arguments. You can expose your API using a WSGI-compliant server. It also works with SOAP. It can produce XML output. It can also generate HTML microformats.

Python 3.3 will be awesome. PEP 393 gets rid of UCS2 vs. UCS4. It uses a codepoint abstraction. It surpasses the Unicode support in other languages. We won't have any more surrogate pair problems. This makes us as good as Perl, which apparently has very good Unicode support. Python 3.3 also unifies IOError and OSError. There is a new "yield from obj" syntax to flatten iterators (wahoo!). It also has a new packaging module.

HUB is a wrapper around Git that makes working with GitHub easier. The only way you can install it is via Homebrew on a Mac. It lets you do lots of things on the command line that you would normally have to use the website for. You can use hub as an alias for git; it's a wrapper.

__init__ does not get re-run on unpickled objects. Hence, you can't add new members in new versions of __init__ because objects pickled with the old version of __init__ will not get those members. However, __new__ is run. All pickled classes must be at module scope and consistently named. Only do one dump; don't use separate dumps because you might get multiple copies of subobjects.

Someone on the virtualenv team said, "I can't believe [virtualenv] even works at all." They're working on virtualenv 3, which they're hoping to get into Python 3.3. They want people to try it out. This is PEP 405.

PyPy gave a 10x speedup for generating fractals using his code. Shed Skin gave a 50x speedup, but only accepts a subset of Python. It compiles down to C++. But, using NumPy and Cython, he got 207x speedup (using multiple cores); using prange. NumPy gets behind the GIL. It takes about a day to learn how to do this stuff.

Hieroglyph is an extension for Sphinx which helps you write HTML5 slides from reStructured Text.

PyCon: Introspecting Running Python Processes

See the website.

What is your application doing?

Logging is your application's diary, but there are some drawbacks.

gdb-heap, eventlet's backdoor module, and Werkzeug's debugger are all useful tools.

These all have tradeoffs.

What's missing compared to the JVM? Look at JMX.

jconsole connects to a running JVM.

jstack sends a signal to the JVM to dump the stack of every thread.

You can expose metrics via JMX.

New Relic and Graphite are also useful.

New Relic does hosted web app monitoring.

Graphite is a scalable graphing system for time series data.

socketconsole is a tool that can provide stack trace dumps for Python processes. It even works with multi-processed and multi-threaded apps. It does not use UNIX signals.

mmstats is "the /proc filesystem for your application." It uses shared memory to expose data from an app. It has a simple API.

mmash is a web server that exposes stuff from mmstats.

He uses Nagios. He has pretty graphs.

Projects used in this talk: socketconsole, mmstats, and mmash (see above). See also: groups.google.com/group/python-introspection.

(There wasn't enough time for more than a few questions for each of the talks this year. That's really too bad because, often, the questions are really interesting.)

PyCon: Python Metaprogramming for Mad Scientists and Evil Geniuses

See the website.

This was one of the best talks.

Python is ideal for mad scientists (because it's cool) and evil geniuses (because it has practical applications).

Equipment:
  • Synthetic functions, classes, and modules
  • Monkey patching
  • sitecustomize.py
"Synthetic" means building something without the normally required Python source code.

Synthetic functions can be created using exec.

Synthetic classes can be created using type('name', (), d).
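
A minimal sketch of both tricks in Python 2 (the function and class here are just placeholders):
ns = {}
exec "def greet(name):\n    return 'Hello, %s!' % name" in ns
greet = ns['greet']        # a synthetic function, no source file involved
print greet('world')

Point = type('Point', (object,), {'x': 0, 'y': 0})
p = Point()                # a synthetic class, no class statement involved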

(exec and eval are very popular at PyCon this year. Three talks have shown good uses for them. I wonder if this is partially inspired by Ruby.)

Here's how to create a synthetic module:
import new
import sys

module = new.module('name')
sys.modules['name'] = module
Functions, classes, and modules are just objects in memory.

Patching third-party objects is more robust than patching third-party code.

You can use these tricks to implement Aspect-Oriented Programming.

(I wonder if it's possible to implement "call by name" using the dis module and messing with the caller's frame.)

You might need to synthesize a replacement module if you need to replace a module written in C.

Monkeypatching a class is trickier:
MyClass.spam = new.instancemethod(new_spam, None, MyClass)  # patch the class (affects every instance)
...
obj.spam = new.instancemethod(new_spam, obj, MyClass)  # patch a single instance
The room was packed. This was a very popular talk.

This is how to "fix" executables:
  • Create a special version of sitecustomize.py containing your fix.
  • Create a wrapper script for the executable that sets PYTHONPATH to contain the dir containing sitecustomize.py.
People should use "#!/usr/bin/env python" so that the caller can control which Python it uses.

He monkeypatched __import__ so that code gets executed whenever someone tries to import something! Wow! I've never seen that trick before!

sitecustomize.py gets invoked really early. It gets imported so early, it can be an awkward environment to try to work in. For instance, sys.argv doesn't even exist yet.

He showed a bunch of good monkeypatching examples. For instance, he monkeypatched the pwd module.

It's okay if code breaks, as long as it breaks early and loudly.

You can shove anything in sys.modules as long as it responds to __getattr__. You can even synthesize behavior based on dynamic dispatch.

It's easy to get into a situation where things don't even make sense anymore for other people debugging your code.

If you're going to patch third-party code:
  • Do it seldom.
  • Do it publicly.
New releases of third-party code can still break a monkey patch.

More evil genius tools: code generation. Generating code is good because you'll get line numbers.

Using the ast module is another approach.

"new" is deprecated (in Python 3, I think). It's been replaced by the "types" module.

PyCon: Make Sure Your Programs Crash

See the website.

This talk was given by Moshe Zadka from VMware.

Think about how to crash and then recover from the crash.

If your application recovers quickly, stuff can crash and no one will see.

Even Python code occasionally crashes due to C bugs, untrapped exceptions, infinite loops, blocking calls, thread deadlocks, inconsistent resident state, etc. These things happen!

Recovery is important.

A system failure can usually be considered to be the result of two program errors. The second error is in the recovery routine.

When a program crashes, it leaves data that was written in an arbitrary program state.

Avoid storage: caches are better than master copies.

Databases are good at transactions and at recovering from crashes.

File rename is an atomic operation in modern OSs.

Think of efficient caches and reliable masters. Mark cache inconsistency.

He seems to be skeptical of the ACID nature of MySQL and PostgreSQL. I'm not sure why.

Don't write proper shutdown code. Always crash so that your crash code always gets tested. Your data should always be consistent.

Availability: if the data is consistent, just restart.

To get into the high 9s, recover very quickly. Limit impact, detect the crash quickly, and startup quickly.

Vertical splitting: different execution paths, different processes. Apache can have a child process die with no impact on availability.

Horizontal splitting: different code bases, different processes.

Watchdog: monitor -> flag -> remediate.

Watchdog principle: keep it simple, keep it safe.

A process can touch a file every 30 seconds. The watchdog sees whether the file has been touched.
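
A minimal sketch of that idea (the file path and threshold are made up):
import os
import time

HEARTBEAT = '/tmp/myapp.heartbeat'

def touch_heartbeat():
    # Called every 30 seconds by the monitored process.
    with open(HEARTBEAT, 'a'):
        os.utime(HEARTBEAT, None)

def heartbeat_is_fresh(max_age=90):
    # Run from the separate, simple watchdog process.
    try:
        return time.time() - os.path.getmtime(HEARTBEAT) < max_age
    except OSError:
        return False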

The watchdog and the processor restarter should not be in the same process, because the watchdog should be simple. Remember: separation of concerns.

Mark problems. Check solutions. See if restarting worked.

Everything crashes: plan for it.

Linux has a watchdog daemon. Use that to watch your watchdog.

PyCon: Apache Cassandra and Python

See the website.

See the slides.

He doesn't cover setting up a production cluster.

Using a schema is optional.

Cassandra is like a combination of Dynamo from Amazon and BigTable from Google.

It uses timestamps for conflict resolution. The clients determine the time. There are other approaches to conflict resolution as well.

Data in Cassandra looks like a multi-level dict.

By default, Cassandra eats 1/2 of your RAM. You might want to change that ;)

He uses pycassa for his client. It's the simplest approach.
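
Basic pycassa usage looks something like this (the keyspace, column family, and data are made up):
import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
users = pycassa.ColumnFamily(pool, 'Users')
users.insert('alice', {'name': 'Alice', 'email': 'alice@example.com'})
print users.get('alice')    # {'email': 'alice@example.com', 'name': 'Alice'}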

telephus is a Cassandra client for Twisted.

cassandra-dbapi2 is a Cassandra client that supports DBAPI2. It's based on Cassandra's new CQL interface.

Don't use pure Thrift to talk to Cassandra.

Cassandra is good about scaling up linearly.

There's a batch interface and a streaming interface.

There's a lot of flexibility concerning column families. You can even have columns representing different periods in time.

Pycassa supports different data types.

Pycassa has an interface that looks a little more like an ORM.

It has native indexes. However, indexes are not recommended for "high cardinality" values like timestamps or keywords.

PyCon: Code Generation in Python: Dismantling Jinja

See the website.

See also bit.ly/codegeneration.

Is eval evil? How does it impact security and performance?

Use repr to get something safe to pass to eval for a given type.

Eval code in a different namespace to keep namespaces clean.

Using code generation results in faster code than writing a custom interpreter in Python.

Here is a little "Eval 101".

Here is how to compile a string to a code object:
code = compile('a = 1 + 2', '', 'exec')
ns = {}
exec code in ns # exec code, ns in Python 3.
ns['a'] == 3
In Python 2.6 or later, you can use "ast.parse('a = 1 + 2')" and then pass the result to the compile function.
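
Putting that together (plain stdlib usage, Python 2 syntax):
import ast

tree = ast.parse('a = 1 + 2')            # a Module AST node
code = compile(tree, '<string>', 'exec')
ns = {}
exec code in ns
assert ns['a'] == 3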

You can modify the ast (abstract syntax tree).

You can assign line numbers.

You don't have to pass strings to eval and exec. You can handle the compilation to bytecode explicitly. You can also execute the code in an explicit namespace.

Jinja mostly has Python semantics, but not exactly. It uses different scoping rules.

Lexer -> Parser -> Identifier Analyzer -> Code Generator -> Python Source -> Bytecode -> Runtime

Everything before the runtime can be done ahead of time and cached.

Because WSGI uses generators, Jinja also uses generators for output.

You can run untrusted template code with Jinja. They restrict what the Python can do. (I'm skeptical.)

They have automatic escaping.

In the art of code generation, you must think in terms of low level vs. high level.

(I got a little confused at this point about whether Jinja generated bytecode, ASTs, or Python source. Later in the talk, it seemed like he was saying that Jinja always generated Python code because it was the only workable option at the time.)

Using the ast module only became an option later.

He thought about generating bytecode. However, that doesn't work on Google App Engine. Furthermore, it was too implementation specific.

Using the ast module is more limited. However, it's easier to debug. Furthermore, it does not segfault the interpreter (at least starting in Python 2.7).

Using pure source code generation always works. However, it's very limited, and it's hard to debug without hacks.

The ast module is much better.

Jinja is way faster than Django templates.

Code running in a function is faster than running at global scope because local variable lookup is faster.

They keep track of identifiers and track them through the source code.

The context object in Jinja2 is a data source (read only). In Django, it's a data store (read write).

What happens in the include stays in the include. An include can't change a variable in an outer scope.

Jinja looks at your template and generates more complicated code if your code needs more complicated code.

{% for item in sequence %} creates item in a context that's only valid in the for loop.

Jinja used manual code generation because it was the only option. AST compilation is new in Python 2.6.

A Markup object wraps a string, but has autoescaping. It uses operator overloading. Jinja can do some escaping at compile time.

Undefined variables in Jinja are replaced by undefined objects so that they print out as empty strings. However, doing an attribute lookup on such an object raises an exception.

He would use the ast module if he had to do it all over again.

PyCon: Advanced Python Tutorials

I took Raymond Hettinger's Advanced Python I and II tutorials. These are my notes. See the website for more details: I and II.

Here's the source code for Python 2 and Python 3.

Raymond is the author of itertools, the set class, the key argument to the sort function, parts of the decimal module, etc.

He said nice things about "Python Essential Reference".

He said nice things about the library reference for Python. If you install Python, it'll get installed.

Read the docs for the built-in functions. It's time well-invested.

He likes Emacs and Idle. He uses a Mac.

Use the dis module to disassemble code. That's occasionally useful.

Use rlcompleter to add tab completion to the Python shell.

Use "python -m test.pystone" to test how fast your machine is.

Show "python -m turtle" to your kids.

Don't be afraid to look at the source code for a module.

He likes "itty", a tiny web framework.

The decimal module is 6000 lines long!

Idle has more stuff than I thought, although I still think PyCharm is better.

He seems to use Idle to browse code and Emacs to edit code.

Use function.__name__ to get a function's name.

Use a bound method to save on method lookup in a tight loop. Notice the naming pattern:
s = []
s_append = s.append
He is very optimistic about PyPy. He thinks it'll become the de facto standard for Python use.

Here are some optimization tips:
  • Replace global lookups (and builtin lookups) by setting local aliases.
  • Use bound methods to avoid method lookups.
  • Minimize pure-python function calls inside a loop.
A new stack frame is created for every function call.

You should only need to use speedups like the above in a handful of places such as inner loops.

Listening to him explain how expensive even simple things in Python are makes me want to switch to Go ;)

Manually inline function calls in some cases.

Here's how to time code:
from timeit import Timer

# stmt and setup are strings of Python code to benchmark and to set up.
print min(Timer(stmt, setup).repeat(7, 20))
"Loop invariant code motion" is an optimization technique where you move stuff outside the loop where possible.

"Vectorization" [according to him] means replacing CPython's eval-loop with a C function that does all the work for you. For instance, he suggests moving from list comprehensions to map where it makes sense.

Use multiprocessing.Pool.map to parallelize map.
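
For example (a minimal sketch):
from multiprocessing import Pool

def square(n):
    return n * n

if __name__ == '__main__':
    pool = Pool()                      # defaults to one worker per core
    print pool.map(square, range(10))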

He keeps plugging PyPy.

itertools.repeat(2, 100) repeats 2 over-and-over again, 100 times.

itertools.count(0) counts starting at 0.

In some cases, switching to itertools can get you most of the performance benefits that you might get by switching to C.

Here are the optimization techniques he covered: vectorize, localize, use bound methods, move loop invariants out of the loop, and reduce the number of Python function calls.

itertools now has new functions called permutations, combinations, and product (which gives you the cartesian product of two sequences).

You can use these to generate all the possible test cases given a set of states.

Think of itertools.product as the functional approach to nested for loops:
from itertools import product

for t in product([0, 1], repeat=3):
    print t

is the same as:

for a in [0, 1]:
    for b in [0, 1]:
        for c in [0, 1]:
            print (a, b, c)
Other useful things:

functools.partial()

collections.OrderedDict()

collections.Counter()

vars(foo) == foo.__dict__

Use dir(foo) to get the public API for foo.

sorted(vars(collections).keys()) is the same as dir(collections), but dir also removes the private methods.

"Everything in Python is based on dictionaries."

He said that Guido added OOP to Python in a weekend.

Raymond showed code that simulated classes using just functions and dicts.

"import antigravity" launches the famous XKCD cartoon on Python in a browser.

Use "webbrowser.open(url)" to open a URL in a browser.

ChainMap is a new tool in Python 3.3 to do a chain of lookups in a list of dicts.

"I used to be a high frequency trader. I helped destroy the world's economy. Sorry 'bout that."

Using collections.namedtuple is a great way to improve the readability of code.

There are many useful, valid uses for exec and eval. He criticized people who think that exec and eval are universally evil.

collections.namedtuple is based on exec.

He showed Python code generation (i.e. generating code as a string and then passing it to exec). Using a piece of Python data that acts as a DSL, you can generate some Python code and pass it to exec. You can generate code for other programming languages just as well as Python.

He thinks that showing a little bit of code is better than letting people download slides.

Here's a trick: subclass an object, and add methods for all the double under methods in order to add logging. This lets you track how the method was used. You can use this to evaluate stuff symbolically instead of arithmetically. For instance, subclass int, add methods for things like __add__, and keep track of how __add__ was called.
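
A tiny sketch of that trick (the class and method names here are mine, not Raymond's):
class TracedInt(int):
    # Record how the value is used instead of just computing it.
    def __add__(self, other):
        print 'add(%r, %r)' % (int(self), int(other))
        return TracedInt(int(self) + int(other))

x = TracedInt(3)
y = x + 4      # prints add(3, 4)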

Polymorphism and operator overloading let you create custom classes that do additional stuff that numbers can't.

He showed function dispatch, like:
getattr(self, 'do_' + cmd)(arg)
See the cmd module.

Python's grammar is in Grammar/Grammar in the source code.

He showed how PLY puts Lex and Yacc expressions in docstrings. I.e. PLY uses docstrings to hold a DSL that PLY understands.

He showed loops with else clauses.

Knuth was the one who first came up with the idea of adding something like an else clause to a loop.

The idea that you shouldn't return in the middle of a function is advice from days gone by that no longer makes sense.

The nice thing about the way Python's half-open intervals work is:
s[2:5] + s[5:8] == s[2:8]
Copy a list: c = s[:]

Clear a list: del s[:]

Another way to clear a list: s[:] = []

In Python 3.X, a copy method was added to the list class. They're also adding a clear method to lists, to match all the other collections.

You can use itemgetter and attrgetter for the key function when calling list.sort. There's also methodcaller.

Use locale.strxfrm for the key function when sorting strings for locale-aware sorting.

Sort has a keyword argument named reverse.

To sort with two keys, use two passes:
s.sort(key=attrgetter('lastname'))           # Secondary key
s.sort(key=attrgetter('age'), reverse=True) # Primary key
"deque" is pronounced "deck". It gives you O(1) appends and pops from both ends.

"deque" is a "double ended queue".

He also mentioned defaultdict, counter, and OrderedDict. counter is a dict that knows how to count.

Here's how to use a namedtuple:
from collections import namedtuple

Point = namedtuple('Point', 'x y z')
p = Point(10, 20, 30)
Here's how to use a defaultdict:
from collections import defaultdict

d = defaultdict(list)
d[k].append(v)  # k is any key; missing keys start out as an empty list
dict.__missing__ gets called if you lookup something that isn't in the dict. You can subclass dict and just add a __missing__ method.
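
A minimal sketch:
class AutoList(dict):
    def __missing__(self, key):
        # Called by dict.__getitem__ when the key isn't found.
        value = self[key] = []
        return value

d = AutoList()
d['colors'].append('blue')    # no KeyError; d == {'colors': ['blue']}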

Idle has nice tab completion in the shell. It also has a nice menu item to look up modules by name so you can find the source easily.

You can use __getattr__ to introduce tracing.

He pronounces "__missing__" as "dunder missing". "dunder" is an abbreviation for "double underscore".

Writing "d.x" implicitly calls __getattribute__ which works as follows:
Check the instance.
Check the class tree:
If it's a descriptor, invoke it.
Check for __getattr__:
If it exists, invoke it.
Otherwise, raise AttributeError.
OrderedDict is really helpful when you must remember the order. This helps if you're going to move to a dict temporarily and then want stuff to come back out in the same order that it went in.

Each of the methods in OrderedDict has the same big O as the respective methods in dict. (Presumably, the constants are different.)

Here is Raymond's documentation on descriptors.

Here's a descriptor:
class Desc(object):

    def __get__(self, obj, objtype):
        # obj will be None if the descriptor is invoked on the class.
        print "Invocation!"
        return obj.x + 10

class A(object):

    def __init__(self, x):
        self.x = x

    plus_ten = Desc()

a = A(5)
a.plus_ten
If you attach a descriptor to an instance instead of a class, it won't work.

There is more than one __getattribute__ method:
A.x => type.__getattribute__(A, 'x')
a.x => object.__getattribute__(a, 'x')
By overriding __getattribute__, you "own the dot".

"Super Considered Super" was a blog post he wrote to refute "Super Considered Harmful".

__mro__ gives you the method resolution order.

super() doesn't necessarily go to the parent class of the current class. It's all about the instance's ancestor tree. super() might go to some other part of the instance's MRO, some part that your class doesn't necessarily know about.

Functions are descriptors. If you attach a function to a class dictionary, it'll add the magic for bound methods.

A.__dict__['func'] returns a normal function. A.func returns an unbound method. A().func returns a bound method.

Here is an example of using slots.

Here is another example of using slots:
class A(object):
    __slots__ = 'x', 'y'
If you have an instance of a class that uses slots, then it won't have a __dict__ attribute.

The type metaclass controls how classes are created. It supplies them with __getattribute__.

"Descriptors are how most of the modern features of Python were built.""

At this point in the day, my brain was dead, and he was about to start talking about Unicode. I'm not sure that saving Unicode for the end of the day is the best strategy ;)

He said that "unicode" should be called "unitable".

Unicode is a dictionary of code points to strings. The glyphs are not part of Unicode. They're part of a font rendering engine.

There are more than 100k unicode code points.

Microsoft and Apple worked hard on Arial so that it has glyphs for almost every codepoint.

from unicodedata import category, name

Arabic and Chinese have their own glyphs for digits. int works correctly with all the different ways to write numbers.

There are two ways to write an O with an umlaut because of combining characters.

Use "unicodedate.normalize('NFC', s)" to normalize the combining characters.

Arabic and Hebrew are written right-to-left--but not for numbers!

There are unicode control characters to switch which direction you're writing:
U+200E is for left-to-right
U+200F is for right-to-left
If you slice a string, you might accidentally chop off the unicode control character which causes the text to be backwards.

Just google for "bidi unicode" to get lots of help.

Most machines are little endian, but the Internet is big endian. Computers byte swap a lot, but they do it in hardware.

Code pages assume that the only people in the world are "us and the Americans." Everyone else gets question marks.

Encodings with "utf" in them do not lose information for any language. Any other encoding does.

If you use UTF-8, you lose the ability to get O(1) random access to characters in the string.

UTF-8 gives you some compression compared to fixed-width encodings, but not much.

The three main unicode problems are "many-to-one, one-to-many, and bidi."

Doubly encoding something or doubly decoding something is a super common problem.

If some characters don't display, it's probably a font problem. Try Arial.

The "one true encoding" is "UTF-8" (according to Tim Berners Lee).

UTF-8 is a superset of ASCII.

UTF-8 has holes. I.e. there are some number combinations that are not valid.

There's a lot of data in the world that is still encoded in UCS2. It's a two byte encoding.

It was a presidential order that caused us to move from EBCDIC to ASCII.

It was the Chinese government that decided UCS2 was not acceptable.

UTF-16-BE is a superset of UCS2.

There are only a handful of Chinese characters that don't fit into UCS2. The treble clef is a character that won't fit in UCS2.

To figure out what encoding something is in, HTTP has headers and email has MIME types.

If a browser wants to guess at an encoding, it'll try all the encodings and look for character frequency distributions. You can fool such a browser by giving it a page that says, "which witch has which witch?"

Mojibake is when you get your characters mixed up because you guessed the encoding wrong.

Thursday, March 22, 2012

Python: Scalability at YouTube

Mike Solomon and I gave a talk at PyCon called "Scalability at YouTube". They just posted the video on pyvideo.org.

Saturday, March 10, 2012

Python: The Year of PyPy

I think we're going to look back on this PyCon and remember it as the year that PyPy took over the world.

Thursday, March 08, 2012

Personal: Booth Babe


Yesterday at GDC, I achieved my goal of becoming a booth babe. I stood at a booth for 7 hours and answered questions about integrating YouTube video upload functionality into video games. Man are my feet sore! Oh well, at least I didn't have to wear heels ;)

Thursday, March 01, 2012

Ruby: Using YouTube APIs for Education

I gave a talk at the East Bay Ruby Meetup and the San Francisco Ruby Meetup called Using YouTube APIs for Education. In the talk, I covered YouTube.com/EDU, Google client libraries for Ruby, OAuth2, and doing TDD with web services using Pry and WebMock.

See also this talk on YouTube.com/EDU.