I knew that this was called the flyweight design pattern, but I didn't know if it was already implemented somewhere in Python. (Strictly speaking, I thought it was called the "flywheel" design pattern, and my buddy Drew Perttula corrected me.)
My first attempt was to write code like:
>>> s1 = "foo"This code looks up the word "foo" by value and returns the same instance every time. Notice, it works.
>>> s2 = ''.join(['f', 'o', 'o'])
>>> s1 == s2
True
>>> s1 is s2
False
>>> identity_cache = {}
>>> s1 = identity_cache.setdefault(s1, s1)
>>> s2 = identity_cache.setdefault(s2, s2)
>>> s1 == 'foo'
True
>>> s1 == s2
True
>>> s1 is s2
True
However, Monte Davidoff pointed out that this is what the intern builtin is for. From the docs:
Enter string in the table of ``interned'' strings and return the interned string - which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup - if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare. Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys. Changed in version 2.3: Interned strings are not immortal (like they used to be in Python 2.2 and before); you must keep a reference to the return value of intern() around to benefit from it.Here it is in action:
>>> s1 = "foo"Well did it work? My program still functions, but I didn't get a tremendous savings in memory. It turns out that I don't have enough dups, and that's not where I'm spending all my memory anyway. Oh well, at least I learned about the intern() function.
>>> s2 = ''.join(['f', 'o', 'o'])
>>> s1 == s2
True
>>> s1 is s2
False
>>> s1 = intern(s1)
>>> s2 = intern(s2)
>>> s1 == 'foo'
True
>>> s1 == s2
True
>>> s1 is s2
True


7 comments:
This is awesome jj - I didn't know about intern but I've really wanted it lately... I hope it's reasonably effective on jython cuz I'm in the same out of memory issue in our hadoop/jython cluster
I don't have much deep python experience, I am coming from a Java background, but I really want to learn more about python. Do you know if it has any garbage collection (gc) logging like Java? For example, in Java, I can enable gc logging, take thread dumps, and then look for memory leaks or high consumption. I have done a fair bit of debugging/performance tuning in Java this way, but I am wondering if Python offers these options.
@varikin
There is a garbage collection module (called 'gc') and gc.collect() can force (or at least strongly hint) to do a collection with circular issues.
From experience, this usually only postpones the problems a little.
For logging the best means is to just use the hotshot profiler.
If you have large arrays of primitive types I would recommend using the array package from SciPy (www.enthought.com). We were able to reduce the memory by over an order of magnitude just by having 5 arrays (3 int, 1 float, and 1 string) instead of a class with 5 attributes. 3Gig went to 250Meg.
There are some other tricks if you have many, many small objects. Read up on __slots__. (There are some nasty side effects however...)
this is useful and nice trick. Thanks so much :D, I think i will be using like so
So what's next for your program? I want to hear about cool compression schemes that let you work on your data while it's still compressed (like finding and factoring out string prefixes).
I have one more that you might like, but it involves a longer blog post ;)
Nice tips. Thanks, Doug.
Post a Comment