Friday, March 23, 2012

PyCon: Apache Cassandra and Python

See the website.

See the slides.

The talk doesn't cover setting up a production cluster.

Using a schema is optional.

Cassandra is like a combination of Dynamo from Amazon and BigTable from Google.

It uses timestamps for conflict resolution. The clients determine the time. There are other approaches to conflict resolution as well.
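To make the timestamp idea concrete, here is a minimal sketch of last-write-wins merging in plain Python. It is not Cassandra's actual code: the `Column` tuple, the `merge_replicas` helper, and the sample data are all made up for illustration, and how exact timestamp ties are handled here is an assumption of the sketch.

```python
from collections import namedtuple

# Hypothetical stand-in for a stored column: a value plus the
# timestamp that the writing client attached to it.
Column = namedtuple("Column", ["value", "timestamp"])


def merge_replicas(a, b):
    """Merge two replica views of a row; the newer timestamp wins."""
    merged = dict(a)
    for name, col in b.items():
        current = merged.get(name)
        # Assumption for this sketch: on an exact tie, keep what we have.
        if current is None or col.timestamp > current.timestamp:
            merged[name] = col
    return merged


replica_a = {"email": Column("old@example.com", 100)}
replica_b = {
    "email": Column("new@example.com", 250),
    "city": Column("Santa Clara", 120),
}

merged = merge_replicas(replica_a, replica_b)
print(merged["email"].value)
```

Since the clients supply the timestamps, a client with a skewed clock can win or lose a conflict it shouldn't, so keeping client clocks in sync matters.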

Data in Cassandra looks like a multi-level dict.
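As a rough picture of that multi-level dict, something like the following works as a mental model. The column family name, row keys, and columns are invented for the example, and a real cluster obviously doesn't store things as Python dicts.

```python
# column family -> row key -> column name -> value
# All names below are hypothetical.
keyspace = {
    "Users": {
        "jsmith": {"first": "John", "last": "Smith"},
        "adoe": {"first": "Alice", "email": "alice@example.com"},
    },
}

# Rows in the same column family don't need the same columns.
print(keyspace["Users"]["adoe"]["email"])
print("email" in keyspace["Users"]["jsmith"])
```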

By default, Cassandra eats 1/2 of your RAM. You might want to change that ;)

The speaker uses pycassa as his client. It's the simplest approach.

telephus is a Cassandra client for Twisted.

cassandra-dbapi2 is a Cassandra client that implements the Python DB-API 2.0. It's based on Cassandra's new CQL interface.

Don't use pure Thrift to talk to Cassandra.

Cassandra is good at scaling out: capacity grows roughly linearly as you add nodes.

There's a batch interface and a streaming interface.

There's a lot of flexibility concerning column families. You can even have columns representing different periods in time.
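One way to read "columns representing different periods in time" is a row per thing being measured, with one column per time bucket. Here is a small sketch of that layout using plain dicts. The hourly bucket format, the `record_view` helper, and the page names are my own assumptions, not something shown in the talk.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hour_bucket(when):
    """Column name for the hour containing `when`, e.g. '2012-03-23T09'."""
    return when.strftime("%Y-%m-%dT%H")


# row key (page) -> column name (hour bucket) -> view count
page_views = defaultdict(lambda: defaultdict(int))


def record_view(page, when):
    page_views[page][hour_bucket(when)] += 1


record_view("/talks", datetime(2012, 3, 23, 9, 15, tzinfo=timezone.utc))
record_view("/talks", datetime(2012, 3, 23, 9, 45, tzinfo=timezone.utc))
record_view("/talks", datetime(2012, 3, 23, 10, 5, tzinfo=timezone.utc))

# Each row grows a new column whenever a new hour shows up.
print(sorted(page_views["/talks"].items()))
```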

Pycassa supports different data types.

Pycassa has an interface that looks a little more like an ORM.

Cassandra has native indexes. However, indexes are not recommended for "high cardinality" values like timestamps or keywords.
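If you're not sure whether a column counts as "high cardinality", a quick check is the ratio of distinct values to rows. This is only a back-of-the-envelope sketch: the `cardinality_ratio` helper and the sample rows are hypothetical, and there is no official cutoff.

```python
def cardinality_ratio(rows, column):
    """Distinct values divided by row count, for rows that have the column."""
    values = [row[column] for row in rows if column in row]
    if not values:
        return 0.0
    return len(set(values)) / len(values)


rows = [
    {"state": "TX", "created_at": 1332460800},
    {"state": "TX", "created_at": 1332460801},
    {"state": "CA", "created_at": 1332460802},
    {"state": "CA", "created_at": 1332460803},
]

print(cardinality_ratio(rows, "state"))
print(cardinality_ratio(rows, "created_at"))
```

A ratio near 1 means almost every row has its own value (timestamps are the classic case), which is the situation the talk says to avoid indexing.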

1 comment:

benslin kard said...

Cassandra implements the BigTable data model but uses a design where data storage is distributed over symmetric nodes.