Tuesday, April 29, 2014

PyCon Notes: Introduction to SQLAlchemy

At PyCon, Mike Bayer, the author of SQLAlchemy, gave a three hour tutorial on it. Here's the video. What follows are my notes.

He used something called sliderepl. sliderepl is a nice ASCII tool that's a mix of slides and a REPL. You can flip through his source code / slides in the terminal. It's actually pretty neat.

At the lowest level, SQLAlchemy sends strings to the database and interprets the responses.

It took him 10 years to write it. He started the project in 2005. He's finally going to hit 1.0.

SQLAlchemy forces you to be aware of transactions.

Isolation models have to do with how ongoing transactions see ongoing work amongst each other.

Goals:

    Provide helpers, etc.

    Provide a fully featured facade over the Python DBAPI.

    Provide an industrial strength, but optional, ORM.

    Act as a base for in-house tools.

Philosophies:

    Make the usage of different DBs and adaptors as consistent as possible.

    But still expose unique features in each backend.

    It's not realistic for an ORM to perfectly match any DB. He mentioned
    leaky abstractions.

    Don't hide the DB. You must continue to think in SQL.

    Provide automation and DRY.

    Allow expression using declarative patterns.

Here's the stack:

    SQLAlchemy ORM
    SQLAlchemy Core:
        Schema/Types.
        SQL expression language.
        Engine:
            Connection pooling.
            Dialects.
    DBAPI
        There are different libraries for different DBs.
    Database

Reddit and Dropbox use SQLAlchemy Core without the ORM.

An "Engine" is a registry which provides connectivity to a particular DB server.

A "Dialect" interprets generic SQL and database commands in terms of a specific DBAPI and DB backend.

A "Connection Pool" holds a collection of DB connections in memory for fast reuse. It also handles reconnecting if the connection drops.

The "SQL Expression Language" allows SQL to be written using Python expressions.

"Schema/Types" uses Python objects to represent tables, columns, and datatypes.

The ORM:

    Allows construction of Python objects which can be mapped to relational DB
    tables.

    Transparently persists objects into their corresponding DB tables using
    the "unit of work" pattern.

    Provides a query system which loads objects and attributes using SQL
    generated from mappings.

    Builds on top of the Core.

    Presents a slightly more object centric perspective as opposed to a schema
    centric perspective.

SQLAlchemy is like an onion:

    ORM:
        SQL Expressions:
            Table Metadata, Reflection, DDL:
                Engine, Connection, Transactions

Level 1: Engine, Connection, Transactions

The DBAPI is the Python Database API. It's the de facto standard for Python database interfaces.

Most DBs have more than one DBAPI implementation.

MySQL has more than 10, and SQLAlchemy has to support about 6 of them.

They vary wildly.

He showed an example of using the DBAPI directly.

His favorite DBAPI implementation is psycopg2 (for PostgreSQL).

The DBAPI assumes that a transaction is always in progress. There is no begin() method, only commit() and rollback().

The DBAPI encourages bound parameters, via the execute() and executemany() methods, but it allows several different parameter formats (paramstyles).

All DBAPIs have inconsistencies regarding lots of things.

The DBAPI has its own exception hierarchy.

The first layer in SQLAlchemy is known as the "Engine", which is the object that maintains the classical DBAPI interaction.

SQLAlchemy has a lazy initialization pattern. It does a lot of lazy stuff.

An engine is a "factory" for connections.

    from sqlalchemy import create_engine

    engine = create_engine("sqlite:///some.db")
    result = engine.execute("somequery ...=:empid", empid=...)
    row = result.fetchone()

At this level of the API, the syntax for parameter placeholders (:empid) is whatever the underlying DBAPI expects.

row is a tuple as well as a dict.
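
Here's a minimal sketch of both access styles, using an in-memory SQLite database and a made-up employee table (the pre-2.0 engine.execute() API shown in the talk):

    from sqlalchemy import create_engine

    engine = create_engine("sqlite://")   # in-memory SQLite, just for illustration
    engine.execute("CREATE TABLE employee (emp_id INTEGER, name VARCHAR)")
    engine.execute("INSERT INTO employee VALUES (1, 'ed')")

    result = engine.execute(
        "SELECT emp_id, name FROM employee WHERE emp_id = :empid", empid=1)
    row = result.fetchone()
    print row[0], row[1]               # tuple-style access
    print row['emp_id'], row['name']   # dict-style access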

You can loop over the result:

    for row in result:

The actual cursor is result.cursor.

Under the covers, result.cursor.description has the names of the fields.

There's also result.fetchall().

You can control the scope of the connection using connect():

    conn = engine.connect()
    result = conn.execute(query)
    result.fetchall()
    conn.close()

Transactions:

    conn = engine.connect()
    trans = conn.begin()
    conn.execute(query)
    trans.commit()
    conn.close()

If you don't use a transaction explicitly, it'll use autocommit: each statement is still wrapped in a transaction, which is committed for you.

The DBAPI doesn't use autocommit by default.

Using with:

    with engine.begin() as conn:
        conn.execute(query)

Turning on debugging:

    engine.echo = True

Connecting directly via the engine is called "connectionless execution". The engine connects and disconnects for us.

Using a Connection explicitly lets you control scope.

There's a connection pool.

Whatever you give to engine.execute() gets passed to the DBAPI directly.

Level 2: Table Metadata, Reflection, DDL

He was inspired by "Patterns of Enterprise Application Architecture" when he wrote the SQL generation code and the ORM.

You can use SQLAlchemy to generate a schema.

You can also have SQLAlchemy use reflection to load an existing schema.

    from sqlalchemy import MetaData, Table, Column, \
        Integer, String
    metadata = MetaData()
    user_table = Table('user', metadata,
        Column('id', Integer, primary_key=True),
        ...
    )
    user_table.name

String is a varchar.

user_table.c has all the columns:

    user_table.c.name.type

Using it:

    user_table.select().where(user_table.c.fullname == 'asdf')

Creating tables:

    metadata.create_all(engine)

Types:

    String(50)
    DateTime
    Numeric(10, 2)
    Enum('a', 'b', 'c')
    ...

Constraints:

    from sqlalchemy import ForeignKey
    ...
    Column('user_id', Integer, ForeignKey('user.id'))

The references are lazy, so user can be created later.

Composite foreign keys:

    ...
    Column('story_id', Integer),
    Column('version_id', Integer),
    ForeignKeyConstraint(
        ['story_id', 'version_id'],
        ['story.story_id', 'story.version_id']
    )

There's a special setting to tell it to add foreign key dependencies later. Use this if you have mutually dependent tables.
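
He didn't name the setting in my notes, but I believe it's use_alter, which tells SQLAlchemy to add one of the foreign keys afterwards via ALTER TABLE. A rough sketch with two made-up, mutually dependent tables:

    widget = Table('widget', metadata,
        Column('id', Integer, primary_key=True),
        Column('favorite_gadget_id', Integer,
               ForeignKey('gadget.id', use_alter=True, name='fk_widget_gadget'))
    )
    gadget = Table('gadget', metadata,
        Column('id', Integer, primary_key=True),
        Column('widget_id', Integer, ForeignKey('widget.id'))
    )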

Nullable is True by default. Here's how to make it False:

    nullable=False

You can combine lines and leave out the type for foreign keys:

    Column('owner_id', ForeignKey('user.id'))

Reflection:

    metadata2 = MetaData()
    user_reflected = Table('user', metadata2, autoload=True,
                           autoload_with=engine)
    user_reflected.c

It took a really long time for him to learn how to meet everyone's needs.

Don't use "bound metadata". It's an antipattern.

Another system (not using metadata):

    from sqlalchemy import inspect
    inspector = inspect(engine)
    inspector.get_table_names()
    inspector.get_columns('address')
    inspector.get_foreign_keys('address')

Level 3: Types and Querying (SQL Expression Language)

Types:

    Integer
    String
    Unicode
    Boolean
    DateTime
    Float
    Numeric (a decimal)

Create and drop:

    metadata.create_all()
    table.create()
    metadata.drop_all()
    table.drop()

Column objects override magic methods like __eq__, etc.

    user_table.c.username.__eq__

This returns a BinaryExpression to be used as part of an expression:

    user_table.c.username == 'ed'

Or:

    ((user_table.c.username == 'ed') |
     (user_table.c.username == 'jack'))

There's also and_ and or_.

There's also >, etc.

There's also == None. It translates to "is NULL".

There are lots of operators.

    user_table.c.something.in_(...)
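
As a rough sketch of how these combine (and_ and or_ live in the top-level sqlalchemy package), you can build an expression and print it to see the SQL it renders:

    from sqlalchemy import and_, or_

    expr = or_(user_table.c.username == 'ed',
               and_(user_table.c.username == 'jack',
                    user_table.c.fullname == None))
    print expr   # prints the SQL with bound-parameter placeholders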

Execute:

    engine.execute(
        user_table.select().where(user_table.c.username == 'ed')
    )

Dialects:

    from sqlalchemy.dialects import postgresql
    expression.compile(dialect=postgresql.dialect())

Inserts:

    insert_stmt = user_table.insert().values(
        username='ed',
        fullname='Ed Jones'
    )
    conn = engine.connect()
    result = conn.execute(insert_stmt)

Inserting many:

    conn.execute(user_table.insert(), [
        {'username': 'jack', ...},
    ])

Select:

    conn.execute(
        select([user_table.c.username, user_table.c.fullname])
            .where(user_table.c.username == 'ed'))

Select all:

    conn.execute(select([user_table]).where....)

If you use where multiple times, it ands them together.

The result for an insert will have the primary key.
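
A small sketch of both points, continuing with the user_table from above (the values are made up):

    from sqlalchemy import select

    result = conn.execute(user_table.insert().values(username='wendy',
                                                     fullname='Wendy Williams'))
    print result.inserted_primary_key   # e.g. [3]

    stmt = (select([user_table])
            .where(user_table.c.username == 'wendy')
            .where(user_table.c.fullname == 'Wendy Williams'))  # ANDed together
    print conn.execute(stmt).fetchall()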

Looking at a statement:

    stmt = ...
    print stmt
    conn.execute(stmt)

Look at the rowcount on the result to see how many rows were affected.

Level 4: ORM

The OO classes are called a "domain model".

The most basic level is:

    object.save()
    class.load()

Some ORMs can represent multiple rows as domain objects.

Most ORMs can do composition using foreign key associations.

SQLAlchemy can represent class inheritance hierarchies.

SQLAlchemy can handle sharding.

Flavors of ORMs:

    Active Record: Domain objects handle their own persistence.

    Data Mapper: It tries to keep the details of persistence separate from the
    object being persisted.

There's also different ways of configuring them. Most use an "all-at-once", or declarative style, where class and table information is together.

Another style is to have the class declaration and the ORM mapper configured separately.

Hibernate is more pure, but it's tedious and verbose.

That's how SQLAlchemy worked in the beginning, but then he realized it was too verbose.

The SQLAlchemy ORM is a data mapper on the bottom, but it has a declarative style on top.

Saving is lazy: adding an object to the session doesn't flush it to the database immediately.

Objects are identical if they have the same primary key.

SQLAlchemy uses lazy loading.

SQLAlchemy is not compatible with Tornado and Twisted because they're callback oriented, and the lazy loading conflicts with that.

It also supports eager loading.

It also has method chaining.

Using the ORM:

    from sqlalchemy.ext.declarative import declarative_base
    Base = declarative_base()

    class User(Base):
        __tablename__ = 'user'
        id = Column(Integer, primary_key=True)
        name = Column(String)
        ...

        def __repr__(self):
          return "" % (self.name, self.fullname)

    User.__table__

The Mapper links a class to a table:

    User.__mapper__

Other stuff:

    Base.metadata
    Base._decl_class_registry

Creating a user:

    ed_user = User(name='ed', ...)

Setting stuff up:

    from sqlalchemy.orm import Session

    engine = ...
    session = Session(bind=engine)

The session is an ORM object.

    session.add(ed_user)

Query:

    our_user = session.query(User).filter_by(name='ed').first()

Instead of where, the ORM Query uses filter (and filter_by).

As soon as the session begins its work, it starts a transaction.

It keeps the identity map, so "is" works for ORM objects. Generally, other Active Record ORMs get this wrong.
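
For example (a sketch, assuming the 'ed' user from earlier is in the database):

    u1 = session.query(User).filter_by(name='ed').first()
    u2 = session.query(User).filter_by(name='ed').first()
    assert u1 is u2   # same primary key, so the identity map returns the same object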

Add all:

    session.add_all([
        User(...), ...
    ])

session.dirty has state changes.

session.new has the new objects.

session.commit() flushes everything and commits the transaction.

session.flush() flushes changes to the database without doing a commit.

After a commit, there is no transaction. All data is invalidated. Accessing data will start a new transaction and re-load the data from the database.
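
Here's a sketch of that lifecycle, continuing with ed_user:

    ed_user.fullname = 'Edward Jones'
    assert ed_user in session.dirty     # pending change, not yet flushed
    session.commit()                    # flush + COMMIT; attributes are expired
    print ed_user.fullname              # triggers a new SELECT in a new transaction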

He learned stuff from Storm and Hibernate about how to do things this way.

Expiring after the commit is the right thing to do.

session.rollback() also ends the transaction.

The object is a proxy to a row in the database. You only know about stuff in the database when there's a transaction in play.

One of the challenges with SQLAlchemy is performance, especially when setting a ton of attributes.

The core will do whatever you want. The ORM is more opinionated. It's clearer about how things should be done.

He has to be really careful about adding features. A lot of times, he's had to remove features that were not well thought out.

Here's a fuller example:

    class Network(Base):
        __tablename__ = 'network'
        network_id = Column(Integer, primary_key=True)
        name = Column(String(100), nullable=False)

    Base.metadata.create_all(engine)
    session
    session.add_all([Network(...), ...])

nullable is True by default.

Inner workings:

    User.name.property.columns[0]
    User.__table__
    User.__table__.c.name == 'ed'

But, you can write higher-level code such as:

    User.name == 'ed'

Query:

    query = session.query(User).filter(
        User.name == 'ed'
    ).order_by(User.id)
    query.all()

    for name, fullname in session.query(
        User.name, User.fullname
    ):
        ...

Create a dict really easily:

    d = dict(session.query(User.name, User))

Using order, limit, offset:

    session.query(User).order_by(...)[1:3]

filter takes full-blown SQL expressions:

    query.filter(User.name == 'ed')

filter_by is shorthand that takes keyword arguments naming mapped attributes:

    query.filter_by(name='ed')

If you use multiple filters in a chain, they are anded.

.first() limits things to one row.

.one() asserts that there is one and only one row. This may raise NoResultFound or MultipleResultsFound.

Examples:

    from sqlalchemy import or_

    q = session.query(User.fullname).order_by(User.fullname)
    q.all()

    q2 = q.filter(or_(User.name == 'mary', User.name == 'ed'))
    print q2[1]

Advanced ORM Usage

Start with the same mappings as above.

What he showed above is a little verbose. There are shortcuts.

Unlike other ORMs, SQLAlchemy doesn't assume you're going to have a primary key named id. He was trying to make it appealing for existing databases.

It doesn't decide what the database should look like. This makes it really explicit.

There are ways to enforce conventions using mixins.
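
For instance, here's a minimal sketch of a mixin that imposes a table-name and primary-key convention (the class names here are made up):

    from sqlalchemy import Column, Integer
    from sqlalchemy.ext.declarative import declared_attr

    class StandardMixin(object):
        @declared_attr
        def __tablename__(cls):
            return cls.__name__.lower()

        id = Column(Integer, primary_key=True)

    class Account(StandardMixin, Base):
        pass   # gets __tablename__ = 'account' and an integer "id" primary key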

    class Address(Base):

        ...
        user_id = Column(Integer, ForeignKey('user.id'))

        # This gives you address.user and user.addresses.
        user = relationship("User", backref="addresses")

This stuff works with reflection too.

    jack.addresses = [
        Address(...),
        ...
    ]

You can do anything to jack.addresses that you can do with a list.

If you run session.add(jack), it adds everything else automatically.

This is where ORMs can save you a lot of time. They make persisting stuff much easier.

The collection stays in memory until the transaction ends.

There is a way to use eager loading to avoid the N+1 queries problem.

As soon as you begin working with the session, it starts a new transaction.

The only time you need to use session.begin() is if you explicitly turned on autocommit.

If you don't care about the transaction, just .close() it. Then the connection pool will do whatever it does.

When he does a POST in a web app, he'll do an explicit commit().

SQLAlchemy uses a connection pool by default.

Collections and references are updated by manipulating objects, not primary / foreign key values.

An implicit join:

    session.query(User, Address).filter(
        User.id == Address.user_id
    ).all()

Without the filter, you get the cartesian product.

Here's an explicit join:

    session.query(User, Address).join(
        Address, User.id == Address.user_id
    ).all()

Using the relationship:

    session.query(User, Address).join(User.addresses).all()

Joining against the same table twice, using aliases:

    from sqlalchemy.orm import aliased

    a1, a2 = aliased(Address), aliased(Address)
    session.query(User).join(a1).join(a2).filter(
        a1.email_address == '...'
    ).filter(a2.email_address == '...').all()

There's a group_by.

Add .subquery() to the end to make it a subquery:

    from sqlalchemy import func

    subq = session.query(
        func.count(Address.id).label('count'),
        User.id.label('user_id')
    ).join(Address.user).group_by(User.id).subquery()

    ...outerjoin(subq, User.id == subq.c.user_id)...

Here's how to avoid the N+1 problem:

    from sqlalchemy.orm import subqueryload

    for user in session.query(User).options(
        subqueryload(User.addresses)
    ):
        print(user, user.addresses)

There's a really good chapter in the docs that covers this stuff.

Use Alembic for migrations.

People used to use Elixir on top of SQLAlchemy, but Elixir isn't being maintained anymore. Everything you can do in Elixir can be done using Declarative in SQLAlchemy.

Wednesday, April 23, 2014

PyCon Notes: PostgreSQL Proficiency for Python People

In summary, this tutorial was fantastic! I learned more in three hours than I would have learned if I had read a whole book!

Here's the video. Here are the slides. Here are my notes:

Christoph Pettus was the speaker. He's from PostgreSQL Experts.

PostgreSQL is a rich environment.

It's fully ACID compliant.

It has the richest set of features of any modern, production RDBMS. It has even more features than Oracle.

PostgreSQL focuses on quality, security, and spec compliance.

It's capable of very high performance: tens of thousands of transactions per second, petabyte-sized data sets, etc.

To install it, just use your package management system (apt, yum, etc.). Those systems will usually take care of initialization.

There are many options for OS X. Heroku even built a Postgres.app that runs more like a foreground app.

A "cluster" is a single PostgreSQL server (which can manage multiple databases).

initdb creates the basic file structure for a new cluster. You run it before starting PostgreSQL for the first time (the server isn't running yet at that point).

To create a database:

sudo su - postgres
psql

create database this_new_database;

To drop a database:

drop database this_new_database;

Debian runs initdb for you. Red Hat does not.

Debian has a cluster management system. Use it. See, for instance, pg_createcluster.

Always create databases as UTF-8. Once you've created it, you can't change it.

Don't use SQL_ASCII. It's a nightmare. Don't use "C locale".

pg_ctl is a built-in command to start and stop PostgreSQL:

cd POSTGRES_DIRECTORY
pg_ctl -D . start

Usually, pg_ctl is wrapped by something provided by your platform.

On Ubuntu, start PostgreSQL via:

service postgresql start

Always use "-m fast" when stopping.

Postgres puts its own data in a top-level directory. Let's call it $PGDATA.

Don't monkey around with that data.

pg_clog and pg_xlog are important. Don't mess with them.

On most systems, configuration lives in $PGDATA.

postgresql.conf contains server configuration.

pg_hba.conf contains authentication settings.

postgresql.conf can feel very overwhelming.

Avoid making a lot of changes to postgresql.conf. Instead, add the following to it:

include "postgresql.conf.include"

Then, mess with "postgresql.conf.include".

The important parameters fall into these categories: logging, memory, checkpoints, and the planner.

Logging:

Be generous with logging. It has a very low impact on the system. It's your best source of info for diagnosing problems.

You can log to syslog or log CSV to files. He showed his typical logging configuration.

He showed his guidelines / heuristics for all the settings, including how to finetune things. They're really good! See his slides.

As of version 9.3, you don't need to tweak Linux kernel parameters anymore.

Do not mess with fsync or synchronous_commit.

Most settings require a server reload to take effect. Some things require a server restart. Some can be set on a per-session basis. Here's how to do that. This is also an example of how to use a transaction:

begin;
set local random_page_cost = 2.5;
show random_page_cost;
abort;

pg_hba.conf contains users and roles. Roles are like groups. They form a hierarchy.

A user is just a role with login privs.

Don't use the "postgres" superuser for anything application-related.

Sadly, you probably will have to grant schema-modification privs to your app user if you use migrations, but if you don't have to, don't.

By default, DB traffic is not encrypted. Turn on SSL if you are running in a cloud provider.

In pg_hba.conf, "trust" means if they can log into the server, they can access Postgres too. "peer" means they can have a Postgres user that matches their username. "md5" is an md5 hash password.

It's a good idea to restrict the IP addresses allowed to talk to the server fairly tightly.

The WAL

The Write-Ahead Log is key to many Postgres operations. It's the basis for replication, crash recovery, etc.

When each transaction is committed, it is logged to the write-ahead log.

Changes in the transaction are flushed to disk.

If the system crashes, the WAL is "replayed" to bring the DB to a consistent state.

It's a continuous record of changes since the last checkpoint.

The WAL is stored in 16MB segments in the pg_xlog directory.

Never delete anything from pg_xlog.

archive_command is a way to move the WAL segments to someplace safe (like a
different system).

By default, synchronous_commit is on, which means that commits do not return until the WAL flush is done. If you turn it off, they'll return when the WAL flush is queued. You might lose transactions in the case of a crash, but there's no risk of database corruption.

Backup and Recovery

Experience has shown that 20% of the time, your EBS volumes will not reattach when you reboot in AWS.

pg_dump is a built-in dump/restore tool.

It takes a logical snapshot of the database.

It doesn't lock the database or prevent writes to disk.

pg_restore restores the database. It's not fast.

It's great for simple backups but not suitable for fast recovery from major failures.

pgbench is the built-in benchmarking tool.

pg_dump -Fc --verbose example > example.dump

Without the -Fc, it dumps SQL commands instead of its custom format.

pg_restore --dbname=example_restored --verbose example.dump

pg_restore takes a long time because it has to recreate indexes.

pg_dumpall --globals-only

Back up each database with pg_dump using --format=custom.

To do a parallel restore, use --jobs=.

If you have a large database, pg_dump may not be appropriate.

A disk snapshot + every WAL segment is enough to recreate the database.

To start a PITR (point in time recovery) backup:

select pg_start_backup(...);

Copy the disk image and any WAL files that are created.

select pg_stop_backup();

Make sure you have all the WAL segments.

The disk image + all the WAL segments are enough to create the DB.

See also github.com/wal-e/wal-e. It's highly recommended.

It automates backups to S3.

He explained how to do a PITR.

With PITR, you can rollback to a particular point in time. You don't have to replay everything.

This is super handy for application failures.

RDS is something that scripts all this stuff for you.

Replication

Send the WAL to another server.

Keep the server up to date with the primary server.

That's how PostgreSQL replication works.

The old way was called "WAL Archiving". Each 16MB segment was sent to the secondary when complete. Use rsync, WAL-E, etc., not scp.

The new way is Streaming Replication.

The secondary gets changes as they happen.

It's all setup via recovery.conf in your $PGDATA.

He showed a recovery.conf for a secondary machine, and showed how to let it become the master.

Always have a disaster recovery strategy.

pg_basebackup is a utility for doing a snapshot of a running server. It's the easiest way to take a snapshot to start a new secondary. It's also useful for archival backups. It's not the fastest thing, but it's pretty foolproof.

Replication:

The good:

Easy to setup.

Schema changes are replicated.

Secondaries can handle read-only queries for load balancing.

It either works or it complains loudly.

The bad:

You get the entire DB cluster or none of it.

No writes of any kind to the secondary, not even temporary tables.

Some things aren't replicated like temporary tables and unlogged tables.

His advice is to start with WAL-E. The README tells you everything. It fixes a ton of problems.

The biggest problem with WAL-E is that writing to S3 can be slow.

Another way to do funky things is trigger-based replication. There's a bunch of third-party packages to do this.

Bucardo is one that lets you do multi-master setups.

However, they're fiddly and complex to set up. They can also fail quietly.

Transactions, MVCC, and Vacuum

BEGIN;
INSERT ...;
INSERT ...;
COMMIT;

By the way, no bank works this way ;)

Everything runs inside of a transaction.

If there is no explicit transaction, each statement is wrapped in one for you.

Everything that modifies the database is transactional, even schema changes.

\d shows you all your tables.

With a transaction, you can even rollback a table drop.

South (the Django migration tool) runs the whole migration in a single transaction.

Many resources are held until the end of a transaction. Keep your transactions brief and to the point.

Beware of "IDLE IN TRANSACTION" sessions. This is a problem for Django apps.

A tuple in Postgres is the same thing as a row.

Postgres uses Multi-Version Concurrency Control. Each transaction sees its own version of the database.

Writers only block writers to the same tuple. Nothing else causes blocking.

Postgres will not allow two snapshots to "fork" the database. If two people try to write to the same tuple, Postgres will block one of them.

There are higher isolation modes. His description of them was really interesting.

He suggested that new apps use SERIALIZABLE. This will help you find the concurrency errors in your app.
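
For example, with psycopg2 you might run a session at SERIALIZABLE like this (a sketch; the table and values are made up, and your app must be prepared to retry on serialization failures):

import psycopg2
from psycopg2 import extensions

conn = psycopg2.connect("dbname=example")
conn.set_isolation_level(extensions.ISOLATION_LEVEL_SERIALIZABLE)
cur = conn.cursor()
cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
conn.commit()   # can raise a serialization failure; catch it and retry the transaction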

Deleted tuples are not usually immediately freed.

Vacuum's primary job is to scavenge tuples that are no longer visible to any transaction.

autovacuum generally handles this problem for you without intervention (since version 8).

Run analyze after a major database change to help the planner out.

If someone tells you "vacuum's not working", they're probably wrong.

The DB generally stabilizes at 20% to 50% bloat. That's acceptable.

The problem might be that there are long-running transactions or idle-in-transaction sessions. They'll block vacuuming. So will manual table locking.

He talked about vacuum issues for rare situations.

Schema Design

Normalization is important, but don't obsess about it.

Pick "entities". Make sure that no entity-level info gets pushed into the subsidiary items.

Pick a naming scheme and stick with it.

Plural or singular? DB people tend to like plural. ORMs tend to like singular.

You probably want lower_case to avoid quoting.

Calculated denormalization can sometimes be useful; copied denormalization is almost never useful.

Joins are good.

PostgreSQL executes joins very efficiently. Don't be afraid of them.

Don't worry about large tables joined with small tables.

Use the typing system. It has a rich set of types.

Use domains to create custom types.

A domain is a core type + a constraint.

Don't use polymorphic fields (fields whose interpretation is dependent on another field).

Don't use strings to store multiple types.

Use constraints. They're cheap and fast.

You can create constraints across multiple columns.

Avoid Entity-Attribute-Value schemas. They cause great pain. They're very inefficient. They make reports very difficult.

Consider using UUIDs instead of serials as synthetic keys.

The problem with serials for keys is that merging tables can be hard.

Don't have "Thing" tables like "Object" tables.

If a table has a few frequently-updated fields and a few slowly-updated fields, consider splitting the table. Split the fast-moving stuff out into a separate 1-to-1 table.

Arrays are a first-class type in PostgreSQL. It's a good substitute for using a subsidiary table.

A list of tags is a good fit for arrays.

He talked about hstore. It's much better than Entity-Attribute-Value. It's great for optional, variable attributes. It's like a hash. It can be indexed, searched, etc. It lets you add attributes to tables for users. Don't use it as a way to avoid all table modifications.

json is now a built in type.

There's also jsonb.

Avoid indexes on big things, like 10k character strings.

NULL is a total pain in the neck.

Only use it to mean "missing value".

Never use it to represent a meaningful value.

Let's call anything 1MB or more a "very large object". Store them in files. Store the metadata in the database. The database API is just not a good fit for this.

Many-to-many tables can get extremely large. Consider replacing them with array fields (either one way or both directions). You can use a trigger to maintain integrity.

You don't want more than about 250k entries in an array.

Use UTF-8. Period.

Always use TIMESTAMPTZ (which Django uses by default). Don't use TIMESTAMP. TIMESTAMPTZ is a timestamp converted to UTC.

Index types:

B-Tree

Use a B-Tree on a column if you frequently query on that column,
use one of the comparison operators, only get back 10-15% of the rows,
and run that query frequently.

It won't use the index if you're going to get back more than 15% of
the rows, because at that point it's faster to scan the table than to use the index.

Use a partial index if you can ignore most of the rows.

The entire tuple has to be copied into the index.

GiST

It's a framework to create indexes.

KNN indexes are the K-nearest neighbors.

GIN

Generalized inverted index. Used for full-text search.

The other index types are either less useful or very specialized.

Why isn't it using my index?

Use explain analyze to look at the query.

If it thinks it's going to require most of the rows, it'll do a table scan.

If it's wrong, use analyze to update the planner stats.

Sometimes, it can't use the index.

Two ways to create an index:

create index

create index concurrently

reindex rebuilds an index from scratch.

pg_stat_user_indexes tells you about how your indexes are being used.

What do you do if a query is slow:

Use explain or explain analyze.

explain doesn't actually run the query.

"Cost" is measured in arbitrary units. Traditionally, they have been "disk fetches". Costs are inclusive of subnodes.

explain analyze, on the other hand, actually runs the query.

Things that are bad:

Joins between 2 large tables.

Cross joins (cartesian products). These often happen by accident.

Sequential scans on large tables.

select count(*) is slow because it results in a full table scan since you
have to see if the tuples are alive or dead.

offset / limit. These actually run the query and then throw away that many
rows. Beware that GoogleBot is relentless. Use other keys.

If the database is slow:

Look at pg_stat_activity:

select * from pg_stat_activity;

tail -f the logs.

Too much I/O? iostat 5.

If the database isn't responding:

Try connecting with it using psql.

pg_stat_activity

pg_locks

Python Particulars

psycopg2 is the only real option in Python 2.

The result set of a query is loaded into client memory when the query completes. If there are a ton of rows, you could run out of memory. If you want to scroll through the results, use a "named" cursor. Be sure to dispose of it properly.
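
Here's a sketch of a named (server-side) cursor in psycopg2; the connection string, table, and handler are made up:

import psycopg2

conn = psycopg2.connect("dbname=example")
cur = conn.cursor(name='big_scroll')   # giving it a name makes it a server-side cursor
cur.itersize = 10000                   # rows fetched per round trip while iterating
cur.execute("SELECT * FROM big_table")
for row in cur:
    handle(row)                        # hypothetical per-row handler
cur.close()                            # dispose of it properly
conn.close()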

The Python 3 situation is not so great. There's py-postgresql. It's pure Python.

If you are using Django 1.6+, use the @atomic decorator.

Cluster all your writes into small transactions. Leave read operations outside.

Do all your writes at the very end of the view function.
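
A sketch of that pattern with Django 1.6+ (the view, form, and template names are made up):

from django.db import transaction
from django.shortcuts import redirect, render

def update_profile(request):
    form = ProfileForm(request.POST)        # ProfileForm is a hypothetical ModelForm
    if not form.is_valid():                 # reads and validation stay outside
        return render(request, 'profile.html', {'form': form})
    with transaction.atomic():              # one small transaction, writes only
        form.save()
    return redirect('profile')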

Multi-database works very nicely with hot standby.

Point the writes at the primary, and the reads at the secondary.

For Django 1.5, use the @xact decorator.

Sloppy transaction management can cause the dreaded Django idle-in-transaction problem.

Use South for database migration. South is getting merged into Django in version 1.7 of Django.

You can use manual migrations for stuff the Django ORM can't specify.

Special Situations

Upgrade to 9.3.4. Upgrade minor versions promptly.

Major version upgrades require more planning. pg_upgrade has to be run when the database is not running.

A full pg_dump / pg_restore is always the safest, although not the most practical.

Always read the release notes.

All parts of a replication set must be upgraded at once (for major versions).

Use copy, not insert, for bulk loading data. psycopg2 has a nice interface. Do a vacuum afterwards.
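
For example, with psycopg2's copy_expert (the file and table names are made up):

import psycopg2

conn = psycopg2.connect("dbname=example")
cur = conn.cursor()
with open('rows.csv') as f:
    cur.copy_expert("COPY my_table FROM STDIN WITH CSV", f)
conn.commit()

conn.autocommit = True                  # VACUUM can't run inside a transaction
cur.execute("VACUUM ANALYZE my_table")
conn.close()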

AWS

Instances can disappear and come back up without instance storage.

EBS can fail to reattach after reboot.

PIOPS are useful (but pricey) if you are using EBS.

Script everything: instance creation, PostgreSQL setup, etc. Use Salt. Use a VPC.

Scale up and down as required to meet load. If you're just using them to rent a server, it's really expensive.

PostgreSQL RDS is a managed database instance. Big plus: automatic failover! Big minus: you can't read from the secondary. It's expensive. It's a good place to start.

Sharding

Eventually, you'll run out of write capacity on your master.

postgres-xc is an open source fork of PostgreSQL.

Bucardo provides multi-master write capability.

He talked about custom sharding.

Instagram wrote a nice article about it.

Pooling

Opening a connection is expensive. Use a pooler.

pgbouncer is a pooler.

pgPool II can even do query analysis. However, it has higher overhead and is more complex to configure.

Tools

Monitor everything.

check_postgres.pl is a plugin to monitor PostgreSQL.

pgAdmin III and Navicat are nice clients.

pgbadger is for log analysis. So is pg_stat_statements.

Closing

MVCC works by each tuple having a range of transaction IDs that can see that
tuple.

Failover is annoying to do in the real world. People use HAProxy, some pooler, etc. with some scripting, or they have a human do the failover.

HandyRep is a server-based tool designed to allow you to manage a PostgreSQL "replication cluster", defined as a master and one or more replicas on the same network.

Dagger: A Dependency Injection Framework for Android and Java

Dagger is a new dependency injection framework for Android and Java. I went to a meetup yesterday to learn more about it. These are my notes:

The talk was by Jake Wharton who works at Square.

Every single app has some form of DI. You can do DI even if you're not using a library for doing it. The goal of DI is to separate the behavior of something from its required classes. If you've ever used a constructor to receive stuff, you've done a simple version of DI.

Square used Guice heavily.

Problems with Guice:
Config problems fail at runtime. 
Slow initialization, slow injection, and memory problems.

These are worse on Android. It causes the OS to load all the code for your app at once. This caused their app to take 2 seconds to start.

They called Dagger "Object Graph" initially.

Goals of Dagger:

Static analysis of all dependencies and injections. 
Fail as early as possible--compile time, not runtime.
Eliminate the need to do reflection of methods and annotations at runtime. Reflection in Dalvik is really slow.
Have negligible memory impact.

Jesse Wilson wrote it over the course of 5 weeks. He previously worked on Guice and Dalvik.

Square switched from Guice to Dagger in a fairly short period of time.

The name Dagger refers to "directed acyclic graph".

An ObjectGraph is the central dependency manager and injector.

@Module + @Provides

@Inject

@Singleton

Modules are meant to be composed together.

@Inject is required.

Field injection or constructor injection.

Dependencies can be stored in private final fields.

If you have @Inject, you don't have to say @Provides.

Injected fields cannot be private or final. They can be package protected.

Object graphs can be scoped. One object graph is a superset of another. For instance, you might create a new object graph once the user logs in that contains all of the objects that require a user object.

Android

The Android platform makes it really hard to test your apps. He showed how they deal with it.

The ObjectGraph is just another object. Hence, you can pass it around like a normal object.

There's one pain point in Dagger. All injection points must be listed on a module. This is used for aggressive static analysis.

Use overrides to facilitate testing.

He showed how to integrate with Gradle.

They have a "Debug Drawer" in their apps. It is hidden in the UI, but lets you configure all sorts of debug settings.

U+2020 is a sample app to show how to do all of this.

Using DI Incorrectly

Do NOT ignore the pattern.

Do NOT make every class use the pattern. Use it for the big things, such as the things that talk to a remote API.

Do NOT store dependencies as static fields.

Other Stuff

The Android docs say not to use DI. That advice is stale. Those complaints don't apply to Dagger.

Dagger is developer and debugger friendly.

The Future

Dagger has been out for 18 months.

They're working on the next major version, version 2.0.

Google is leading the development of the next version.

It won't use any reflection.

Dagger is not Android specific.

They're getting rid of injects lists.

Components encapsulate dependencies.

There will be dedicated annotations to denote scopes.

See squ.re/dagger2.

Questions

They use protocol buffers for their APIs. They use a schema for their APIs.

They have code that can generate a GraphViz file that shows you your dependency graph.

In their apps, they have a network module, an Android module, an app module, etc.

There's an IntelliJ plugin that lets you jump between @Inject and @Provides.