Tuesday, April 29, 2014

PyCon Notes: Introduction to SQLAlchemy

At PyCon, Mike Bayer, the author of SQLAlchemy, gave a three-hour tutorial on it. Here's the video. What follows are my notes.

He used something called sliderepl. sliderepl is a nice ASCII tool that's a mix of slides and a REPL. You can flip through his source code / slides in the terminal. It's actually pretty neat.

At the lowest level, SQLAlchemy sends strings to the database and interprets the responses.

He started the project in 2005, so he's been working on it for almost 10 years. He's finally about to hit 1.0.

SQLAlchemy forces you to be aware of transactions.

Isolation levels have to do with how much of each other's in-progress work concurrent transactions can see.

Goals:

    Provide helpers, etc.

Provide a fully featured facade over the Python DBAPI.

    Provide an industrial strength, but optional, ORM.

Act as a base for in-house tools.

Philosophies:

    Make the usage of different DBs and adaptors as consistent as possible.

    But still expose unique features in each backend.

    It's not realistic for an ORM to perfectly match any DB. He mentioned
    leaky abstractions.

    Don't hide the DB. You must continue to think in SQL.

    Provide automation and DRY.

    Allow expression using declarative patterns.

Here's the stack:

    SQLAlchemy ORM
    SQLAlchemy Core:
        Schema/Types.
        SQL expression language.
        Engine:
            Connection pooling.
            Dialects.
    DBAPI
        There are different libraries for different DBs.
    Database

Reddit and Dropbox use SQLAlchemy Core without the ORM.

An "Engine" is a registry which provides connectivity to a particular DB server.

A "Dialect" interprets generic SQL and database commands in terms of a specific DBAPI and DB backend.

A "Connection Pool" holds a collection of DB connections in memory for fast reuse. It also handles reconnecting if the connection drops.

The "SQL Expression Language" allows SQL to be written using Python expressions.

"Schema/Types" uses Python objects to represent tables, columns, and datatypes.

The ORM:

    Allows construction of Python objects which can be mapped to relational DB
    tables.

    Transparently persists objects into their corresponding DB tables using
    the "unit of work" pattern.

    Provides a query system which loads objects and attributes using SQL
    generated from mappings.

    Builds on top of the Core.

    Presents a slightly more object centric perspective as opposed to a schema
    centric perspective.

SQLAlchemy is like an onion:

    ORM:
        SQL Expressions:
            Table Metadata, Reflection, DDL:
                Engine, Connection, Transactions

Level 1: Engine, Connection, Transactions

The DBAPI is the Python Database API (PEP 249). It's the de facto standard for Python DB interfaces.

Most DBs have more than one DBAPI implementation.

MySQL has more than 10, and SQLAlchemy has to support about 6 of them.

They vary wildly.

He showed an example of using the DBAPI directly.
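
Roughly, direct DBAPI code looks like this (a sketch using the stdlib sqlite3 module and a made-up employee table):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cursor = conn.cursor()
    cursor.execute(
        "create table employee "
        "(emp_id integer primary key, emp_name varchar)")
    cursor.execute(
        "insert into employee (emp_name) values (?)", ("ed",))
    conn.commit()   # the DBAPI is always "in a transaction"
    cursor.execute("select emp_id, emp_name from employee")
    print(cursor.fetchall())
    cursor.close()
    conn.close()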

His favorite DBAPI implementation is psycopg2 (for PostgreSQL).

The DBAPI assumes that a transaction is always in progress. There is no begin() method, only commit() and rollback().

The DBAPI encourages bound parameters, via the execute() and executemany() methods, but it allows five different parameter formats ("paramstyles").

All DBAPIs have inconsistencies regarding lots of things.

The DBAPI has its own exception hierarchy.

The first layer in SQLAlchemy is known as the "Engine", which is the object that maintains the classical DBAPI interaction.

SQLAlchemy uses lazy initialization throughout. For instance, creating an engine doesn't actually connect to the database; that happens on first use.

An engine is a "factory" for connections.

    from sqlalchemy import create_engine

    engine = create_engine("sqlite:///some.db")
    result = engine.execute(
        "select * from employee where emp_id = :empid", empid=3)
    row = result.fetchone()

At this level of the API, the syntax for binding parameters (:empid) is whatever the DBAPI expects.

The returned row acts like both a tuple and a dict.
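
For example, with the row fetched above:

    row[0]           # access by position
    row['emp_name']  # access by column name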

You can loop over the result:

    for row in result:

The actual cursor is result.cursor.

Under the covers, result.cursor.description has the names of the fields.

There's also result.fetchall().

You can control the scope of the connection using connect():

    conn = engine.connect()
    result = conn.execute(query)
    result.fetchall()
    conn.close()

Transactions:

    trans = conn.begin()
    conn.execute(query)
    trans.commit()
    conn.close()

If you don't begin a transaction explicitly, the engine uses autocommit: each statement still gets wrapped in its own transaction behind the scenes.

The DBAPI doesn't use autocommit by default.

Using with:

    with engine.begin() as conn:
        conn.execute(query)

Turning on debugging:

    engine.echo = True
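
You can also pass it to create_engine:

    engine = create_engine("sqlite:///some.db", echo=True)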

Connecting directly via the engine is called "connectionless execution". The engine connects and disconnects for us.

Using a Connection explicitly lets you control scope.

There's a connection pool.

Whatever you give to engine.execute() gets passed to the DBAPI directly.

Level 2: Table Metadata, Reflection, DDL

He was inspired by "Patterns of Enterprise Application Architecture" when he wrote the SQL generation code and the ORM.

You can use SQLAlchemy to generate a schema.

You can also have SQLAlchemy use reflection to load an existing schema.

    from sqlalchemy import MetaData, Table, Column, \
        Integer, String

    metadata = MetaData()
    user_table = Table('user', metadata,
        Column('id', Integer, primary_key=True),
        Column('username', String(50)),
        Column('fullname', String(50)),
    )
    user_table.name

String is a varchar.

user_table.c has all the columns:

    user_table.c.username.type

Using it:

    user_table.select().where(user_table.c.fullname == 'asdf')

Creating tables:

    metadata.create_all(engine)

Types:

    String(50)
    DateTime
    Numeric(10, 2)
    Enum('a', 'b', 'c')
    ...

Constraints:

    from sqlalchemy import ForeignKey
    ...
    Column('user_id', Integer, ForeignKey('user.id'))

The references are lazy, so user can be created later.
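
For example, a minimal address table wired to user (a sketch; the column names are assumptions, chosen to match the ORM examples later):

    address_table = Table('address', metadata,
        Column('id', Integer, primary_key=True),
        Column('user_id', Integer, ForeignKey('user.id')),
        Column('email_address', String(100), nullable=False),
    )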

Composite foreign keys:

    ...
    Column('story_id', Integer),
    Column('version_id', Integer),
    ForeignKeyConstraint(
        ['story_id', 'version_id'],
        ['story.story_id', 'story.version_id']
    )

There's a special setting to tell it to add foreign key dependencies later. Use this if you have mutually dependent tables.
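
I believe the setting he means is use_alter, which emits the constraint as a separate ALTER TABLE after both tables exist (a sketch; the constraint name is made up):

    ForeignKeyConstraint(
        ['story_id', 'version_id'],
        ['story.story_id', 'story.version_id'],
        use_alter=True, name='fk_story_version'
    )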

Nullable is True by default. Here's how to make it False:

    Column('email_address', String(100), nullable=False)

You can combine lines and leave out the type for foreign keys:

    Column('owner_id', ForeignKey('user.id'))

Reflection:

    metadata2 = MetaData()
    user_reflected = Table('user', metadata2, autoload=True,
                           autoload_with=engine)
    user_reflected.c

It took a really long time for him to learn how to meet everyone's needs.

Don't use "bound metadata". It's an antipattern.

Another system (not using metadata):

    from sqlalchemy import inspect
    inspector = inspect(engine)
    inspector.get_table_names()
    inspector.get_columns('address')
    inspector.get_foreign_keys('address')

Level 3: SQL Expressions (Types and Querying)

Types:

    Integer
    String
    Unicode
    Boolean
    DateTime
    Float
    Numeric (a decimal)

Create and drop:

    metadata.create_all(engine)
    table.create(engine)
    metadata.drop_all(engine)
    table.drop(engine)

Column objects overload Python's magic methods, like __eq__:

    user_table.c.username.__eq__

This returns a BinaryExpression to be used as part of an expression:

    user_table.c.username == 'ed'

Or:

    ((user_table.c.username == 'ed') |
     (user_table.c.username == 'jack'))

There's also and_ and or_.

There's also >, etc.

There's also == None. It translates to IS NULL.
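
For example:

    print(user_table.c.fullname == None)
    # renders with IS NULL, not "= NULL"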

There are lots of operators.

    user_table.c.something.in_(...)

Execute:

    engine.execute(
        user_table.select().where(user_table.c.username == 'ed')
    )

Dialects:

    from sqlalchemy.dialects import postgresql
    expression.compile(dialect=postgresql.dialect())

Inserts:

    insert_stmt = user_table.insert().values(
        username='ed',
        fullname='Ed Jones'
    )
    conn = engine.connect()
    result = conn.execute(insert_stmt)

Inserting many:

    conn.execute(user_table.insert(), [
        {'username': 'jack', ...},
    ])

Select:

    conn.execute(
        select([user_table.c.username, user_table.c.fullname])
            .where(user_table.c.username == 'ed'))

Select all:

    conn.execute(select([user_table]).where(...))

If you use where multiple times, it ands them together.
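
A sketch of chaining where():

    stmt = select([user_table]).\
        where(user_table.c.username == 'ed').\
        where(user_table.c.fullname == 'Ed Jones')
    # WHERE username = :p1 AND fullname = :p2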

The result for an insert will have the primary key.
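
For example:

    result = conn.execute(user_table.insert().values(
        username='wendy', fullname='Wendy Williams'))
    result.inserted_primary_key  # e.g. [2]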

Looking at a statement:

    stmt = ...
    print stmt
    conn.execute(stmt)

Look at the rowcount on the result to see how many rows were affected.
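
A sketch with an UPDATE:

    result = conn.execute(
        user_table.update()
            .where(user_table.c.username == 'jack')
            .values(fullname='Jack Brown'))
    result.rowcount  # rows affected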

Level 4: ORM

The OO classes are called a "domain model".

The most basic level is:

    object.save()
    class.load()

Some ORMs can represent multiple rows as domain objects.

Most ORMs can do composition using foreign key associations.

SQLAlchemy can represent class inheritance hierarchies.

SQLAlchemy can handle sharding.

Flavors of ORMs:

    Active Record: Domain objects handle their own persistence.

    Data Mapper: It tries to keep the details of persistence separate from the
    object being persisted.

There are also different ways of configuring them. Most use an "all-at-once", or declarative, style where class and table information live together.

Another style is to have the class declaration and the ORM mapper configured separately.

Hibernate is more pure, but it's tedious and verbose.

That's how SQLAlchemy worked in the beginning, but then he realized it was too verbose.

The SQLAlchemy ORM is a data mapper on the bottom, but it has a declarative style on top.

.save() is lazy. It doesn't flush immediately.

Objects are identical if they have the same primary key.

SQLAlchemy uses lazy loading.

SQLAlchemy is not compatible with Tornado and Twisted because they're callback oriented, and the lazy loading conflicts with that.

It also supports eager loading.

It also has method chaining.

Using the ORM:

    from sqlalchemy import Column, Integer, String
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class User(Base):
        __tablename__ = 'user'
        id = Column(Integer, primary_key=True)
        name = Column(String)
        fullname = Column(String)

        def __repr__(self):
            return "<User(%r, %r)>" % (self.name, self.fullname)

    User.__table__

The Mapper links a class to a table:

    User.__mapper__

Other stuff:

    Base.metadata
    Base._decl_class_registry

Creating a user:

    ed_user = User(name='ed', ...)

Setting stuff up:

    from sqlalchemy.orm import Session

    engine = ...
    session = Session(bind=engine)

The session is an ORM object.

    session.add(ed_user)

Query:

    our_user = session.query(User).filter_by(name='ed').first()

Instead of where, you use filter_by.

As soon as the session begins its work, it starts a transaction.

It keeps the identity map, so "is" works for ORM objects. Generally, other Active Record ORMs get this wrong.
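
A sketch of the identity map at work:

    u1 = session.query(User).filter_by(name='ed').first()
    u2 = session.query(User).filter_by(name='ed').first()
    u1 is u2  # True: both queries return the same object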

Add all:

    session.add_all([
        User(...), ...
    ])

session.dirty has state changes.

session.new has the new objects.
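
A sketch of inspecting that pending state (assuming the ed_user from above has already been flushed):

    ed_user.name = 'edward'  # modify a persistent object
    session.dirty            # contains ed_user
    session.new              # objects added but not yet flushed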

session.commit() flushes everything and commits the transaction.

session.flush() flushes changes to the database without doing a commit.

After a commit, there is no transaction. All data is invalidated. Accessing data will start a new transaction and re-load the data from the database.

He learned stuff from Storm and Hibernate about how to do things this way.

Expiring after the commit is the right thing to do.

session.rollback() also ends the transaction.

The object is a proxy to a row in the database. You only know about stuff in the database when there's a transaction in play.

One of the challenges with SQLAlchemy is performance, especially when setting a ton of attributes.

The core will do whatever you want. The ORM is more opinionated. It's clearer about how things should be done.

He has to be really careful about adding features. A lot of times, he's had to remove features that were not well thought out.

Here's a fuller example:

    class Network(Base):
        __tablename__ = 'network'
        network_id = Column(Integer, primary_key=True)
        name = Column(String(100), nullable=False)

    Base.metadata.create_all(engine)
    session.add_all([Network(...), ...])

nullable is True by default.

Inner workings:

    User.name.property.columns[0]
    User.__table__
    User.__table__.c.name == 'ed'

But, you can write higher-level code such as:

    User.name == 'ed'

Query:

    query = session.query(User).filter(
        User.name == 'ed'
    ).order_by(User.id)
    query.all()

    for name, fullname in session.query(
        User.name, User.fullname
    ):
        ...

Create a dict really easily:

    d = dict(session.query(User.name, User))

Using order, limit, offset:

    session.query(User).order_by(...)[1:3]

filter is for full-blown expressions:

    query.filter(User.name == 'ed')

filter_by is keyword-argument shorthand for filtering on specific fields:

    query.filter_by(name='ed')

If you use multiple filters in a chain, they are anded.
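
For example:

    session.query(User).\
        filter(User.name == 'ed').\
        filter(User.fullname == 'Ed Jones').all()
    # WHERE name = :p1 AND fullname = :p2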

.first() limits things to one row.

.one() asserts that there is one and only one row. It may raise NoResultFound or MultipleResultsFound.

Examples:

    from sqlalchemy import or_

    q = session.query(User.fullname).order_by(User.fullname)
    q.all()

    q2 = q.filter(or_(User.name == 'mary', User.name == 'ed'))
    print q2[1]

Advanced ORM Usage

Start with the same mappings as above.

What he showed above is a little verbose. There are shortcuts.

Unlike other ORMs, SQLAlchemy doesn't assume you're going to have a primary key named id. He was trying to make it appealing for existing databases.

It doesn't decide what the database should look like. This makes it really explicit.

There are ways to enforce conventions using mixins.
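
A sketch of a hypothetical mixin that enforces a surrogate-key convention (the names are made up):

    class SurrogatePK(object):
        id = Column(Integer, primary_key=True)

    class Widget(SurrogatePK, Base):
        __tablename__ = 'widget'
        name = Column(String(50))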

    from sqlalchemy.orm import relationship

    class Address(Base):
        __tablename__ = 'address'

        ...
        user_id = Column(Integer, ForeignKey('user.id'))

        # This gives you address.user and user.addresses.
        user = relationship("User", backref="addresses")

This stuff works with reflection too.

    jack.addresses = [
        Address(...),
        ...
    ]

You can do anything to jack.addresses that you can do with a list.

If you run session.add(jack), it adds everything else automatically.
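
A sketch of that cascade:

    jack = User(name='jack', fullname='Jack Jones')
    jack.addresses = [Address(email_address='jack@gmail.com')]
    session.add(jack)  # the Address objects are picked up too
    session.commit()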

This is where ORMs can save you a lot of time. They make persisting stuff much easier.

The collection stays in memory until the transaction ends.

There is a way to use eager loading to avoid the N+1 queries problem.

As soon as you begin working with the session, it starts a new transaction.

The only time you need to use session.begin() is if you explicitly turned on autocommit.

If you don't care about the transaction, just .close() it. Then the connection pool will do whatever it does.

When he does a POST in a web app, he'll do an explicit commit().

SQLAlchemy uses a connection pool by default.

Collections and references are updated by manipulating objects, not primary / foreign key values.

An implicit join:

    session.query(User, Address).filter(
        User.id == Address.user_id
    ).all()

Without the filter, you get the cartesian product.

Here's an explicit join:

    session.query(User, Address).join(
        Address, User.id == Address.user_id
    ).all()

Using the relationship:

    session.query(User, Address).join(User.addresses).all()

A self-join-style query (joining against the same table twice) requires aliases:

    from sqlalchemy.orm import aliased

    a1, a2 = aliased(Address), aliased(Address)
    session.query(User).join(a1).join(a2).filter(
        a1.email_address == '...'
    ).filter(a2.email_address == '...').all()

There's a group_by.

Add .subquery() to the end to make it a subquery:

    from sqlalchemy import func

    subq = session.query(
        func.count(Address.id).label('count'),
        User.id.label('user_id')
    ).join(Address.user).group_by(User.id).subquery()

    ...outerjoin(subq, User.id == subq.c.user_id)...

Here's how to avoid the N+1 problem:

    from sqlalchemy.orm import subqueryload

    for user in session.query(User).options(
        subqueryload(User.addresses)
    ):
        print(user, user.addresses)

There's a really good chapter in the docs that covers this stuff.

Use Alembic for migrations.

People used to use Elixir on top of SQLAlchemy, but Elixir isn't being maintained anymore. Everything you can do in Elixir can be done using Declarative in SQLAlchemy.
