Wednesday, April 23, 2014

PyCon Notes: PostgreSQL Proficiency for Python People

In summary, this tutorial was fantastic! I learned more in three hours than I would have learned if I had read a whole book!

Here's the video. Here are the slides. Here are my notes:

Christophe Pettus was the speaker. He's from PostgreSQL Experts.

PostgreSQL is a rich environment.

It's fully ACID compliant.

It has the richest set of features of any modern, production RDBMS. It has even more features than Oracle.

PostgreSQL focuses on quality, security, and spec compliance.

It's capable of very high performance: tens of thousands of transactions per second, petabyte-sized data sets, etc.

To install it, just use your package management system (apt, yum, etc.). Those systems will usually take care of initialization.

There are many options for OS X. Heroku even built a Postgres.app that runs more like a foreground app.

A "cluster" is a single PostgreSQL server (which can manage multiple databases).

initdb creates the basic file structure. It has to be run before PostgreSQL can start.

To create a database:

sudo su - postgres
psql

create database this_new_database;

To drop a database:

drop database this_new_database;

Debian runs initdb for you. Red Hat does not.

Debian has a cluster management system. Use it. See, for instance, pg_createcluster.

Always create databases as UTF-8. Once you've created it, you can't change it.

Don't use SQL_ASCII. It's a nightmare. Don't use the "C" locale.
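
Here's a minimal Python sketch of building a CREATE DATABASE statement that always asks for UTF-8. The helper names (quote_ident, create_database_sql) are my own, and using template0 is my assumption: it's the template that lets you pick an encoding different from the one template1 was created with.

```python
def quote_ident(name):
    """Quote an SQL identifier: wrap it in double quotes and double any
    embedded double quote."""
    return '"' + name.replace('"', '""') + '"'


def create_database_sql(name):
    """Return a CREATE DATABASE statement that explicitly requests UTF-8.

    TEMPLATE template0 is an assumption of this sketch; it allows the
    encoding to differ from template1's.
    """
    return "CREATE DATABASE {} ENCODING 'UTF8' TEMPLATE template0;".format(
        quote_ident(name)
    )
```

You'd paste the resulting statement into psql, or run it through a connection that is not inside a transaction, since CREATE DATABASE can't run in a transaction block.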

pg_ctl is a built-in command to start and stop PostgreSQL:

cd POSTGRES_DIRECTORY
pg_ctl -D . start

Usually, pg_ctl is wrapped by something provided by your platform.

On Ubuntu, start PostgreSQL via:

service postgresql start

Always use "-m fast" when stopping.

Postgres puts its own data in a top-level directory. Let's call it $PGDATA.

Don't monkey around with that data.

pg_clog and pg_xlog are important. Don't mess with them.

On most systems, configuration lives in $PGDATA.

postgresql.conf contains server configuration.

pg_hba.conf contains authentication settings.

postgresql.conf can feel very overwhelming.

Avoid making a lot of changes to postgresql.conf. Instead, add the following to it:

include 'postgresql.conf.include'

Then, mess with "postgresql.conf.include".

The important parameters fall into these categories: logging, memory, checkpoints, and the planner.

Logging:

Be generous with logging. It has a very low impact on the system. It's your best source of info for diagnosing problems.

You can log to syslog or log CSV to files. He showed his typical logging configuration.
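
To make the include-file idea concrete, here's a small Python sketch that renders a handful of logging settings in postgresql.conf syntax. The parameter names are real PostgreSQL settings, but the values are purely illustrative placeholders of mine. They are not the speaker's recommended configuration (see his slides for that).

```python
def render_setting(name, value):
    """Render one 'name = value' line in postgresql.conf syntax."""
    if isinstance(value, bool):
        rendered = "on" if value else "off"
    elif isinstance(value, (int, float)):
        rendered = str(value)
    else:
        # Strings are single-quoted; embedded single quotes are doubled.
        rendered = "'" + str(value).replace("'", "''") + "'"
    return "{} = {}".format(name, rendered)


def render_include_file(settings):
    """Render a list of (name, value) pairs as the body of an include file."""
    return "\n".join(render_setting(n, v) for n, v in settings) + "\n"


# Illustrative values only (my assumption, not taken from the talk).
EXAMPLE_LOGGING_SETTINGS = [
    ("log_destination", "csvlog"),
    ("logging_collector", True),
    ("log_min_duration_statement", 250),  # milliseconds
    ("log_checkpoints", True),
    ("log_lock_waits", True),
]
```

Writing render_include_file(EXAMPLE_LOGGING_SETTINGS) to the include file keeps your changes separate from the stock postgresql.conf.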

He showed his guidelines / heuristics for all the settings, including how to fine-tune things. They're really good! See his slides.

As of version 9.3, you don't need to tweak Linux kernel parameters anymore.

Do not mess with fsync or synchronous_commit.

Most settings require a server reload to take effect. Some things require a server restart. Some can be set on a per-session basis. Here's how to do that. This is also an example of how to use a transaction:

begin;
set local random_page_cost = 2.5;
show random_page_cost;
abort;

pg_hba.conf controls which users and roles can connect, and how they authenticate. Roles are like groups. They form a hierarchy.

A user is just a role with login privs.

Don't use the "postgres" superuser for anything application-related.

Sadly, you probably will have to grant schema-modification privs to your app user if you use migrations, but if you don't have to, don't.

By default, DB traffic is not encrypted. Turn on SSL if you are running in a cloud provider.

In pg_hba.conf, "trust" means anyone who can reach the server can access Postgres without a password. "peer" means a local user can connect as the Postgres user that matches their OS username. "md5" means password authentication using an MD5 hash.

It's a good idea to restrict the IP addresses allowed to talk to the server fairly tightly.
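
As a sketch of what a tightly restricted entry might look like, here's a bit of Python that formats one pg_hba.conf line and validates the address range first. The function name and the example database, user, and network are hypothetical. I left "peer" out of the allowed methods because it only applies to local socket connections, not to host entries.

```python
import ipaddress

# Methods this sketch allows for TCP/IP entries.
HOST_METHODS = {"md5", "trust"}


def hba_line(database, user, cidr, method="md5", require_ssl=True):
    """Format one pg_hba.conf entry: type, database, user, address, method."""
    if method not in HOST_METHODS:
        raise ValueError("unsupported method for a host entry: " + method)
    # ip_network raises ValueError for malformed input or stray host bits,
    # e.g. "10.0.0.5/24".
    network = ipaddress.ip_network(cidr)
    conn_type = "hostssl" if require_ssl else "host"
    return " ".join([conn_type, database, user, str(network), method])
```

For example, hba_line("mydb", "app_user", "10.0.0.0/24") yields an SSL-only, password-authenticated entry limited to one /24 network.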

The WAL

The Write-Ahead Log is key to many Postgres operations. It's the basis for replication, crash recovery, etc.

When each transaction is committed, it is logged to the write-ahead log.

Changes in the transaction are flushed to disk.

If the system crashes, the WAL is "replayed" to bring the DB to a consistent state.

It's a continuous record of changes since the last checkpoint.
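
The log-then-replay idea can be illustrated with a toy model in Python. This is only my own simplification to show the principle. It has nothing to do with how PostgreSQL actually lays out its WAL or data files, and the class and method names are made up.

```python
class ToyDatabase(object):
    """A toy write-ahead-logged key/value store."""

    def __init__(self):
        self.wal = []            # "durable": survives a crash in this model
        self.checkpoint = {}     # "durable": data as of the last checkpoint
        self.checkpoint_pos = 0  # how much of the WAL the checkpoint covers
        self.pages = {}          # volatile: lost on a crash

    def commit(self, changes):
        # Write-ahead: the log record is appended before the data changes.
        self.wal.append(dict(changes))
        self.pages.update(changes)

    def do_checkpoint(self):
        # Persist the current data and remember how far the WAL got.
        self.checkpoint = dict(self.pages)
        self.checkpoint_pos = len(self.wal)

    def crash(self):
        # Everything volatile disappears.
        self.pages = None

    def recover(self):
        # Start from the last checkpoint and replay every later WAL record.
        self.pages = dict(self.checkpoint)
        for record in self.wal[self.checkpoint_pos:]:
            self.pages.update(record)
```

After a commit, a checkpoint, another commit, and a crash, calling recover() rebuilds the state from the checkpoint plus the one WAL record written after it.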

The WAL is stored in 16MB segments in the pg_xlog directory.

Never delete anything from pg_xlog.

archive_command is a way to move the WAL segments to someplace safe (like a different system).

By default, synchronous_commit is on, which means that commits do not return until the WAL flush is done. If you turn it off, they'll return when the WAL flush is queued. You might lose transactions in the case of a crash, but there's no risk of database corruption.

Backup and Recovery

In his experience, about 20% of the time your EBS volumes will not reattach when you reboot in AWS.

pg_dump is a built-in dump/restore tool.

It takes a logical snapshot of the database.

It doesn't lock the database or prevent writes to disk.

pg_restore restores the database. It's not fast.

It's great for simple backups but not suitable for fast recovery from major failures.

pgbench is the built-in benchmarking tool.

pg_dump -Fc --verbose example > example.dump

Without the -Fc, it dumps SQL commands instead of its custom format.

pg_restore --dbname=example_restored --verbose example.dump

pg_restore takes a long time because it has to recreate indexes.

pg_dumpall --globals-only

Back up each database with pg_dump using --format=custom.

To do a parallel restore, use --jobs=.
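
If you want to script the dump and restore commands above from Python, something like the following sketch could work. It only builds the argument lists using the long forms of the flags shown above. The function names are mine, and actually running the commands (for instance with subprocess.check_call) is left to the caller.

```python
def pg_dump_command(dbname, outfile):
    """Argument list for a custom-format dump of one database."""
    return ["pg_dump", "--format=custom", "--verbose",
            "--file=" + outfile, dbname]


def pg_restore_command(dbname, dumpfile, jobs=1):
    """Argument list for restoring a custom-format dump.

    jobs > 1 asks pg_restore to restore in parallel. The target database
    is assumed to exist already.
    """
    if jobs < 1:
        raise ValueError("jobs must be at least 1")
    return ["pg_restore", "--dbname=" + dbname,
            "--jobs={}".format(jobs), "--verbose", dumpfile]
```

Passing a list rather than a single shell string avoids quoting problems with unusual database or file names.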

If you have a large database, pg_dump may not be appropriate.

A disk snapshot + every WAL segment is enough to recreate the database.

To start a PITR (point in time recovery) backup:

select pg_start_backup(...);

Copy the disk image and any WAL files that are created.

select pg_stop_backup();

Make sure you have all the WAL segments.

The disk image + all the WAL segments are enough to create the DB.

See also github.com/wal-e/wal-e. It's highly recommended.

It automates backups to S3.

He explained how to do a PITR.

With PITR, you can roll back to a particular point in time. You don't have to replay everything.

This is super handy for application failures.

RDS is something that scripts all this stuff for you.

Replication

Send the WAL to another server.

Keep the server up to date with the primary server.

That's how PostgreSQL replication works.

The old way was called "WAL Archiving". Each 16MB segment was sent to the secondary when complete. Use rsync, WAL-E, etc., not scp.

The new way is Streaming Replication.

The secondary gets changes as they happen.

It's all set up via recovery.conf in your $PGDATA.

He showed a recovery.conf for a secondary machine, and showed how to let it become the master.

Always have a disaster recovery strategy.

pg_basebackup is a utility for doing a snapshot of a running server. It's the easiest way to take a snapshot to start a new secondary. It's also useful for archival backups. It's not the fastest thing, but it's pretty foolproof.

Replication:

The good:

Easy to set up.

Schema changes are replicated.

Secondaries can handle read-only queries for load balancing.

It either works or it complains loudly.

The bad:

You get the entire DB cluster or none of it.

No writes of any kind to the secondary, not even temporary tables.

Some things aren't replicated like temporary tables and unlogged tables.

His advice is to start with WAL-E. The README tells you everything. It fixes a ton of problems.

The biggest problem with WAL-E is that writing to S3 can be slow.

Another way to do funky things is trigger-based replication. There's a bunch of third-party packages to do this.

Bucardo is one that lets you do multi-master setups.

However, they're fiddly and complex to set up. They can also fail quietly.

Transactions, MVCC, and Vacuum

BEGIN;
INSERT ...;
INSERT ...;
COMMIT;

By the way, no bank works this way ;)

Everything runs inside of a transaction.

If there is no explicit transaction, each statement is wrapped in one for you.

Everything that modifies the database is transactional, even schema changes.

\d shows you all your tables.

With a transaction, you can even roll back a table drop.

South (the Django migration tool) runs the whole migration in a single transaction.

Many resources are held until the end of a transaction. Keep your transactions brief and to the point.

Beware of "IDLE IN TRANSACTION" sessions. This is a problem for Django apps.

A tuple in Postgres is the same thing as a row.

Postgres uses Multi-Version Concurrency Control. Each transaction sees its own version of the database.

Writers only block writers to the same tuple. Nothing else causes blocking.

Postgres will not allow two snapshots to "fork" the database. If two people try to write to the same tuple, Postgres will block one of them.

There are higher isolation modes. His description of them was really interesting.

He suggested that new apps use SERIALIZABLE. This will help you find the concurrency errors in your app.

Deleted tuples are not usually immediately freed.

Vacuum's primary job is to scavenge tuples that are no longer visible to any transaction.

autovacuum generally handles this problem for you without intervention (since version 8).

Run analyze after a major database change to help the planner out.

If someone tells you "vacuum's not working", they're probably wrong.

The DB generally stabilizes at 20% to 50% bloat. That's acceptable.

The problem might be that there are long-running transactions or idle-in-transaction sessions. They'll block vacuuming. So will manual table locking.

He talked about vacuum issues for rare situations.

Schema Design

Normalization is important, but don't obsess about it.

Pick "entities". Make sure that no entity-level info gets pushed into the subsidiary items.

Pick a naming scheme and stick with it.

Plural or singular? DB people tend to like plural. ORMs tend to like singular.

You probably want lower_case to avoid quoting.

Calculated denormalization can sometimes be useful; copied denormalization is almost never useful.

Joins are good.

PostgreSQL executes joins very efficiently. Don't be afraid of them.

Don't worry about large tables joined with small tables.

Use the typing system. It has a rich set of types.

Use domains to create custom types.

A domain is a core type + a constraint.

Don't use polymorphic fields (fields whose interpretation is dependent on another field).

Don't use strings to store multiple types.

Use constraints. They're cheap and fast.

You can create constraints across multiple columns.

Avoid Entity-Attribute-Value schemas. They cause great pain. They're very inefficient. They make reports very difficult.

Consider using UUIDs instead of serials as synthetic keys.

The problem with serials for keys is that merging tables can be hard.

Don't have "Thing" tables like "Object" tables.

If a table has a few frequently-updated fields and a few slowly-updated fields, consider splitting the table. Split the fast-moving stuff out into a separate 1-to-1 table.

Arrays are a first-class type in PostgreSQL. They're a good substitute for using a subsidiary table.

A list of tags is a good fit for arrays.

He talked about hstore. It's much better than Entity-Attribute-Value. It's great for optional, variable attributes. It's like a hash. It can be indexed, searched, etc. It lets you add attributes to tables for users. Don't use it as a way to avoid all table modifications.

json is now a built-in type.

There's also jsonb.

Avoid indexes on big things, like 10k character strings.

NULL is a total pain in the neck.

Only use it to mean "missing value".

Never use it to represent a meaningful value.

Let's call anything 1MB or more a "very large object". Store them in files. Store the metadata in the database. The database API is just not a good fit for this.

Many-to-many tables can get extremely large. Consider replacing them with array fields (either one way or both directions). You can use a trigger to maintain integrity.

You don't want more than about 250k entries in an array.

Use UTF-8. Period.

Always use TIMESTAMPTZ (which Django uses by default). Don't use TIMESTAMP. TIMESTAMPTZ is a timestamp converted to UTC.
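
On the Python side, the matching habit is to work with timezone-aware datetimes. Here's a small sketch (the helper name is my own) that normalizes an aware datetime to UTC and refuses naive ones, since a naive datetime doesn't identify a single instant.

```python
from datetime import datetime, timedelta, timezone


def to_utc(aware_dt):
    """Convert a timezone-aware datetime to UTC; reject naive datetimes."""
    if aware_dt.tzinfo is None or aware_dt.utcoffset() is None:
        raise ValueError("naive datetime: the instant is ambiguous")
    return aware_dt.astimezone(timezone.utc)


# 9:00 at UTC-4 is the same instant as 13:00 UTC.
utc_minus_4 = timezone(timedelta(hours=-4))
example = to_utc(datetime(2014, 4, 23, 9, 0, tzinfo=utc_minus_4))
```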

Index types:

B-Tree

Use a B-Tree on a column if you frequently query on that column, use one of the comparison operators, only get back 10-15% of the rows, and run that query frequently.

It won't use the index if you're going to get back more than 15% of the rows because it's faster to scan a table than scan an index.

Use a partial index if you can ignore most of the rows.

The entire indexed value has to be copied into the index.

GiST

It's a framework to create indexes.

KNN indexes are the K-nearest neighbors.

GIN

Generalized inverted index. Used for full-text search.

The others are either not good or very specific.

Why isn't it using my index?

Use explain analyze to look at the query.

If it thinks it's going to require most of the rows, it'll do a table scan.

If it's wrong, use analyze to update the planner stats.

Sometimes, it can't use the index.

Two ways to create an index:

create index

create index concurrently

reindex rebuilds an index from scratch.

pg_stat_user_indexes tells you about how your indexes are being used.

What do you do if a query is slow:

Use explain or explain analyze.

explain doesn't actually run the query.

"Cost" is measured in arbitrary units. Traditionally, they have been "disk fetches". Costs are inclusive of subnodes.

explain analyze, on the other hand, actually runs the query.

Things that are bad:

Joins between 2 large tables.

Cross joins (cartesian products). These often happen by accident.

Sequential scans on large tables.

select count(*) is slow because it results in a full table scan since you have to see if the tuples are alive or dead.

offset / limit. These actually run the query and then throw away that many rows. Beware that GoogleBot is relentless. Use other keys.
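
One common reading of "use other keys" is keyset pagination: remember the last key you showed and ask for rows after it, instead of using offset. Here's a hedged Python sketch that builds such a query with psycopg2-style %s placeholders. The function name is mine, and the table and column names are hypothetical and assumed to be trusted identifiers, since they're interpolated directly into the SQL.

```python
def keyset_page_query(table, key_column, last_seen_key=None, page_size=50):
    """Return (sql, params) for one page ordered by an indexed, unique key.

    table and key_column must be trusted identifiers; only the values go
    through placeholders.
    """
    if last_seen_key is None:
        # First page: nothing to skip.
        sql = "SELECT * FROM {t} ORDER BY {k} LIMIT %s".format(
            t=table, k=key_column)
        return sql, (page_size,)
    # Later pages: the index on key_column lets the database jump straight
    # to the first row after the last one already shown.
    sql = "SELECT * FROM {t} WHERE {k} > %s ORDER BY {k} LIMIT %s".format(
        t=table, k=key_column)
    return sql, (last_seen_key, page_size)
```

The caller passes the key of the last row on the current page as last_seen_key to fetch the next page, so deep pages cost about the same as the first one.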

If the database is slow:

Look at pg_stat_activity:

select * from pg_stat_activity;

tail -f the logs.

Too much I/O? iostat 5.

If the database isn't responding:

Try connecting to it using psql.

pg_stat_activity

pg_locks

Python Particulars

psycopg2 is the only real option in Python 2.

The result set of a query is loaded into client memory when the query completes. If there are a ton of rows, you could run out of memory. If you want to scroll through the results, use a "named" cursor. Be sure to dispose of it properly.
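
Here's a minimal sketch of the named-cursor approach. It assumes conn is an open psycopg2 connection (version 2.5 or later, so the cursor can be used as a context manager) that is not in autocommit mode. The function name and the default cursor name are my own.

```python
def stream_rows(conn, sql, params=None,
                cursor_name="stream_rows_cursor", itersize=2000):
    """Yield rows from a server-side ("named") cursor in batches.

    conn is assumed to be a psycopg2 connection. Passing a name to
    conn.cursor() makes the cursor server-side, so rows are fetched
    itersize at a time instead of all at once.
    """
    # The with block closes the cursor when iteration finishes, when an
    # error occurs, or when the generator itself is closed.
    with conn.cursor(name=cursor_name) as cur:
        cur.itersize = itersize
        cur.execute(sql, params)
        for row in cur:
            yield row
```

You'd use it as "for row in stream_rows(conn, 'select * from big_table'): ...", where big_table is a placeholder. If you stop iterating early, call close() on the generator so the cursor is released promptly.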

The Python 3 situation is not so great. There's py-postgresql. It's pure Python.

If you are using Django 1.6+, use the @atomic decorator.

Cluster all your writes into small transactions. Leave read operations outside.

Do all your writes at the very end of the view function.

Multi-database works very nicely with hot standby.

Point the writes at the primary, and the reads at the secondary.
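
In Django, that split is usually done with a database router. Here's a minimal sketch. The class name and the "replica" alias are my own placeholders, and it assumes your DATABASES setting defines both "default" (the primary) and "replica" (the hot standby).

```python
class PrimaryReplicaRouter(object):
    """Send reads to the hot standby and writes to the primary.

    Assumes DATABASES has the aliases "default" and "replica".
    """

    def db_for_read(self, model, **hints):
        return "replica"

    def db_for_write(self, model, **hints):
        return "default"

    def allow_relation(self, obj1, obj2, **hints):
        # Both aliases hold the same data, so relations are always fine.
        return True
```

You'd list the class's dotted path in the DATABASE_ROUTERS setting. Keep in mind that the secondary can lag slightly behind the primary, so a read right after a write may not see it yet.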

For Django 1.5, use the @xact decorator.

Sloppy transaction management can cause the dreaded Django idle-in-transaction problem.

Use South for database migrations. South is being merged into Django in version 1.7.

You can use manual migrations for stuff the Django ORM can't specify.

Special Situations

Upgrade to 9.3.4. Upgrade minor versions promptly.

Major version upgrades require more planning. pg_upgrade has to be run when the database is not running.

A full pg_dump / pg_restore is always the safest, although not the most practical.

Always read the release notes.

All parts of a replication set must be upgraded at once (for major versions).

Use copy, not insert, for bulk loading data. psycopg2 has a nice interface. Do a vacuum afterwards.
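
Here's a rough sketch of bulk loading through psycopg2's copy_expert, written for Python 3. It assumes cursor is a psycopg2 cursor. The function name is mine, and the table and column names are assumed to be trusted identifiers because they're interpolated into the SQL.

```python
import csv
import io


def copy_rows(cursor, table, columns, rows):
    """Bulk load rows with COPY ... FROM STDIN in csv format.

    cursor is assumed to be a psycopg2 cursor. table and columns must be
    trusted identifiers.
    """
    buf = io.StringIO()
    # The csv module handles quoting of commas, quotes, and newlines.
    # None is written as an empty field, which COPY's csv format reads
    # as NULL; empty strings generally end up as NULL the same way.
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    buf.seek(0)
    sql = "COPY {} ({}) FROM STDIN WITH (FORMAT csv)".format(
        table, ", ".join(columns))
    cursor.copy_expert(sql, buf)
```

For very large loads you'd stream from a file instead of building everything in memory first.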

AWS

Instances can disappear and come back up without instance storage.

EBS can fail to reattach after reboot.

PIOPS are useful (but pricey) if you are using EBS.

Script everything: instance creation, PostgreSQL, etc. Use Salt. Use a VPC.

Scale up and down as required to meet load. If you're just using them to rent a server, it's really expensive.

PostgreSQL RDS is a managed database instance. Big plus: automatic failover! Big minus: you can't read from the secondary. It's expensive. It's a good place to start.

Sharding

Eventually, you'll run out of write capacity on your master.

postgres-xc is an open source fork of PostgreSQL.

Bucardo provides multi-master write capability.

He talked about custom sharding.

Instagram wrote a nice article about it.

Pooling

Opening a connection is expensive. Use a pooler.

pgbouncer is a pooler.

pgPool II can even do query analysis. However, it has higher overhead and is more complex to configure.

Tools

Monitor everything.

check_postgres.pl is a plugin to monitor PostgreSQL.

pgAdmin III and Navicat are nice clients.

pgbadger is for log analysis. So is pg_stat_statements.

Closing

MVCC works by each tuple having a range of transaction IDs that can see that tuple.
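
Here's a toy Python model of that visibility rule. It's a big simplification of my own: it pretends every transaction with an ID at or below the snapshot has committed, and it ignores in-progress and aborted transactions as well as ID wraparound. The xmin and xmax names mirror PostgreSQL's system columns, but the rest is made up for illustration.

```python
class TupleVersion(object):
    """One version of a row, visible to a range of transaction IDs."""

    def __init__(self, value, xmin, xmax=None):
        self.value = value
        self.xmin = xmin  # ID of the transaction that created this version
        self.xmax = xmax  # ID of the transaction that deleted/replaced it


def visible(tup, snapshot_xid):
    """True if a snapshot taken at snapshot_xid can see this version."""
    created = tup.xmin <= snapshot_xid
    not_deleted = tup.xmax is None or tup.xmax > snapshot_xid
    return created and not_deleted


def removable(tup, oldest_snapshot_xid):
    """True if no snapshot can see this version anymore, so a vacuum
    could reclaim it."""
    return tup.xmax is not None and tup.xmax <= oldest_snapshot_xid


# An update by transaction 20 ends the old version and creates a new one.
old_version = TupleVersion("v1", xmin=10, xmax=20)
new_version = TupleVersion("v2", xmin=20)
```

A snapshot at 15 sees only old_version and a snapshot at 25 sees only new_version. The old version can't be reclaimed while the oldest snapshot is still at 15, which is the toy version of why long-running transactions block vacuuming.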

Failover is annoying to do in the real world. People use HAProxy, some pooler, etc. with some scripting, or they have a human do the failover.

HandyRep is a server-based tool designed to allow you to manage a PostgreSQL "replication cluster", defined as a master and one or more replicas on the same network.
