Uh-oh! Something just happened
One of these queries might quickly narrow down a problem with your Postgres database. With a little luck, the culprit has to do with one of three things: queries, indexes, or disk space. Postgres exposes a lot of internal information in system tables like
pg_stat_user_tables; if you are not already familiar with them, we will be looking at a few as we go:
- Long running queries?
- What is blocking my lock?
- Stop processes
- Tables need vacuum?
- Missing index?
- Slow vacuum?
- Database size?
- Table size?
- Index size?
pg_stat_activity has information on currently running processes and, in our case, queries. We can get all long-running queries based on
query_start, including the SQL that is being run and additional data about what each process is doing.
Long running queries?
SELECT pid, usename, datname, state, NOW() - pg_stat_activity.query_start AS duration, wait_event_type, wait_event, query FROM pg_stat_activity WHERE (NOW() - pg_stat_activity.query_start) > interval '1 minutes' ORDER BY duration DESC;
There are a few cases you are likely to find here:
- A query like SELECT * FROM big_table has been running since an incident started
- A query like SELECT * FROM big_table WHERE should_have_an_index is missing an index
- A query is stuck on locks; the wait_event_type and wait_event columns have information which could indicate locking bottlenecks (info on what the field values mean)
- Lots of idle queries; filter them out with WHERE state <> 'idle' unless something may be opening more idle connections than we want
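Putting the last point into practice, the long-running-query check above can be tightened into a sketch that drops idle sessions (same pg_stat_activity columns as before; the one-minute threshold is just a starting point):

```sql
-- Long-running queries only, excluding idle sessions
SELECT pid, usename, datname, state,
       NOW() - query_start AS duration,
       wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND (NOW() - query_start) > interval '1 minute'
ORDER BY duration DESC;
```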
In the case of locks being the bottleneck, we can use
pg_blocking_pids to walk backward and find what is responsible for queries stumbling over each other's locks.
What is blocking my lock?
SELECT * FROM pg_stat_activity WHERE pid IN (SELECT pg_blocking_pids(<pid of blocked query>));
To take action, use either
pg_cancel_backend or pg_terminate_backend: the former
tries to cancel the running query, and if that does not work, the latter will terminate the backend process
SELECT pg_cancel_backend(<pid>); SELECT pg_terminate_backend(<pid>);
pg_stat_user_tables shows system information about tables. We can use this to see if our autovacuum or indexing strategy is not working for us.
Tables need vacuum?
SELECT schemaname, relname, n_dead_tup, n_live_tup, round(100 * n_dead_tup::numeric / n_live_tup, 2) AS percent_dead_tuples FROM pg_stat_user_tables WHERE n_live_tup > 0 ORDER BY n_dead_tup DESC;
A high
percent_dead_tuples likely means autovacuum is not running often enough on this table. Dead tuples make queries and index scans slower and less effective; the usual fix is to run autovacuum more frequently. The rest of the columns in the table can also help with identifying where to look next:
SELECT schemaname, relname, seq_scan, seq_tup_read, idx_scan, idx_tup_fetch, n_tup_ins, n_tup_upd, n_tup_del, n_tup_hot_upd, n_mod_since_analyze, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze, vacuum_count, autoanalyze_count, autovacuum_count FROM pg_stat_user_tables;
Some common issues:
- Infrequent vacuums or analyzes. last_autovacuum and last_autoanalyze will tell you if a table has not been vacuumed or analyzed recently, which is likely the case for tables with a large n_dead_tup. They can also be paired with autovacuum_count and autoanalyze_count to see how often each is running
- Missing indexes. Look for high seq_scan and seq_tup_read relative to idx_scan. For healthy indexes we want most of our reads to not be coming from sequential scans, though there are some cases where you can sequentially walk the whole table without a problem (e.g. SELECT * FROM table_with_10_rows needs no index)
- Not performing HOT updates. Look for high-usage tables with low n_tup_hot_upd relative to n_tup_upd. On tables with lots of updates, there are big performance improvements to be had from more heap-only tuple (HOT) updates. The fix here might live in the application, in determining how to group affected rows together more easily
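As a rough sketch of that last check, the HOT-update share can be computed directly from pg_stat_user_tables (NULLIF guards against dividing by zero on tables with no updates):

```sql
-- Share of updates that were HOT (heap-only tuple) updates, per table
SELECT schemaname, relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(100 * n_tup_hot_upd::numeric / NULLIF(n_tup_upd, 0), 2) AS hot_update_percent
FROM pg_stat_user_tables
WHERE n_tup_upd > 0
ORDER BY n_tup_upd DESC;
```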
The opposite of a missing index could be too many indexes slowing down updates. If an index has a low hit rate, we are doing an extra index update without any of the benefits.
SELECT schemaname, relname, indexrelname, idx_scan FROM pg_stat_user_indexes ORDER BY idx_scan ASC;
Another potential issue is that autovacuum is running but taking too long.
SELECT datname, usename, pid, state, wait_event, current_timestamp - xact_start AS xact_runtime, query FROM pg_stat_activity WHERE upper(query) LIKE '%VACUUM%' ORDER BY xact_start;
At the database level, vacuum needs to run periodically to freeze old transaction IDs so Postgres can continue running. If you want to check this value before it gets too large (~2 billion):
SELECT max(age(datfrozenxid)) FROM pg_database;
There is a good article from Amazon on setting up alarms to monitor this in AWS. If it is climbing or triggers an alarm, we can narrow down the area if there are multiple databases:
SELECT datname, age(datfrozenxid) AS frozen_age, round(100 * age(datfrozenxid)::numeric / 2000000000, 2) AS frozen_age_percent FROM pg_database ORDER BY frozen_age DESC LIMIT 20;
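If a single database stands out, the same idea works per table via pg_class.relfrozenxid; a sketch, to be run while connected to the suspect database:

```sql
-- Oldest unfrozen transaction ID per ordinary table, worst offenders first
SELECT c.oid::regclass AS table_name,
       age(c.relfrozenxid) AS frozen_age
FROM pg_class c
WHERE c.relkind = 'r'
ORDER BY frozen_age DESC
LIMIT 20;
```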
pg_stat_database stores database-wide statistics, and we will be looking at
pg_catalog for table-level statistics.
Do we have a disk space problem? Let’s work from the top down.
We can start with our
current_database() and check there are no size surprises:
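A minimal sketch of that check, using the built-in pg_database_size function (it accepts a database name, so the same call works across all databases on the server):

```sql
-- Size of the current database
SELECT current_database(),
       pg_size_pretty(pg_database_size(current_database()));

-- Sizes of every database on the server, largest first
SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;
```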
We can reuse pg_catalog.pg_stat_user_tables from tracking down table-level index issues, but this time for disk space:
SELECT schemaname AS table_schema, relname AS table_name, pg_size_pretty(pg_relation_size(relid)) AS data_size FROM pg_catalog.pg_stat_user_tables ORDER BY pg_relation_size(relid) DESC;
We can also reuse pg_catalog.pg_stat_user_indexes from the index investigation, but now for disk space:
SELECT schemaname AS table_schema, relname AS table_name, indexrelname AS index_name, idx_scan AS idx_scans, pg_size_pretty(pg_relation_size(indexrelid)) AS index_size FROM pg_catalog.pg_stat_user_indexes ORDER BY pg_relation_size(indexrelid) DESC;
Disk space issues might also be coming from temporary files. When Postgres attempts to sort a very large set without sufficient memory, it spills to temp files. A query like
SELECT * FROM large_table ORDER BY column_with_no_index will try to sort the dataset in memory and then fall back to disk:
SELECT datname, temp_files, temp_bytes FROM pg_stat_database;
Hopefully some of this helps; let me know if there are any others you use!