Quickly debugging Postgres problems
15 August 2020
Uh-oh! Something just happened
One of these queries might quickly narrow down a problem with your Postgres database. With a little luck, the culprit has to do with one of three things: queries, indexes, or disk space. Postgres exposes a lot of internal information in system tables like pg_stat_activity and pg_stat_user_tables; if you are not already familiar with them, we will be looking at a few as we go:
- Queries, pg_stat_activity
  - Long running queries?
  - What is blocking my lock?
  - Stop processes
- Indexes, pg_stat_user_tables & pg_stat_user_indexes
  - Tables need vacuum?
  - Missing index?
  - Slow vacuum?
- Disk Space, pg_stat_database & pg_relation_size
  - Database size?
  - Table size?
  - Index size?
Queries
pg_stat_activity has information on currently running processes, and in our case, queries. We can get all long-running queries based on query_start, including the SQL that is being run and additional data about what the process is doing.
Long running queries?
SELECT
pid,
usename,
datname,
state,
NOW() - pg_stat_activity.query_start AS duration,
wait_event_type,
wait_event,
query
FROM pg_stat_activity
WHERE (NOW() - pg_stat_activity.query_start) > interval '1 minutes'
ORDER BY duration DESC;
There are a few cases you are likely to find here:
- A query like SELECT * FROM big_table has been running since an incident started
- A query like SELECT * FROM big_table WHERE should_have_an_index is missing an index
- A query is stuck on locks; the wait_event and wait_event_type columns have information which could indicate locking bottlenecks (info on what the field values mean)
- Lots of idle queries; filter them out with WHERE state <> 'idle' unless something may be opening more idle connections than we want (a quick count by state follows this list)
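If the idle-connection case looks likely, a small sketch like this counts connections by state using the same pg_stat_activity view:
SELECT
state,
COUNT(*) AS connections
FROM pg_stat_activity
GROUP BY state
ORDER BY connections DESC;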
In the case of locks being the bottleneck, we can use pg_blocking_pids to walk backward and find what is responsible for queries stumbling over each other's locks.
What is blocking my lock?
SELECT * FROM pg_stat_activity
WHERE pid IN (SELECT pg_blocking_pids(<pid of blocked query>));
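If you do not yet know which pid is blocked, a variation on the same idea (just a sketch) lists every blocked backend together with the pids blocking it:
SELECT
pid AS blocked_pid,
query AS blocked_query,
pg_blocking_pids(pid) AS blocking_pids
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;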
To take action, use either pg_cancel_backend or pg_terminate_backend: the former tries to cancel the running query, and if that does not work, the latter will terminate the connection.
Stop processes
SELECT pg_cancel_backend(<pid>);
SELECT pg_terminate_backend(<pid>);
Indexes
pg_stat_user_tables shows system information about tables. We can use this to see if our autovacuum or indexing strategy is not working for us.
Tables need vacuum?
SELECT
schemaname,
relname,
n_dead_tup,
n_live_tup,
round(n_dead_tup * 100.0 / n_live_tup, 2) AS percent_dead_tuples,
FROM pg_stat_user_tables WHERE n_live_tup > 0
ORDER BY n_dead_tup DESC;
A high number of n_dead_tup or percent_dead_tuples likely means our autovacuum is not running often enough on this table. This can cause queries and indexes to be less effective and slower; the solution is usually to autovacuum more frequently. The rest of the columns in the table can also help with identifying where to look next:
SELECT
schemaname,
relname,
seq_scan,
seq_tup_read,
idx_scan,
idx_tup_fetch,
n_tup_ins,
n_tup_upd,
n_tup_del,
n_tup_hot_upd,
n_mod_since_analyze,
last_vacuum,
last_autovacuum,
last_analyze,
last_autoanalyze,
vacuum_count,
autoanalyze_count,
autovacuum_count
FROM pg_stat_user_tables;
Some common issues:
- Infrequent vacuums or analyzes. last_autovacuum and last_autoanalyze will tell you if a table has not been vacuumed or analyzed recently, likely the case for tables with large n_dead_tup. They can also be paired with autovacuum_count and autoanalyze_count to see if autovacuum is running frequently (a per-table tuning sketch follows this list)
- Missing indexes. High seq_scan with low idx_scan. For healthy indexes we want most of our reads to not be coming from seq scans, though there are some cases where you can sequentially walk the whole table without a problem (e.g. SELECT * FROM table_with_10_rows needs no index)
- Not performing hot updates. High-usage tables with low n_tup_hot_upd. With lots of updates, there are big performance improvements from more hot tuple updates. The fix here might live in the application, in determining how to group affected rows together more easily
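If the infrequent-vacuum case applies, autovacuum can be tuned per table. A sketch, assuming a hypothetical busy table named big_table and default storage parameters (scale factor 0.2, threshold 50):
-- make autovacuum trigger sooner on this hypothetical table
ALTER TABLE big_table SET (
autovacuum_vacuum_scale_factor = 0.05,
autovacuum_vacuum_threshold = 1000
);
-- or run one by hand while investigating
VACUUM (VERBOSE, ANALYZE) big_table;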
The opposite of a missing index could be too many indexes slowing down updates. If an index has a low hit rate, we are doing an extra index update without any of the benefits.
Missing index?
SELECT
schemaname,
relname,
indexrelname,
idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan;
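Taking it a step further, a sketch like this lists indexes that have never been scanned, largest first, as candidates to drop:
SELECT
schemaname,
relname,
indexrelname,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;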
Another potential issue is that autovacuum is running but just taking too long.
Slow vacuum?
SELECT
datname,
usename,
pid,
state,
wait_event,
current_timestamp - xact_start AS xact_runtime,
query
FROM pg_stat_activity
WHERE upper(query) LIKE '%VACUUM%'
ORDER BY xact_start;
At the database level, vacuum needs to run periodically to clean up transaction IDs for Postgres to continue running. If you want to check this value before it gets too large (~2 billion):
SELECT max(age(datfrozenxid)) FROM pg_database;
There is a good article from Amazon on setting up alarms to monitor this in AWS. If the value is climbing or triggers an alarm, we can narrow down which database is responsible when there are multiple:
SELECT
datname,
age(datfrozenxid) AS frozen_age,
round(age(datfrozenxid)::decimal * 100 / 2000000000, 2) AS frozen_age_percent
FROM pg_database
ORDER BY frozen_age DESC
LIMIT 20;
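Within a single database, a similar sketch over pg_class shows which tables are carrying the oldest unfrozen transaction IDs:
SELECT
oid::regclass AS table_name,
age(relfrozenxid) AS frozen_age
FROM pg_class
WHERE relkind = 'r' -- ordinary tables
ORDER BY age(relfrozenxid) DESC
LIMIT 20;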
Disk space
pg_stat_database stores database-wide statistics, and we will be looking at pg_catalog for table-level statistics.
Do we have a disk space problem? Let’s work from the top down.
Database size?
We can get our current_database() and check that there are no size surprises:
SELECT pg_size_pretty(pg_database_size(current_database()));
Table size?
Using pg_catalog.pg_stat_user_tables, which we used for tracking down table-level index issues, but this time for disk space:
SELECT schemaname AS table_schema,
relname AS table_name,
pg_size_pretty(pg_relation_size(relid)) AS data_size
FROM pg_catalog.pg_stat_user_tables
ORDER BY pg_relation_size(relid) DESC;
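Note that pg_relation_size only counts the table's main data; a sketch with pg_total_relation_size instead also includes indexes and TOAST:
SELECT schemaname AS table_schema,
relname AS table_name,
pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC;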
Index size?
Using pg_catalog.pg_stat_user_indexes from the index investigation, but now for disk space:
SELECT schemaname AS table_schema,
relname AS table_name,
indexrelname AS index_name,
idx_scan AS index_scans,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_catalog.pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;
Disk space issues might be coming from temporary files. When Postgres attempts to sort a very large set without sufficient memory, it uses temp files. A query like SELECT * FROM large_table ORDER BY column_with_no_index will try to sort the dataset in memory and then spill to disk:
SELECT datname, temp_files, temp_bytes
FROM pg_stat_database
ORDER BY temp_bytes DESC;
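To confirm which query is spilling, a quick sketch with EXPLAIN (using the hypothetical table and column from above) will show a sort method of external merge along with how much disk it used:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM large_table ORDER BY column_with_no_index;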
Hopefully some of this helps. Let me know if there are any other queries you use!