Linux x86-64 (RHEL 8) 14 SQL Profiler
Kuntal Ghosh Senior Software Engineer Dec 9, 2019
SUMMARY: This article describes a problem in PostgreSQL where logical replication can cause a server to run out of memory. It explains the cause of the problem, demonstrates the severity of the problem with an example, then provides a solution introduced in PostgreSQL 13, the logical_decoding_work_mem parameter.
1. The problem
2. The severity of the problem
3. The solution
PostgreSQL 10 introduced logical replication that opens up the opportunity for implementing various important database features, including multi-master replication, bi-directional replication, replication among Postgres and other databases, and more. Since then, a lot of work has been done to improve various aspects of logical replication, and PostgreSQL 13 is no exception. In this release, we've addressed an issue in the memory management of logical replication that might cause your server to run out of memory.
In an earlier blog, I discussed the basic architecture of logical replication, with some examples to show how easy it is to start with. In short, logical replication in Postgres is a publish-subscribe model, where a subscriber node can subscribe to a publisher and receive logical changes from the publisher in real time. When a subscription is created, a walsender process is spawned in the publisher node. The walsender process is responsible for reading the WAL one-by-one and reassembles the changes belonging to each transaction. When a transaction commits, it decodes the corresponding changes and sends them to the subscribers. When a transaction aborts, the walsender simply discards the changes.
The walsender process maintains an in-memory hash table to track the changes of each ongoing transaction, where the transaction id is the key and the changes are appended in the corresponding hash entry. Once the changes are sent to the subscriber, it deletes the hash entry and frees the memory. It is easy to see that we need some kind of mechanism to place a memory bound on the number of changes and spill the rest of the data to disk. Until now, whenever a transaction or sub-transaction has accumulated 4096 changes, the respective changes are spilled to disk. Whenever a top-level transaction (i.e., the root of a tree of transactions formed from a nested transaction) commits, it reads back all the changes from the disk before processing it. You can see how quickly the walsender can pile up a huge amount of memory.
The severity of the problem
Peter Eisentraut has explained the severity of the problem in a subtle way:
"Let's say a change entry takes 100 bytes (it might contain a new row, or an update key and some new column values, for example). If you have 100 concurrent active sessions and no sub-transactions, then logical decoding memory is bounded by 4096 * 100 * 100 = 40 MB (per walsender) before things spill to disk.
Now let's say you are using a lot of sub-transactions, because you are using PL functions, exception handling, triggers, doing batch updates. If you have 200 sub-transactions on average per concurrent session, the memory usage bound in that case would be 4096 * 100 * 100 * 200 = 8 GB (per walsender). And so on. If you have more concurrent sessions or larger changes or more sub-transactions, you'll use much more than those 8 GB. And if you don't have those 8 GB, then you're stuck at this point."
Here are some steps to reproduce the problem. First, let's create a table in the publisher node and publish the table:
postgres=# create table t1(a text); CREATE TABLE postgres=# create publication my_pub for table t1; CREATE PUBLICATION
Next, configure the subscriber node:
postgres=# create table t1(a text); CREATE TABLE postgres=# CREATE SUBSCRIPTION my_sub CONNECTION 'host=localhost port=54321 dbname=postgres' PUBLICATION my_pub; NOTICE: created replication slot "my_sub" on publisher CREATE SUBSCRIPTION
Now, I’ve used the following python script to simulate the scenario:
#!/bin/python import pg8000 # declare an empty connection array conn =  num_trans = 100 num_subtrans = 200 # create multiple connections for x in range(num_trans): conn.append(pg8000.connect('edb', 'localhost', None, 54321, 'postgres', 'password')) # open some subtransactions from each connection and perform 4000 inserts from each # subtransaction for x in range(num_trans): conn[x].autocommit=False cursor = conn[x].cursor() print "Running transaction: ", x for y in range(num_subtrans): cursor.execute("INSERT INTO t1 (a) SELECT (repeat(%s, 100)) FROM generate_series(1, 4000)", ('a')) cursor.execute("SAVEPOINT a1") input = raw_input("Should we commit? (y/n) : ") if input.lower() in ['y', 'yes']: for x in range(num_trans): cursor = conn[x].cursor() cursor.execute("commit") else: for x in range(num_trans): cursor = conn[x].cursor() cursor.execute("rollback")
Once you run this script, use the top -p <walsender pid> to monitor the memory usage of the walsender process in the publisher node. You would see the walsender process accumulates a huge amount of memory (GBs) in no time.
In PostgreSQL 13, this commit (Author: Tomas Vondra, with changes by Dilip Kumar and Amit Kapila) adds a configuration parameter, logical_decoding_work_mem, which controls the amount of memory allocated to a walsender process to keep the changes in memory before spilling the same to disk. You can decrease the value to reduce the memory usage of walsenders or set a higher value to reduce disk writes. The logical_decoding_work_mem parameter can be set in postgresql.conf, where it serves as the default for all publishers on that instance. This parameter doesn’t require any restart of the server, but if it is set in postgresql.conf, you’ll need to reload the config files. The default value is 64MB. With this setting, if you run the above example, you'll observe that the memory of the walsender process is nicely bounded. It would still allocate some additional memory per transaction to maintain the in-memory data structures, but it solves the actual problem.