GitLab, thanks for using PostgreSQL 9.6 and its replication and backup facilities.
We’re sorry that you lost your database:
http://about.gitlab.com/2017/02/01/gitlab-dot-com-database-incident/
Thank you for posting this publicly to allow us to comment on this for your postmortem analysis.
I’m very happy that you monitor replication lag; that is good practice. A replication lag of 4GB is at times normal, so it shouldn’t by itself have caused major concern. I’ve recently fixed a bug that caused replication to hang in some cases for up to a minute; a public fix has been released and it will be included in the next maintenance release of PostgreSQL 9.6. It’s not certain that this bug was hit or, if it was, whether it was enough to cause the slowdown you noted. Your openness deserves the same openness in return, so I’m mentioning the issue here for that reason.
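For anyone following along, this is one way to watch replication lag directly from the primary on PostgreSQL 9.6; it is a minimal sketch, not a statement about how gitlab.com’s monitoring was set up:

    -- run on the primary (PostgreSQL 9.6)
    -- replay_lag_bytes is the amount of WAL the standby has yet to apply
    SELECT application_name,
           state,
           pg_xlog_location_diff(pg_current_xlog_location(),
                                 replay_location) AS replay_lag_bytes
    FROM pg_stat_replication;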
Restarting replication was probably unnecessary, but if you shut down the standby node cleanly then replication should disconnect gracefully. A graceful disconnect releases the WALSender it was using, so on reconnect a free WALSender slot should be available. There should be no need to alter the max_wal_senders parameter in normal usage. It therefore appears that the standby node was shut down abruptly, preventing a clean shutdown of the replication connection.
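For reference, a clean standby shutdown looks like this; the data directory path is an assumption for illustration only:

    # on the standby: "fast" is still a clean shutdown mode and lets the
    # replication connection disconnect gracefully ("immediate" does not)
    pg_ctl -D /var/opt/gitlab/postgresql/data stop -m fast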
Because the shutdown wasn’t clean, the old WALSender slot wasn’t released quickly enough, and the reconnect attempt required an additional slot, which hit the max_wal_senders limit and made it look as though the setting needed to be increased. I don’t know what max_wal_senders was set to before, but it shouldn’t have needed to be as high as 32; 2-4 would be normal. max_connections = 8000 is a massively high figure, so reducing it to 2000 makes sense anyway.
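By way of illustration only, and not as tuned recommendations for gitlab.com, settings in that range would look like this in postgresql.conf:

    # postgresql.conf – illustrative values, not a tuned recommendation
    max_wal_senders = 4        # one per standby, plus headroom for pg_basebackup
    max_connections = 2000     # reduced from the original 8000, as noted above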
pg_basebackup first issues a checkpoint to ensure it gets the latest data for the backup, which by default is a paced checkpoint to avoid hitting the disks too hard. As a result it can take a few minutes, typically about 4 minutes with default settings, to start transferring data. During that time the pg_basebackup process is quiet while it waits for the checkpoint on the master to complete.
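If that wait is a concern, pg_basebackup can ask for an immediate checkpoint instead; the host name and target directory below are placeholders:

    # --checkpoint=fast requests an immediate rather than a paced checkpoint,
    # so the base backup starts without the few minutes' silent wait
    pg_basebackup -h PRIMARY_HOST -D /path/to/new/standby/data \
        -X stream --checkpoint=fast --progress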
In this sequence of actions I see no new PostgreSQL bugs, nor problems caused by the software itself. Some operational issues were encountered, but publicly available tools exist to ensure these things work correctly.
Removing a data directory by hand is dangerous, as this incident proved. Managing replication using scripts is normal, so a test in the script to check that replication is being set up on a standby rather than on the master would have trapped that error; a minimal sketch of such a guard appears below. We recommend the use of the repmgr tool for this purpose: https://www.2ndquadrant.com/resources/repmgr/. repmgr 3.3 was released on 6 Jan 2017 and is maintained regularly.
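As a sketch of the kind of guard I mean (the hostname and data directory here are hypothetical, and repmgr makes most of this unnecessary):

    #!/bin/sh
    # Refuse to wipe a data directory unless this host is the standby we expect
    # and the local server is not running as a primary.
    EXPECTED_STANDBY="db2.example.com"            # hypothetical standby hostname
    PGDATA="/var/opt/gitlab/postgresql/data"      # hypothetical data directory

    [ "$(hostname -f)" = "$EXPECTED_STANDBY" ] || {
        echo "refusing: this host is not $EXPECTED_STANDBY" >&2; exit 1; }

    # a running server that is NOT in recovery is a primary – never wipe it
    if pg_isready -q && [ "$(psql -AtXc 'SELECT pg_is_in_recovery()')" = "f" ]; then
        echo "refusing: this server is a primary" >&2; exit 1
    fi

    rm -rf "${PGDATA:?}"/*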
Recovering from backup is obviously critical, which is why we provide the barman tool for managing backups. Barman works with S3 and is fully tested to ensure it works when things go bad: https://www.2ndquadrant.com/resources/barman/. Barman 2.1 was released on 5 Jan 2017 and is maintained regularly.
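A typical barman workflow looks like this; the server name “gitlab” is just a placeholder for whatever is configured in barman’s own configuration:

    # verify the backup configuration, take a base backup, list what is stored
    barman check gitlab
    barman backup gitlab
    barman list-backup gitlab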
I’ve been spreading the meme for some years that you should test your backups by practising recovery from them, and we teach that on our courses and at conferences. I recommend the use of repmgr and barman for managing replication and backup respectively.
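Practising recovery can be as simple as regularly restoring the latest backup to a scratch location and starting it; the paths below are illustrative, and port or configuration adjustments may be needed before the restored server will start alongside a live one:

    # restore the most recent backup to a scratch directory and verify it starts
    barman recover gitlab latest /var/lib/postgresql/9.6/restore_test
    pg_ctl -D /var/lib/postgresql/9.6/restore_test start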
We’re sorry for any frustration you may have experienced and we’re open to suggestions for how to improve PostgreSQL.
2ndQuadrant is the initial author of core PostgreSQL’s backup technologies and has extended the backup and replication technologies continuously since version 8.0. 2ndQuadrant provides 24/7 Support for PostgreSQL in the event of emergencies, as well as training worldwide.
Simon Riggs, CTO 2ndQuadrant & PostgreSQL Committer