May 14, 2020
Performance complaints are not entirely new to a DBA. With the large number of possible parameters and varying loads which can affect performance, it can be a mind-numbing affair to narrow things down and arrive at a solution.
One such case which we had come across lately forced us to look away from the usual suspects (Memory and OS) and take a deeper dive into actual user actions. This blog is a review of what we had seen and how we went about arriving at a solution.
Another day in the office was made exciting by a client calling in, explaining a problem they faced with slow performance which was severely affecting his team’s business goals. We started of with the usual line of investigation:
What is slow?
- Single query
- Multiple queries
- Queries to a certain relation or a group of relations
The answer? “Multiple queries.”
So, we looked at the usual suspects first to rule them out.
- Check the load of the machine
- Measure the network latency
These two constraints often account for a fair number of performance issues we see our customers reporting, and our goal is to eliminate as many obvious environmental factors as we can while we work to isolate the root cause.
What’d we find? The load was high on the machine, but the network latency seemed in line. So we looked toward the parameters in both PostgreSQL and the operating system to see what was in use.
- Memory Parameter in Postgres:
- OS settings
Interestingly enough, all four seemed to be set to optimal values based on their load profile and machine size. Except the problem hasn’t always been there, so what changed?
We went back to the user looking for clues. “Was there a change in application? Any large changes to the dataset?” No and no, at least not to their knowledge.
The only option for us was to do a deep dive into the database logs, and try to make sense of what was happening. If you’re familiar with Postgres, you know that the logs are there for the reading, if you have them configured to the right level to capture the details you need to ferret out problems like this. It looked like they did, so we helped them make use of a common query monitoring tool, PgBadger.
PgBadger Log Analyzer for PostgreSQL Query Performance Issues
PgBadger is a PostgreSQL log analyzer with fully detailed reports and graphs. It is open source and is considered lightweight, so where this customer didn’t have access to a more powerful tool like Postgres Enterprise Manager, PGBadger fit the bill. While it doesn’t have the depth, probe capability or tool integration that PEM does, it’s particularly useful in helping you to analyze your slowest queries, and also does a good job of profiling Temporary Files, Connection Peak, Autovacuum/Autoanalyze, Checkpoints, etc which can help in demonstrating overall health of the database.
The higher level structure of the report:
- Global Stats
- Queries: total volume, duration
- Events : the volume of events, the number of normalized events
- Vacuum: the number of autovacuum and auto-analyzes
- Temporary Files: the total number of temp files, as well as the average and max size of these files
- Connections: total connections, peak load of concurrent connections, and the total number of databases to which these are connected
- Sessions: total sessions, peak timing of sessions, the total time of these sessions, average length of sessions, as well as the average queries per session and the average duration of each query.
- Traffic profile: trend of SQL traffic, SELECT traffic, and INSERT/UPDATE/DELETE traffic over time expressed as mins, maxes and average
The reports may or may not show a graph depending on the parameters, as they’re driven by how you’ve configured your postgresql.conf file, but we’ll get into that in just a moment.
The great thing about these reports is that not only does it give you a graphical view of your queries, but
You can download each report separately using the Download tab below each graph, as shown in the example graph here:
In order to find the data from a graph, you can check where the graph is at peak and track from there.
PgBadger and the postgresql.conf
We’d mentioned above that there are some changes you have to make in order to make PGBadger work well for you. Like what? A brief word on that can be found on the github project for PGBadger itself, with an outline of all changes like log_checkpoints, log_lock_waits, or log_autovacuum_min_duration.
Some of the key changes begin with log_min_duration_statement.
log_min_duration_statement = 0
While you should typically have this set to 0, you may want to increase this value so as to only log queries with a longer duration in order to burden the system less. It’s worth noting, if you have log_statement set to 'all', nothing will be logged through the log_min_duration_statement directive.
Log line prefix for stderr output should look something like:
log_line_prefix = '%t [%p]: [%l-1] db=%d,user=%u,app=%a,client=%h'
The rest of the basic changes you’ll have to make are fairly straightforward, and help you to ensure that the logs provide all the data needed to paint an appropriate picture of trending behaviors:
log_checkpoints = on log_connections = on log_disconnections = on log_lock_waits = on log_temp_files = 0 log_autovacuum_min_duration = 0 log_error_verbosity = default
For a full view of any changes you may want to make, you can find the deeper dive here.
Once the customer had pgbadger in place and they’d made all necessary changes to their postgresql.conf, the reports it generated gave us three trends to allow us a way to identify the cause of the behavior:
- Checkpoint overload: We found that there was an above-average frequency of checkpoints being taken. A high frequency of checkpoints is a strong indicator that there was significant amount of change occurring to the existing records in the database; this wasn’t just about inserts or updates, but constant updates across many records throughout the database.
- High connection load: We saw a sustained spike in the number of concurrent active sessions, which was sharply higher than what was trending in the weeks before.
- Query contention: Locks on certain relations meant there was a backlog of queries waiting for these locks to be released.
The three major factors should help to underscore why the configuration of your .conf file is so important to proper use of PGBadger— had we not done so, none of these would have come to the surface.
With these signs in hand, we worked with the client to discuss the findings so far.
The cumulative effect of checkpoints, parallel sessions and locking were having a domino effect on the query execution time. Also, large amounts of data changes were causing stats on the table to get outdated quickly till autovacuum could update them which would lead to a change in query execution plans.
As it turned out, there was a year-end peak load in their own end-customer activities, which led to an increase in invoicing activities on their application. Sure, the system was tuned properly to work for a steady state usage, but the peak activity and its additional load was never part of that original equation.
The Solution for the Postgres Query Performance
We implemented the following plan:
- Increasing the checkpoint_completion_target We had them change the value to .9. Doing so helped to space out the checkpoints more evenly, which allowed for a reduction of the overall I/O load.
- Increase maintenance_work_mem Increasing their maintenance_work_mem value meant that there was more room given to autovacuum to complete faster, which meant statistics would be more current and the plans more accurate.
- Perform a VACUUM ANALYSE. Performing a VACUUM ANALYZE makes for query statistics which are more accurate; this can reduce query completion time and in turn, the locking events they experienced.
Once the action plan was complete, the customer happily reported that things were back to normal and performing as they needed it to. PgBadger’s reports, with the proper analysis and configuration, gave us and the customer the insights necessary to get them back to good health.
Join Postgres Pulse Live!
This problem, like many others we’ve outlined in past publications, is a challenge we see our customers, and Postgres users at large, face regularly. Our goal is to continue to provide meaningful insights here to help you understand not just the answers to your problems, but how we arrive at a solution. Looking for more help just like this? Here you can find all things Postgres Pulse, including all of our blog posts and our YouTube series.
Join us on Monday, May 18th, for our next Pulse Live Session! We’ll be bringing the usual mash-up of questions as they relate to not just query performance woes, but overall log-chasing and trend-tooling for the Postgres user. You can ask your questions via email at email@example.com, hashtag on Twitter, or live during the event.
This blog is co-authored by Mohini Ogale & Amit Sharma.