This article offers tips for improving Online Analytical Processing (OLAP) performance for PostgreSQL.
- Sizing up the Situation
- Schema Planning
- Database Tuning
- Hardware Selection
As the world’s most advanced open-source database, PostgreSQL is very versatile. With applications ranging from embedded systems to large-scale multi-terabyte NoSQL data stores, a PostgreSQL deployment can be tweaked to suit just about any situation. In this article, we’ll cover some simple tips for making sure PostgreSQL handles your Online Analytical Processing (OLAP) workload without skipping a beat.
Sizing up the Situation
When it comes to working with OLAP, the mindset is very different from what the vast majority of developers and DBAs are trying to do—get the database to serve up thousands of CRUD operations (create, read, update, delete) per second to a horde of users, while figuring out how to scale up as demand for resources increase. With OLAP, the number of users is typically going to remain very low, but the kinds of queries that are run will be very complex. Some of these queries can run for hours to help the OLAP application mine data for insights and make predictions. With this in mind, DBAs can typically improve PostgreSQL’s handling of this workload by using a combination of three approaches:
- Schema planning
- Database tuning
- Hardware selection
Someone once said, “Give me six hours to chop down a tree and I will spend the first four sharpening the axe.” Following that advice means that we ought to spend a significant amount of time trying to understand the OLAP application, how it works, and what queries it generally uses, then configuring our schema and procedures to efficiently send data back to the application. For example, since much OLAP work involves date-driven, location-driven, or otherwise partitionable data, it’s highly recommended for fact tables to be partitioned in a way to allow the OLAP application to easily access the data without any additional legwork.
Another way to leverage schema planning to give clean and usable data to the OLAP application is by creating materialized views. PostgreSQL allows for the creation of materialized views, which can in turn be indexed for maximum query performance. What’s nice is that once data is ingested (say, every night), the view can be refreshed with a simple REFRESH MATERIALIZED VIEW command, and the data will be ready for the OLAP application to pull.
Just as a reminder, the general principles for performant schema design apply here as well—make sure you have primary keys on your fact and dimension tables. As you develop and test, print out your longer-running queries by setting log_min_duration_statement to 10000 or higher, and be sure to index columns that are often used for ORDER BY, WHERE, and GROUP BY clauses for those slower queries.
In addition to organizing your schema well, be sure to leverage PostgreSQL’s syntax features by writing procedures that take advantage of window functions and table expressions. Some have also opted to leverage PostgreSQL’s extensive implementation of Foreign Data Wrappers (FDW) to distribute data across several servers and thereby offset some of the computational work in aggregating data.
It is important to remember that for an OLAP workload, there are going to be generally fewer users connecting to the database at once. For this reason, many per-session parameters can be tuned so that each active session gets a bigger share of the allocable memory. For example, be sure to tune max_connections down very low, perhaps even below 25, depending on what your usage footprint will look like; you may even want to look into a connection pooler, like pgpool or pgbouncer, to allow several client sessions to share a database connection.
Other parameters to tune include:
- shared_buffers — be sure to set this pretty high, at least 25% of your available RAM. The goal is to store as much data in the cache as possible, so as not to be hit by expensive I/O operations on disk.
- effective_cache_size — be sure to tune this upward as well, so as to tell the query planner that it can afford to assume that much of a query’s data is already in the filesystem cache and thereby favor index scans. You can probably afford to set this to the remainder of RAM available, so roughly 70% of your RAM, in accordance with the shared_buffers value above, minus some overhead for kernel operations.
- work_mem — set this high. Basically, try to give as much memory to each session as possible for merging and sorting. A good starting point would be around 2% of your RAM, and tune it upward if you still need more; remember, EXPLAIN ANALYZE will help you identify any I/O bottlenecks that can be addressed with more work_mem.
- maintenance_work_mem — since you want your VACUUMs to finish quickly after the data has been loaded every night, you’ll want this value to be fairly high, maybe around 15% of your RAM. This will also help you speed up index creation.
It goes without saying that having fast hardware will help you with your OLAP workload. You’ll want an array of SSDs to quickly retrieve your data, because it will likely not all fit in memory. A RAID 10 configuration will ensure both performance and redundancy. Don’t skimp out on RAM and CPU cache—these go a long way to help you keep data in memory and avoid costly I/O. Whatever caching you can enable, from the filesystem to the OLAP application, be sure to enable it in order to give yourself the best performance.
PostgreSQL is a powerful database, and for OLAP workloads, it can certainly meet expectations. With a good deal of planning and tuning, the database engine will be able to deliver analytics at scale.