Managing large tables is a big challenge. But maintaining good performance and manageability for those large tables is even a bigger challenge. Luckily, Postgres 11 provides several ways of dealing with this problem. In our series on Postgres performance, we will discuss table partitioning in this first part and indexing in the next. In this article, we describe different Postgres partitioning options, how to configure partitioning, and the use cases where partitioning can be used to increase performance. In addition, we describe the other benefits of partitioning as well as things to consider when deciding whether to use partitioning.
A partitioned table is a logical structure used to divide a large table into smaller pieces called partitions. To divide data into partitions, we define a partition key and a partitioning method. The partition key is usually a column in the table, but it can also be an expression. The partitioning method is part of a partitioned table declaration and is determined according to the use case.
DBAs and Developers who are familiar with this technique can improve query performance and ease of maintenance. In this article, we create a partitioned table in Postgres and demonstrate how the optimizer can use it to improve query performance.
Three Partitioning Methods
Postgres provides three built-in partitioning methods:
- Range Partitioning: Partition a table by a range of values. This is commonly used with date fields, e.g., a table containing sales data that is divided into monthly partitions according to the sale date.
List Partitioning: Partition a table by a list of known values. This is typically used when the partition key is a categorical value, e.g., a global sales table divided into regional partitions. The partition key in this case can be the country or city code, and each partition will define the list of codes that map to it.
Hash Partitioning: Partition a table using a hash function on the partition key. This is especially useful when there is no obvious way of dividing data into logically similar groups and is often used on categorical partitioning keys that are accessed individually. E.g., if a sales table is often accessed by product, the table might benefit from a hash partition on the product SKU.
Deciding on the partitioning key and method is crucial to performance, as, when set correctly, partitions can have a significant positive effect on performance and manageability.
Creating a Partitioned Table
To demonstrate how partitioned tables work in Postgres, start by creating a sales table:
CREATE TABLE sale (
sale_date date not null,
) PARTITION BY RANGE (sale_date);
The sales table contains the aggregate amount of units sold for each day, country, and product. The table is used to produce monthly and daily sales reports, so you can benefit from range partitions on the sale_date column.
Once the partitioned table is defined, you can start adding partitions to it. Each partition defines a range of values for the partition key:
CREATE TABLE sale_201901 PARTITION OF sale
FOR VALUES FROM ('2019-01-01') TO ('2019-02-01');
CREATE TABLE sale_201902 PARTITION OF sale
FOR VALUES FROM ('2019-02-01') TO ('2019-03-01');
You added two partitions, one for January 2019 and another for February 2019.
This way of creating partitions is called “declarative partitioning.” Prior to Postgres 10, only EDB Postgres Advanced Server provided declarative partitioning syntax. In open source PostgreSQL, partitioned tables were declared using table inheritance. Declarative partitions are a bit more restricted than inheritance-based partitions, but they are easier to understand and maintain and are suitable for most use cases.
To complete the partition table, you need some data:
INSERT INTO sale
(sale_date, country_code, product_sku, units)
('2019-01-01', 'NY', 'P-000A1', 65),
('2019-01-02', 'NY', 'P-000A1', 3),
('2019-01-03', 'NY', 'P-000A2', 43),
('2019-02-01', 'NY', 'P-000A2', 69),
('2019-02-02', 'NY', 'P-000A1', 2),
('2019-02-03', 'NY', 'P-000A1', 17),
('2019-01-01', 'NJ', 'P-000A2', 70),
('2019-01-02', 'NJ', 'P-000A2', 98),
('2019-01-03', 'NJ', 'P-000A1', 39),
('2019-02-01', 'NJ', 'P-000A1', 17),
('2019-02-02', 'NJ', 'P-000A2', 82),
('2019-02-03', 'NJ', 'P-000A2', 46);
The database will add each row to it’s appropriate partition based on the partitioning key, sale date, and the range defined for each partition.
Creating a Default Partition
To better understand how Postgres is dividing data into partitions, this is what happens when you try to insert a row with a sale date for which you haven’t yet defined a partition:
db=# INSERT INTO sale
db-# (sale_date, country_code, product_sku, units)
db-# ('2019-04-01', 'NY', 'P-000B1', 65);
ERROR: no partition of relation "sale" found for row
DETAIL: Partition key of the failing row contains (sale_date) = (2019-04-01).
When you tried to insert a row for April 2019, the command failed because there is no partition for April in the table.
To avoid this, you can add a default partition:
CREATE TABLE sale_default PARTITION OF sale FOR DEFAULT;
The default partition will store rows that did not match any other partition:
db=# INSERT INTO sale (sale_date, country_code, product_sku, units)
db-# VALUES ('2019-04-01', 'NY', 'P-000B1', 65);
INSERT 0 1
To make sure the row was added to the default partition, query the default partition directly:
db=# select * From sale_default;
sale_date | country_code | product_sku | units
2019-04-01 | NY | P-000B1 | 65
It’s not always useful to add a default partition. You may want to make sure that all rows are fitted nicely into the predefined partitions, as storing rows in the default partition can affect performance under some circumstances and can make maintenance more difficult.
Partitioning a table enables the query engine to eliminate entire partitions based on specific conditions in the query, if the query refers to the partitioning key.
For example, let’s look at the execution plan of a query to fetch sales from January 2019:
db-# EXPLAIN SELECT * FROM sale WHERE sale_date BETWEEN '2019-01-01' and '2019-01-31';
Append (cost=0.00..22.17 rows=4 width=72)
-> Seq Scan on sale_201901 (cost=0.00..22.15 rows=4 width=72)
Filter: ((sale_date >= '2019-01-01'::date) AND (sale_date <= '2019-01-31'::date))
The execution plan shows that Postgres scanned only a single partition, sale_201901. The query planner was smart enough to understand that, given the predicate in the query, the data can only reside in this specific partition.
The ability of the query planner to eliminate entire partitions is called partition pruning. Partition pruning is one of the main motivations for using a partitioned table, as it reduces IO by eliminating entire chunks of the table. When a query requires less IO, it’s usually much faster. To enable partition pruning, make sure the parameter enable_partition_pruning is on.
To better understand partition pruning, let’s examine an execution plan of a query that cannot benefit from partition pruning:
db=# EXPLAIN SELECT * FROM sale WHERE country_code = 'NY';
Append (cost=0.00..40.29 rows=8 width=72)
-> Seq Scan on sale_201901 (cost=0.00..20.12 rows=4 width=72)
Filter: (country_code = 'NY'::text)
-> Seq Scan on sale_201902 (cost=0.00..20.12 rows=4 width=72)
Filter: (country_code = 'NY'::text)
The query fetched sales made in NY. According to the execution plan, the query performed two sequential scans on both partitions of the table. The query cannot benefit from partition pruning because the partition key, sale_date, is not present in the query WHERE clause.
When to Use Partitioned Tables
As we’ve seen in the previous section, the main performance benefit of using partitioned tables stems from the query planner’s ability to eliminate entire partitions and thus reduce IO and make the query faster.
To decide whether a table is a good candidate for partitioning, the following considerations should be taken into account:
When most queries need to access only one or few partitions, the database will be able to utilize partition pruning. However, when queries often access more than a few partitions, partitioning may reduce performance.
When the partition key is used as a predicate or group by, the query planner can utilize it to improve the query performance.
Unlike other performance enhancing methods such as indexing, partitions are part of the table definition and are significantly harder to change.
To ensure optimal results, the partition key and method should be derived from business requirements. For example, if a table is used to generate yearly sales reports, it might make more sense to partition the table by year, not by month. If the sales table is mostly used to query for specific products or country codes, it might also make more sense to use a different partitioning method and key.
Additional Benefits of Partitioned Tables
Aside from the performance benefits, partitioned tables are also useful in other ways.
A useful feature is the ability to attach and detach partitions from a partitioned table. Managing a large table in small chunks makes it easier to implement data loading processes.
A common Extract Transform and Load (ETL) process usually starts by loading data from remote data sources into a temporary table, followed by some processing of the data. Once the processing is complete, the data in the temporary table is loaded into the main table.
Using partitions, we can attach an existing table into a partitioned table via a single command:
db=# CREATE TABLE sale_201903 (LIKE sale);
db=# \d sale_201903
Column | Type | Collation | Nullable | Default
sale_date | date | | not null |
country_code | text | | |
product_sku | text | | |
units | integer | | |
db=# INSERT INTO sale_201903
db-# (sale_date, country_code, product_sku, units)
db-# ('2019-03-01', 'NY', 'P-000A1', 165),
db-# ('2019-03-01', 'NY', 'P-000A2', 615),
db-# ('2019-03-02', 'NJ', 'P-000A1', 382),
db-# ('2019-03-02', 'NJ', 'P-000A2', 482);
INSERT 0 4
db=# ALTER TABLE sale ATTACH PARTITION sale_201903
db-# FOR VALUES FROM ('2019-03-01') TO ('2019-04-01' );
The benefit of this approach is that ATTACH is a DDL command. DDL commands manipulate table metadata, so they happen near instantly with minimal interference to ongoing queries on the table.
When managing huge amounts of data, it’s sometimes necessary to set up some sort of lifecycle policy. Partitions makes it easier to maintain and implement such policies. For example, if a company is required to keep sales information only for one year, old partitions can be detached and deleted easily:
db=# ALTER TABLE sale DETACH PARTITION sale_201803;
db=# DROP TABLE sale_201803;
Just like attaching a partition, detaching a partition is a DDL command which causes minimal interference to ongoing operations and happens near instantly. Also, removing this large chunk of the table will not cause any bloat because the data is dropped and not deleted.
Storage parameters can also be set at the partition level. For example, old partitions can be moved to a tablespace using cheaper storage:
ALTER TABLE sale_201802 SET TABLESPACE cheap_storage_tbs;
Old partitions that are no longer being updated can be set with a lower fillfactor to utilize space better, reduce IO for queries, and improve overall performance:
ALTER TABLE sale_201902 SET (fillfactor = 10);
VACUUM FULL sale_201902;
Managing large tables in partitions makes lifecycle management easier. Partitions are just tables, so a lot of the operations that were executed at the table level can now be applied to parts of a table.
Partitioned tables come with a lot of benefits, but they also have some restrictions. These are some of the issues to factor in when considering partitioning a table:
The partition key is usually not the primary key of the table. Adding the partition key to the primary key constraint makes a composite primary key, which may pose a challenge for some ORMs.
It is possible to create primary key and unique constraints on partitioned tables, but these cannot be referenced by foreign keys. In a data warehouse environment, partitioned tables are normally used for fact tables, which are usually not referenced by foreign key constraints. However, in an OLTP environment, this is not always the case.
Tables with a lot of partitions make it more difficult for the query optimizer to come up with an execution plan. As a rule of thumb, a few hundred partitions is fine, but more than that may be an issue. In data warehouse environments, query planning time is usually not a problem because a lot of data is being accessed, and queries tend to take longer. In OLTP environments, however, longer query planning time may become an issue.
Before row triggers are often used to populate calculated fields or implement business rules. To set these triggers on a partitioned table, they must be defined on each individual partition, adding overhead to the maintenance of partitioned tables.
When a query is unable to benefit from partition pruning, it might become slower. To avoid such cases, it’s important that the partition key and method follow strict business rules.
A common maintenance task for partitioned tables is adding new partitions. This is mostly relevant for list and range partitions. In addition, since each partition is a table on its own, managing multiple tables might require additional maintenance.
EDB Postgres has provided additional partitioning capabilities, such as sub-partitioning, into EDB Postgres Advanced Server.
Even given all the above considerations, table partitioning is very useful in many circumstances.
Partitions are a significant tool in a DBA’s and System Architect’s tool chain. When used correctly and in accordance with business requirements, they can have a significant impact on performance by allowing the database to use partition pruning, which eliminates irrelevant partitions, reduces IO, and improves query time. Partitions can also support complex ETL processes by allowing developers and DBAs to operate on individual partitions. Lastly, partitions can facilitate lifecycle management policies at the partition level with minimal impact to users.
Deciding on a partitioning strategy is a crucial decision. And choosing the right one will make all the difference.