Combining Partitioning and FDWs for Real Time Analytics

March 14, 2017

Since v9.1 PostgreSQL has been steadily improving its Foreign Data Wrapper (FDW) capabilities. FDWs are a way to access and manipulate data external to PostgreSQL from a PostgreSQL server. The technology is based on the SQL/MED standard, which represents foreign data in the form of relational tables called foreign tables. Starting with v10, PostgreSQL will support declarative partitioning, similar to the partitioning support in many popular DBMSs. What’s more interesting is that these partitions can be foreign tables, bringing two great technologies together! Here’s a glimpse of what this confluence will do.

Real-time Analytics Scenario

Consider a database recording purchases made by customers in various stores. Since the customers have an affinity towards stores in regions they are located, each region has its own server, which records transactions by customers within that region. A region here can be as large as a continent or as small as a province, depending upon the nature of business, volumes of transactions, etc. The transactions are required to be analyzed at a central server periodically for deciding promotions, product distribution, etc. The traditional method to do so would be to dump-load-analyze.

With the help of partitioning and FDWs, the central server can create tables partitioned by region to consolidate all the data. The partitions would be foreign tables pointing to their counterparts in regional servers. This saves time spent in dumping data on regional servers and loading it to the central server. Also, the reports on the central server can be generated any time. Regional standbys can be used as foreign servers in case the central server only generates reports from this regional data.

Performance Numbers

I tried to measure performance gains simulating the above scenario on my laptop. Assume two tables in every regional server, a. customers: recording information about the customers, b. orders: recording purchases made by customers. A central server has the partitioned table 'all_customers' and 'all_orders'. Customer id in every region has the higher two digits corresponding to the region (e.g. 01, 02, 03, ...). Thus partitioning tables by customer id partitions those by region. (Depending upon the other queries, we may choose to create 'region' as a column in all the required tables and then partition the tables in the central server by region.). I tried a query which produces the names and other details of top 10 consumers. I had four partitions each with 1 million customers, and for each customer up to 10 purchases. The real data would have lot more customers and transactions per customer than that and accordingly the scale up would be better.

SELECT count(o.prod_id) sale_count, -- sum(price) would be more appropriate
c.id -- add c.name, c.email etc. here
FROM all_customers c, all_orders o
WHERE c.id = o.cust_id
GROUP BY c.id
ORDER BY sale_count
LIMIT 10;

Without partition-wise joins, it fetched all the data from all the regions, performed a local join and aggregation. The query took 176 seconds to complete.

With partition-wise join, and postgres_fdw join pushdown but without partition-wise aggregation, the query took 155 seconds (a 12% improvement).

With partition-wise join, partition-wise aggregation, postgres_fdw join pushdown and aggregate pushdown, it took 62 seconds, (a 65% improvement). In this case, aggregation produced total 4 million rows from each region and each query on the foreign server took approximately 15 seconds. Extrapolating that, we can say that a query which takes hours to run would now require minutes to run.

If we were to have asynchronous query, each foreign server can be queries in parallel, thus completing the query in approximately 17-20 seconds (an 88% improvement). If we could push SORT and LIMIT down to the foreign server, we would be fetching 10 rows instead of 4 million, thus improving the query further.

If the company requires real-time decision making about promotions targeting high-profile consumers, or product distribution and so on, it could do so using a combination of partitioning and FDWs, as compared to non real-time dump-then-load-then-analyze approach.

Patches under development

All the query optimizations we talked about are either available in PostgreSQL or are under development and will be available with next few releases. I have proposed partition-wise join optimization to the community here. It’s being reviewed by Robert Haas. Jeevan Chalke is working on partition-wise aggregation, which will soon be proposed to the community. Asynchronous query is being discussed on hackers here.

Ashutosh Bapat is a Developer at EnterpriseDB.

This post originally appeared on Ashutosh's personal blog.

Resource Feature Callout 1

Combining Partitioning and FDWs for Real Time Analytics