Performance of parallel index scans in PostgreSQL

July 24, 2018

This blog continues the discussion of parallel query in PostgreSQL. In the previous blog of this series, we learned about parallel index scans, their design in PostgreSQL, and the performance improvement they achieved for a few queries of the industry-standard TPC-H benchmark. There, we analyzed the improvement only at a small scale factor of 20 (a database of approximately 20 GB). However, the benefits of parallel operators do not become truly significant until they are exercised at higher scale factors. Hence, in this blog, we analyze the performance of parallel index scans at scale factor 300.

For this experiment we used a TPC-H inspired benchmark for PostgreSQL at scale factor 300, which produces a database of 300+ GB depending on the available indexes, etc. The additional indexes we created were on the columns (l_shipmode, l_shipdate, o_orderdate, o_comment). We tuned the following parameters (a sketch of the corresponding setup statements follows the list):

  • random_page_cost = 0.1
  • seq_page_cost = 0.1
  • effective_cache_size = 10GB
  • shared_buffers = 10GB
  • work_mem = 1GB
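
A minimal sketch of this setup, assuming index names of the form idx_<table>_<column> (the exact names and commands used in the experiment are an assumption); note that shared_buffers only takes effect after a server restart:

    -- Additional indexes on the TPC-H schema (names are assumptions)
    CREATE INDEX idx_lineitem_shipmode ON lineitem (l_shipmode);
    CREATE INDEX idx_lineitem_shipdate ON lineitem (l_shipdate);
    CREATE INDEX idx_orders_orderdate  ON orders (o_orderdate);
    CREATE INDEX idx_orders_comment    ON orders (o_comment);

    -- Planner and memory settings listed above
    ALTER SYSTEM SET random_page_cost = 0.1;
    ALTER SYSTEM SET seq_page_cost = 0.1;
    ALTER SYSTEM SET effective_cache_size = '10GB';
    ALTER SYSTEM SET shared_buffers = '10GB';   -- restart required to take effect
    ALTER SYSTEM SET work_mem = '1GB';
    SELECT pg_reload_conf();                    -- applies the reloadable settings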

In the table below, we compare the performance of the queries that use parallel index scans in v10 to their performance in v9.6. Note that parallel sequential scans and parallel nested loop joins were already available in v9.6. Additionally, we tabulate the total contribution of the parallel index scan to each query, which is the execution time of the parallel index scan node in the EXPLAIN ANALYZE output of the respective query. All timings are in seconds.

TPC-H query | On v9.6 (s) | On v10 (s) | Contribution of PIS (s)
Q6          |         883 |        155 |                     137
Q12         |         740 |        717 |                     122
Q14         |         422 |        111 |                      53
Q15         |        2295 |       1138 |                     157
Q17         |       17724 |       3376 |                      16

From the above table, it is clear that the benefits of parallel index scans are significant on large volumes of data as well. To better understand how this scan method benefits the queries in question, we studied their query plans on both versions and found two primary explanations for the gains:

  • Contribution of the parallel index scan to the total execution time of the query: If the most time-consuming operation in a query is the index scan, then parallelizing it is likely to improve query performance significantly, e.g. Q6 and Q12. To be more precise, have a look at the query plan of Q6:

        Q6 Query plan on v9.6
        ———————————
        Limit
          ->  Aggregate
                ->  Bitmap Heap Scan on lineitem
                      ->  Bitmap Index Scan on idx_lineitem_shipdate
        Execution time: 882812.376 ms

        Q6 Query plan on v10
        ———————————
        Limit
          ->  Finalize Aggregate
                ->  Gather
                      ->  Partial Aggregate
                            ->  Parallel Index Scan using idx_lineitem_shipdate on lineitem
        Execution time: 155619.579 ms

Here, the scan is the only time-consuming operator, hence parallelizing it improves the performance of the query. Similarly, for Q12, the index scan is on lineitem, which is the largest table in the database. Once it is scanned in parallel, the join above it can also be done in parallel, hence the improvement in performance.
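
To make this concrete, below is a sketch of Q6 as it appears in the standard TPC-H template (parameter values are placeholders; the benchmark kit substitutes its own). Running it under EXPLAIN ANALYZE yields a plan like the v10 one shown above, and the "Contribution of PIS" figure in the table is the actual time reported on the Parallel Index Scan node of that output:

    EXPLAIN (ANALYZE)
    SELECT sum(l_extendedprice * l_discount) AS revenue
    FROM lineitem
    WHERE l_shipdate >= DATE '1994-01-01'
      AND l_shipdate <  DATE '1994-01-01' + INTERVAL '1 year'
      AND l_discount BETWEEN 0.06 - 0.01 AND 0.06 + 0.01
      AND l_quantity < 24;

    -- In the output, look for the line
    --   ->  Parallel Index Scan using idx_lineitem_shipdate on lineitem (actual time=...)
    -- The upper bound of that node's actual time (in ms) is what the table
    -- above reports, converted to seconds.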

On the other hand, the benefits are not as pronounced when the plan is more complex and involves a number of operators, with the index scan being one of the less significant operators in terms of the total execution time of the query.

  • Operators above it can also leverage parallelism: In some queries, we observed that even when the parallel index scan does not contribute much to the total execution time, the benefits are still significant, e.g. Q15 and Q17. In such cases, the benefits do not come from the parallel index scan alone but from the fact that more operators can now be pushed to workers, extending parallelism to higher levels of the plan. To explain this observation, the query plan for Q15 is used:

        Q15 Query plan on v9.6
        —————————————
        Limit
          InitPlan 1 (returns $0)
          ->  Merge Join
                Merge Cond: (lineitem.l_suppkey = supplier.s_suppkey)
                ->  GroupAggregate
                      Group Key: lineitem.l_suppkey
                      ->  Sort
                            ->  Bitmap Heap Scan on lineitem
                                  ->  Bitmap Index Scan on idx_l_shipdate
                ->  Index Scan using supplier_pkey on supplier
        Execution time: 2297222.717 ms

        Q15 Query plan on v10
        —————————————
        Limit
          InitPlan 1 (returns $0)
          ->  Nested Loop
                ->  Finalize GroupAggregate
                      Group Key: lineitem.l_suppkey
                      ->  Gather Merge
                            ->  Sort
                                  ->  Partial HashAggregate
                                        Group Key: lineitem.l_suppkey
                                        ->  Parallel Index Scan using idx_lineitem_shipdate on lineitem
                ->  Index Scan using supplier_pkey on supplier
        Execution time: 1133723.985 ms


In this case, the plan structure changed altogether with the introduction of the parallel index scan. With the parallel index scan, the aggregates are pushed to the workers and the join method also changes, eventually improving query performance. This shows that the parallel index scan need not be the most time-consuming operator to improve query performance; rather, having it may allow other operators to use parallelism that they otherwise could not.
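
Whether the planner considers such parallel plans at all is governed by a handful of settings. A minimal sketch follows, showing the PostgreSQL 10 parallel-query parameters most relevant to parallel index scans with their default values; these are illustrative and not necessarily the values used in this experiment:

    SET max_parallel_workers_per_gather = 2;     -- workers a Gather / Gather Merge node may launch
    SET max_parallel_workers = 8;                -- cluster-wide cap on parallel workers
    SET min_parallel_index_scan_size = '512kB';  -- smallest index considered for a parallel scan
    SET parallel_setup_cost = 1000;              -- planner cost charged for launching workers
    SET parallel_tuple_cost = 0.1;               -- planner cost per tuple passed from worker to leader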

Overall, we saw that parallel index scans deliver significant benefits on large volumes of data as well, taking PostgreSQL a step further towards being a more OLAP-friendly and scalable database. Stay tuned to learn about more groundbreaking features in PostgreSQL.
