How to improve Row Estimates with Expression Indexes in Postgres

January 20, 2023

In my previous blog post, I showed how statistics generated on expression indexes can be used to produce more accurate row counts and potentially better plans. While I did show more accurate row counts via explain, I did not show changed query plans. I plan to do so in this blog post.

First, the setup:

CREATE TABLE test1 AS
        SELECT * FROM generate_series(1, 100000) AS f(x);
CREATE TABLE test2 AS
        SELECT * FROM generate_series(1, 2) AS f(x);

ANALYZE test1;
ANALYZE test2;

 

Then a join query using modulus with no expression index:

EXPLAIN SELECT test1.x
FROM test1, test2
WHERE test1.x % 2 = 1 AND test1.x = test2.x;
                           QUERY PLAN
-----------------------------------------------------------------
 Nested Loop  (cost=0.00..1959.02 rows=1 width=4)
   Join Filter: (test1.x = test2.x)
   ->  Seq Scan on test1  (cost=0.00..1943.00 rows=500 width=4)
         Filter: ((x % 2) = 1)
   ->  Materialize  (cost=0.00..1.03 rows=2 width=4)
         ->  Seq Scan on test2  (cost=0.00..1.02 rows=2 width=4)

 

A nested loop join is used, which is suboptimal because the row count for test1 is one hundred times too small.

With proper statistics on the modulus operation on test1.x, a more efficient hash join is used:

CREATE INDEX i_test1 ON test1((x % 2));
ANALYZE test1;
ANALYZE test2;

EXPLAIN SELECT test1.x
FROM test1, test2
WHERE test1.x % 2 = 1 AND test1.x = test2.x;
                            QUERY PLAN
------------------------------------------------------------------
 Hash Join  (cost=1.04..2132.29 rows=1 width=4)
   Hash Cond: (test1.x = test2.x)
   ->  Seq Scan on test1  (cost=0.00..1943.00 rows=50197 width=4)
         Filter: ((x % 2) = 1)
   ->  Hash  (cost=1.02..1.02 rows=2 width=4)
         ->  Seq Scan on test2  (cost=0.00..1.02 rows=2 width=4)

 

Notice the test1 row count is now much more accurate, and that analyzing the base table also analyzes the expression index. The total cost is now slightly higher (2132.29 vs. 1959.02), but that is not because the hash join is more expensive. Rather, it is because the nested loop misestimated how many rows it would need to process because it didn't know the selectivity of the modulus operation.

One thing I learned in researching this blog post is how much the optimizer "loves" hash joins. If test2 has three or more rows, or if test1 has ten times more rows and parallelism is enabled, a hash join is used even without expression index statistics. Hash joins are very robust despite misestimation so they are favored by the optimizer. The takeaway is that the creation of expression indexes for statistical purposes is recommended only if testing shows they actually improve query plans, i.e. improving explain row counts alone has little benefit.

 

This post originally appeared on Bruce's personal blog

 

Share this

Relevant Blogs

More Blogs