Lexical Scanning in PostgreSQL

August 20, 2020

Like in the parable of the Blind Men and an Elephant, Postgres code can look like many different applications of computer science depending on where you look: operating systems, transactions, distributed computing, linear programming, information storage and retrieval, etc. In a number of places it also looks like a compiler front end.

One might think that lexical scanning (i.e. turning a text stream into a sequence of tokens to be fed into the parser) is a straightforward problem solved many years ago, but in Postgres there are plenty of interesting challenges, from an engineering as well as an organizational point of view.

For one, the Postgres codebase has nine ".y" Bison files and eleven ".l" Flex files, which cover a multitude of languages and formats, including replication commands, JSON Path expressions, the bootstrap commands used by initdb, and the postgresql.conf configuration file. That list doesn’t even count features for which there exist hand-written parsers and scanners: JSON, pg_hba.conf, etc.

Even considering only the SQL language, there are no fewer than four areas of functionality that must scan SQL in different contexts:

  1. Ordinary SQL queries
  2. PL/pgSQL functions containing SQL
  3. Detecting the end of a SQL statement within the psql client
  4. Preprocessing C/C++ language files containing embedded SQL (called ECPG in Postgres)

SQL and PL/pgSQL have shared the same (so-called "core") scanner since PG 9.0, but psql and ECPG have their own scanners, which must match the core scanner in recognition ability, leading to duplicate code that must be maintained separately.

For another, SQL is a huge and still evolving language with a galaxy of corner cases and historical baggage. The SQL standards committee doesn’t have a reference implementation, so it’s anybody’s guess whether some new syntax is parseable without ugly hacks. It doesn’t help that Postgres’ dialect goes above and beyond the standard with many non-standard extensions and relaxed restrictions that make the task even more difficult.

To pick an example relevant to the topic of scanning, this query must throw a syntax error:


select 'foo' 'bar';
ERROR:  syntax error at or near "'bar'"
LINE 1: select 'foo' 'bar';
                     ^

…but this one must concatenate the strings:


select 'foo'  -- comment here
    -- newline and more comments
'bar';
 ?column?
----------
 foobar
(1 row)

…but only SQL-style comments are allowed to split a string, so this must be a syntax error:


select 'foo'
/* C-style comment */
'bar';
ERROR:  syntax error at or near "'bar'"
LINE 3: 'bar';
        ^

Recent improvements

For PG 12, I authored a patch that changed the data structure the core scanner uses to recognize SQL keywords into something more cache-friendly. This resulted in a 4% speed increase in a raw parsing microbenchmark (Flex + Bison, excluding further parse analysis). That’s not terribly exciting in itself, but it paved the way for people smarter than me to replace the binary search with a perfect hash function, upping that to a 20% speed increase over PG 11, which is nice, and leading to some interesting discussion on Hacker News.
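
To give a flavor of the cache-friendliness idea, here is a minimal sketch of keyword lookup over one contiguous string of keywords plus an array of 16-bit offsets, rather than an array of per-keyword string pointers. The names and layout are made up for this sketch and are simplified relative to the actual Postgres code:

#include <stdint.h>
#include <string.h>

/*
 * Illustrative only: all keyword strings are stored end to end in one
 * buffer, with a small array of 16-bit offsets into it.  These names are
 * hypothetical and do not match the actual Postgres structures.
 */
typedef struct DemoKeywordList
{
    const char     *kw_string;      /* all keywords, NUL-separated, sorted */
    const uint16_t *kw_offsets;     /* start of each keyword in kw_string */
    int             num_keywords;
} DemoKeywordList;

/*
 * Binary search; returns the keyword's index, or -1 if not a keyword.
 * The caller is assumed to have already lower-cased the word.
 */
static int
demo_keyword_lookup(const char *word, const DemoKeywordList *list)
{
    int         low = 0;
    int         high = list->num_keywords - 1;

    while (low <= high)
    {
        int         mid = (low + high) / 2;
        const char *kw = list->kw_string + list->kw_offsets[mid];
        int         cmp = strcmp(word, kw);

        if (cmp == 0)
            return mid;
        if (cmp < 0)
            high = mid - 1;
        else
            low = mid + 1;
    }
    return -1;
}

Keeping the strings in one block and the offsets at 16 bits means the hot part of the lookup touches only a few compact arrays; the later perfect-hash work then removed the binary search loop entirely.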

During the PG 13 development cycle, I ran a series of experiments to test the effect of the "-C" family of Flex options, which control the degree of table compression and so allow speed vs. size tradeoffs. According to the Flex manual, in the general case "-Cem" generates the smallest and slowest scanner, and "-CF" the largest and fastest. The core scanner has used "-CF" since version 7.3 based on this recommendation, but as far as I know that choice hadn’t been revisited recently. CPU and memory architecture have also changed quite a bit since 2002, so I was curious whether the recommendation was still valid. Perhaps unsurprisingly, "-CF" is still the fastest option.
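
For readers curious what "raw parsing" means in these measurements: it is the Flex scanner plus the Bison grammar only, with no parse analysis, planning, or execution. The sketch below shows roughly how one might time it from inside the backend, for example from a throwaway test function; it is illustrative rather than the exact harness behind the numbers quoted here, and it assumes the single-argument raw_parser() signature of the PG 12/13 era:

#include "postgres.h"

#include "parser/parser.h"
#include "portability/instr_time.h"

/*
 * Rough, illustrative sketch: time raw parsing (Flex + Bison only) of a
 * query string.  Not the actual harness used for the measurements above.
 */
static double
time_raw_parsing(const char *query, int iterations)
{
    instr_time  start;
    instr_time  elapsed;

    INSTR_TIME_SET_CURRENT(start);
    for (int i = 0; i < iterations; i++)
    {
        /* parse trees are thrown away; they live in the current memory context */
        (void) raw_parser(query);
    }
    INSTR_TIME_SET_CURRENT(elapsed);
    INSTR_TIME_SUBTRACT(elapsed, start);

    return INSTR_TIME_GET_MILLISEC(elapsed);
}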

Back in version 8.1, a policy was added to Postgres that required the core scanner to be free of backtracking. At the time it was noted that this made scanning 1/3 faster. I tested removing the Flex rules that are there simply to avoid backtracking, and found two baffling facts:

First, the resulting binary was 164 kB smaller. The reason was apparent when looking at the yy_transition array in the output C files before:

struct yy_trans_info
{
    flex_int32_t yy_verify;
    flex_int32_t yy_nxt;
};
static yyconst struct yy_trans_info yy_transition[37045] = ...

…and after:

struct yy_trans_info
{
    flex_int16_t yy_verify;
    flex_int16_t yy_nxt;
};
static yyconst struct yy_trans_info yy_transition[31763] = ...

The data type used here depends on the size of the array: if the number of elements exceeds the maximum value of a 16-bit signed integer (32,767), Flex apparently must fall back to 32-bit integers. The extra rules needed to avoid backtracking are enough to push the array past that threshold (37,045 elements versus 31,763), doubling the size of every element.

Second, when I measured raw parsing (which, remember, includes both Flex and Bison), I saw either no difference or, in some cases, only a 3% regression when using backtracking. Beyond changes in computer architecture, and possibly in Flex itself, over the intervening 15 years, one reason for the small difference is that on modern machines the lion’s share of raw parsing time is spent in Bison, and profiles do show a lot of cache misses in the Bison code.

In any case, smaller data structures and a smaller binary are always good things to have, so with a bit of help from Tom Lane I authored a patch to reduce the number of state transitions in the core scanner. The point was to ensure the use of 16-bit integers as above while still remaining free of backtracking. This seemed to improve performance slightly, the binary shrank by over 200 kB, and the scanner code among core, psql, and ECPG now matches more closely, reducing the maintenance burden. This will be part of the upcoming PG 13 release.

Possible future work

A lot more could be done in the realm of parsing and scanning in Postgres. Some of the following are pie-in-the-sky projects with uncertain benefit, but still interesting to think about:

  • Use a token queue to enable scanning and parsing tokens in batches, which would make better use of the caches and branch predictor (see the sketch after this list).
  • Make it possible for Postgres extensions to add their own syntax. One could envision a GUC analogous to search_path, containing a list of shared libraries which implement parsers to attempt.
  • Replace Flex with scanners handwritten in C. This would reduce duplication of code and improve performance. This could be made easier by using a generated direct-coded scanner, such as that emitted by re2c, as a starting point. Handwritten scanners are actually quite common in other software projects (including databases), even those that use a parser generator such as Bison. Flex is picky, difficult to debug, and some Postgres constructs (e.g. user-defined operators) are clunky to express with it.
  • Rethink how to preprocess ECPG source files. Currently a Perl script transforms the core grammar into the ECPG grammar as part of the build process. This requires a bespoke scanner that must recognize C plus special ECPG commands, as well as duplicate all the rules from the core scanner. It might be better if the two languages were treated separately, with two parsers that switched back and forth as needed. This is a research project.
  • Teach Bison to emit a direct-coded parser, as opposed to the current table-driven state machine. This would require a massive amount of work and expertise, but would likely result in a large performance gain.
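
To make the token queue idea above a bit more concrete, here is a minimal sketch of the kind of structure that could sit between the scanner and the parser: the scanner fills a batch of tokens in a tight loop, then the parser drains it. Everything here (names, fields, batch size) is hypothetical and not existing Postgres code:

#include <stdbool.h>
#include <stddef.h>

#define TOKEN_BATCH_SIZE 64         /* arbitrary batch size for this sketch */

/* A scanned token: its type, source location, and semantic value. */
typedef struct DemoToken
{
    int         type;               /* token code the grammar expects */
    int         location;           /* byte offset in the query string */
    void       *value;              /* semantic value (identifier, literal, ...) */
} DemoToken;

/* A simple FIFO batch, filled by the scanner and drained by the parser. */
typedef struct DemoTokenQueue
{
    DemoToken   tokens[TOKEN_BATCH_SIZE];
    size_t      head;               /* next token to hand to the parser */
    size_t      tail;               /* next free slot for the scanner */
} DemoTokenQueue;

/* Scanner side: returns false once the batch is full and must be drained. */
static bool
token_queue_push(DemoTokenQueue *q, DemoToken tok)
{
    if (q->tail == TOKEN_BATCH_SIZE)
        return false;
    q->tokens[q->tail++] = tok;
    return true;
}

/* Parser side: returns false once the batch is empty and must be refilled. */
static bool
token_queue_pop(DemoTokenQueue *q, DemoToken *out)
{
    if (q->head == q->tail)
        return false;
    *out = q->tokens[q->head++];
    return true;
}

Scanning a whole batch before handing control to the parser keeps each side’s code and tables hot in the instruction and data caches, instead of bouncing between them one token at a time.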

As you can see, even in an area most users and even developers take for granted, there are plenty of avenues for those motivated to explore.

Takeaways

  • It can be fruitful to simply read source code and be curious about the current state of affairs.
  • It often pays to dig down to a lower abstraction layer. Even good tools and libraries shouldn’t be taken for granted as a black box. The more you learn about the details, the more effective you are as a user.
  • Performance measurements can change drastically in different times and contexts.