Postgres vs. File Systems: A Performance Comparison

October 10, 2022

One of the guiding Postgres design principles is a heavy reliance on features provided by the environment (particularly the operating system), and file systems are a prime example of this. Unlike some other databases, Postgres has never supported raw devices, i.e. the ability to store data on block devices without creating a regular file system first. That would require implementing a “custom” file system, which might be tailored to the database’s needs (and thus faster), but it would also require a significant investment of development time (e.g. to support different platforms, adjust to ever-evolving hardware, etc.).

As development time is an extremely precious resource (doubly so for a small development team in the early stages of an open-source project), the rational choice was to expect the operating system to provide a sufficiently good general-purpose file system, and to focus on implementing database-specific features with high added value for users.

So you can run Postgres on many file systems, which raises the question: Are there any significant differences between them? Does it even matter which file system you pick? Exploring these questions is the aim of this analysis.

I’ll look at a couple common file systems on Linux—both traditional (ext4/xfs) and modern (zfs/btrfs) ones, run an OLTP benchmark (pgbench) on SSD devices in different configurations, and present the results along with a basic analysis.

Note: For OLTP, it’s not really practical (or cost-effective) to use traditional disks. The results might be different and perhaps interesting, but ultimately useless.

Setup

I’ve used my two “usual” machines with different hardware configurations: a smaller one with SATA SSDs and a bigger one with an NVMe SSD.

i5

  • Intel i5-2500K (4 cores)
  • 8GB RAM
  • 6x Intel DC S3700 100GB (SATA SSD)
  • PG: shared_buffers = 1GB, checkpoint_timeout = 15m, max_wal_size = 64GB
  • PG (zfs/tuned runs): full_page_writes = off, wal_init_zero = off, wal_recycle = off

xeon

  • 2x Intel e5-2620v3 (16 cores / 32 threads)
  • 64GB RAM
  • 1x WD Gold SSD 960GB (NVMe)
  • PG: shared_buffers = 8GB, checkpoint_timeout = 15m, max_wal_size = 128GB
  • PG (zfs/tuned runs): full_page_writes = off, wal_init_zero = off, wal_recycle = off

Both machines are running kernel 5.17.11 and zfs 2.1.4 or 2.1.5.
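
To make the configuration explicit, here is a rough sketch of how these settings could be applied (an illustration, not the actual benchmark scripts; it assumes a superuser connection and that PGDATA points at the data directory). Values shown are the xeon ones; the i5 machine used shared_buffers = 1GB and max_wal_size = 64GB, and the last three settings were used only for the “zfs/tuned” runs:

    # Common settings (xeon values; adjust for the i5 machine).
    psql -c "ALTER SYSTEM SET shared_buffers = '8GB'"
    psql -c "ALTER SYSTEM SET checkpoint_timeout = '15min'"
    psql -c "ALTER SYSTEM SET max_wal_size = '128GB'"

    # Extra settings applied only for the "zfs/tuned" runs.
    psql -c "ALTER SYSTEM SET full_page_writes = off"
    psql -c "ALTER SYSTEM SET wal_init_zero = off"
    psql -c "ALTER SYSTEM SET wal_recycle = off"

    # shared_buffers only takes effect after a restart
    # (PGDATA is assumed to point at the data directory).
    pg_ctl -D "$PGDATA" restart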

On the i5 machine we can also test different RAID configurations. Both zfs and btrfs allow using multiple devices directly; for ext4/xfs we can use mdraid. The names (and implementations) of the RAID levels differ a bit, so here’s a rough mapping:

name (data disks)   mdraid   btrfs    zfs
striping (N)        raid0    raid0    striping
mirroring (1)       raid1    raid1*   mirroring
raid10 (N/2)        raid10   raid10   mirroring + striping
raid5 (N-1)         raid5    raid5    raidz1
raid6 (N-2)         raid6    raid6    raidz2
raid7 (N-3)         -        -        raidz3

Note: The number in parentheses in the first column is the number of “data-bearing” disks. For example, with striping all N disks are used to store data. With raid5, one of the drives stores parity information, so we only get (N-1) data disks. With raid10, every piece of data is stored twice, so the capacity is N/2 disks. With the 6 disks on the i5 machine, that means 6 data disks for striping, 5 for raid5, 4 for raid6 and 3 for raid10.

Note: btrfs defines raid1 a bit differently from mdraid, as it means “2 copies” and not “N copies”, which makes it more like raid10 than mirroring. Keep this in mind when interpreting the results/charts.

Note: btrfs implements raid5/6, but this support is considered experimental. It’s been like this for years, and since the recent changes seem aimed more at discouraging people from using btrfs RAID5/6 than at fixing the underlying issues, I’m not holding my breath. It’s available, though, so let’s test it and keep this caveat in mind.

For a more detailed description of the various RAID levels, see here.

Some basic file system tuning was performed to ensure proper alignment, correct stripe/stride values and suitable mount options (rough command sketches follow below). For ZFS, the tuning is a bit more extensive and sets the following for the whole pool:

  • recordsize=8K
  • compression=lz4
  • atime=off
  • relatime=on
  • logbias=latency
  • redundant_metadata=most
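
In case you want to replicate the pool setup, the list above roughly corresponds to these commands (a sketch; “tank” is a placeholder pool name, the actual scripts on GitHub are authoritative):

    # Apply the pool-wide properties listed above ("tank" is a placeholder pool name).
    zfs set recordsize=8K tank
    zfs set compression=lz4 tank
    zfs set atime=off tank
    zfs set relatime=on tank
    zfs set logbias=latency tank
    zfs set redundant_metadata=most tank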

All the scripts (and results) from this benchmark are available on my GitHub. This includes the setup of the RAID devices/pools, mounting, etc.
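
To give an idea what the ext4 part of that setup might look like, here is a rough sketch for the striping (RAID0) case on the i5 machine; the device names, chunk size and mount point are assumptions, so treat the scripts as the source of truth:

    # mdraid RAID0 over the six SATA SSDs (device names are placeholders).
    mdadm --create /dev/md0 --level=0 --raid-devices=6 --chunk=512 /dev/sd[b-g]

    # ext4 aligned to the array: stride = chunk / block size = 512KB / 4KB = 128,
    # stripe-width = stride * 6 data disks = 768.
    mkfs.ext4 -E stride=128,stripe-width=768 /dev/md0

    # Mount with noatime (one of the basic mount-option tweaks).
    mount -o noatime /dev/md0 /mnt/pgdata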

Note: If you have suggestions for other mount options (or any other optimization ideas), let me know. I picked the optimizations that I think matter the most, but maybe I was wrong and some of those options would make a big difference.


i5 (6 x 100GB SATA SSD)

Let’s look at the smaller machine first, comparing the file systems on multiple devices in various RAID configurations. These are simple “total throughput” numbers from sufficiently long pgbench runs (read-only: 15 minutes, read-write: 60 minutes) at different scales.
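
The pgbench invocations were roughly along these lines (the client/job counts and database name below are illustrative placeholders, not values taken from this benchmark):

    # Initialize the data set (scale 250 is ~4GB, scale 5000 is ~75GB).
    pgbench -i -s 5000 testdb

    # Read-only run: 15 minutes of SELECT-only transactions, per-second progress.
    pgbench -S -c 16 -j 8 -T 900 -P 1 testdb

    # Read-write run: 60 minutes of the default TPC-B-like transactions.
    pgbench -c 16 -j 8 -T 3600 -P 1 testdb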


read-only

For read-only runs, the results are pretty even, both when the data fits into RAM (scale 250 is ~4GB) and when it gets much larger (scale 5000 is ~75GB).

[Charts: read-only throughput on i5, scales 250 and 5000]

The main difference seems to be that zfs is consistently slower—I don’t know why exactly, but considering this affects even the case when everything fits into RAM (or ARC), my guess is this is due to using more CPU than the “native” Linux file systems. This machine only has 4 cores, so this may matter.

raid      zfs/default   zfs/tuned    btrfs     ext4      xfs
mirror          69531       70153   101083    99590    97654
raidz1          70339       70673   100078    98959    99992
raidz2          71367       70145    99138   100893   100689
raidz3          70126       70717        -        -        -
raid10          70747       71381    99963   101576   101205
stripe          70866       70383   101327   101024    97265

Table: Results for i5/scale 250/read-only (tps)

raid      zfs/default   zfs/tuned    btrfs     ext4      xfs
mirror          22983       23080    34806        -        -
raidz1          21617       21785    34889    39680    40401
raidz2          21645       21796    34729    39566    39979
raidz3          21682       21662        -        -        -
raid10          22647       22585    35005    41899    41638
stripe          22720       22628    34906    40145    40528

Table: Results for i5/scale 5000/read-only (tps)

read-write

For read-write tests, the results are much less even, for all dataset sizes.

[Charts: read-write throughput on i5, scales 250 and 5000]

There are a couple interesting observations. Firstly, ext4/xfs clearly win, at least when it comes to total throughput (we’ll look at other stuff in a minute). Secondly, for the small data set ZFS is almost as fast as ext4/xfs, which is great—this covers a lot of practical use cases, because even if you have a lot of data, chances are you only access a tiny subset.

The unfortunate observation is that btrfs clearly underperforms, for some reason. This is particularly visible on the smaller data set, where btrfs tops out at ~5000 tps while the other file systems achieve 2-3x higher throughput.

Note: This is the place to remember that btrfs defines raid1 differently, which means the “good” result for the larger data set is not that impressive: instead of 6 copies, btrfs only keeps 2 (and thus has 3x as many data disks as the other file systems).

raid      zfs/default   zfs/tuned    btrfs     ext4      xfs
mirror           6386        7381     4839     9596     9420
raidz1          10297       11088     3393    11650    11835
raidz2           9255       10076     2898    11035    11433
raidz3           8109        8995        -        -        -
raid10          12348       12381     4857    14914    14652
stripe          13618       13861     5020    17335    16881

Table: Results for i5/scale 250/read-write (tps)

raid      zfs/default   zfs/tuned    btrfs     ext4      xfs
mirror            889        1077     1982        -        -
raidz1           1967        2470     1564     3333     3487
raidz2           1760        2141     1436     2492     2683
raidz3           1519        1815        -        -        -
raid10           1933        2370     2013     4491     4502
stripe           3064        3917     2199     5826     5891

Table: Results for i5/scale 5000/read-write (tps)

xeon (1x NVMe SSD)

Now let’s look at the “bigger” machine, which however has only a single NVMe SSD device. That means we can’t test the various RAID levels, so the results fit into two simple charts.

[Charts: read-only and read-write throughput on xeon, by scale]

scale     zfs/default   zfs/tuned    btrfs     ext4      xfs
100            494854      488757   486214   488456   498001
1000           353343      356491   424452   425123   422971
10000           93711       92454   125503   139685   143471

Table: xeon read-only results (tps)

scale     zfs/default   zfs/tuned    btrfs     ext4      xfs
100             61028       64057    33352    85291    85765
1000            32568       36066    27952    58736    58991
10000            3822        6068     6541    30495    30977

Table: xeon read-write results (tps)

For the read-only tests, the results are (again) pretty even: ZFS is a bit slower than the other file systems, particularly at the larger scales, similar to what we saw on the smaller machine.

For read-write tests, the differences are much more dramatic. EXT4/XFS are the clear winners. ZFS keeps pace on the two smaller data sets (which fit into shared buffers / RAM), but at scale 10000 it drops to only ~6k tps (compared to ~30k tps for ext4/xfs). Btrfs performs quite poorly: on the medium and large data sets it’s somewhat competitive with ZFS, but on the small one it achieves only about half the ZFS throughput, and it never gets close to ext4/xfs.


Throughput over time

There’s one important thing the throughput results presented so far ignore—stability of the throughput, i.e. how it evolves over time. Imagine two systems running the same benchmark:

  • system A does 15k tps consistently, with minimal variation during the benchmark run
  • system B does 30k tps in the first half, and then gets stuck and does 0 tps

Both systems do 15k tps on average, so they would look identical in the charts presented so far. But I assume we’d agree that system A has much more consistent and predictable performance.

So let’s look not just at the total throughput, but also at how throughput develops over time. The charts in this section show throughput for each second during a 60-minute read-write benchmark (blue), with a running average over 15-second intervals (red).

This should give you some idea of how much the throughput fluctuates, and also reveal trends/patterns (degradation over time, impact of checkpoints, …).
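
If you want to produce similar charts for your own system, one simple approach (an assumption on my part, not necessarily how these charts were produced) is to capture pgbench’s per-second progress output and smooth it with a running average:

    # pgbench -P 1 prints per-second lines to stderr, e.g.:
    #   progress: 10.0 s, 10517.5 tps, lat 1.521 ms stddev 0.893
    pgbench -c 16 -j 8 -T 3600 -P 1 testdb 2> progress.log

    # Extract per-second tps and compute a 15-second running average.
    grep '^progress:' progress.log | awk '{
        win[NR % 15] = $4            # 4th field is the tps value
        n = (NR < 15) ? NR : 15      # window size so far
        sum = 0
        for (i in win) sum += win[i]
        print NR, $4, sum / n        # second, per-second tps, running average
    }'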


i5 / scale 1000

All results in this section are from the RAID0/striping setup. For scale 1000 (which exceeds RAM, but is small compared to max_wal_size), we get this:

[Charts: per-second throughput with 15-second running average, i5 / scale 1000, for each file system]

It’s immediately obvious that ZFS has exceptionally clean and consistent behavior: with the default config there’s some checkpoint impact, but disabling FPW makes that go away. As a DBA, this is what you want to see on your systems—minimal differences (jitter) during the whole benchmark run.

EXT4/XFS achieve higher throughput (~7.5k tps vs. ~6.5k tps, i.e. about 15% higher), but the jitter is clearly much higher. The per-second throughput varies roughly between 5k and 9k tps—not great, not terrible. There’s also a clear checkpoint impact, and we can’t disable FPW to eliminate it—we can only make checkpoints less frequent by increasing the timeout / WAL size.

BTRFS has more issues, though—the throughput is lower than with ZFS while the jitter is much worse than with EXT4/XFS. On average (the red line) it’s not that bad, but it regularly drops close to 0, which implies high latency for some transactions.


i5 / scale 5000

Now let’s look at the “large” data set (much larger than RAM).

[Charts: per-second throughput with 15-second running average, i5 / scale 5000, for each file system]

The first observation is that the impact of checkpoints is much less visible, both for ZFS with the default config and for EXT4/XFS. This is due to the data set size and the random access pattern: almost every modified page needs a full-page image, so the volume of FPWs never tapers off before the WAL limit is reached (which starts the next checkpoint).

For BTRFS, the situation is much worse than on the medium (scale 1000) data set: not only is the jitter even worse, but the throughput gradually drops over time—it starts close to 4000 tps and ends at only about 2000 tps.


xeon / scale 1000

Now let’s look at throughput on the larger machine, with a single NVMe device. On the medium scale (larger than shared buffers, fits into RAM) it looks like this:

[Charts: per-second throughput with 15-second running average, xeon / scale 1000, for each file system]

On ZFS with default configuration, the checkpoint pattern is clearly visible. With the tuned configuration, the pattern disappears and the behavior gets much more consistent. Not as smooth as on the smaller machine with multiple devices, but better than the other file systems.

There’s very little difference between EXT4 and XFS, both in total throughput and in behavior over time. The total throughput is higher than with ZFS (~60k vs. ~40k tps), but the jitter is more severe.

For BTRFS, the overall throughput is fairly low (~30k tps), while the jitter is in some ways better and in some ways worse than for EXT4/XFS: the per-second throughput mostly stays closer to the average, but it also regularly drops to 0.


xeon / scale 10000

On the large (~150GB) data set, the results look like this:

[Charts: per-second throughput with 15-second running average, xeon / scale 10000, for each file system]

This is similar to what we saw on the smaller machine—the checkpoint pattern disappears, due to the amount of FPWs that have to be written to WAL.

On ZFS with the tuned configuration, the jitter increases significantly. I’m not sure why, but I’d say it’s still fairly consistent (compared to the other file systems).

For EXT4/XFS, the throughput drops to ~50% (compared to the medium scale) while the jitter remains about the same—I’d argue this is a pretty good result.

With BTRFS we see the same unfortunate behavior as on the smaller machine: with the large data set the throughput gradually decreases, with significant jitter.


Conclusions

So, what have we learned? The first observation is that there are significant differences between the evaluated file systems—it’s hard to pick a clear winner, though.

The traditional file systems (EXT4/XFS) perform very well in OLTP workloads, at least in terms of total throughput. ZFS performs pretty well too—it’s a bit slower in throughput, but its behavior is very consistent, with minimal jitter (particularly with full-page writes disabled). It’s also quite flexible—for example, it allows moving the ZIL (intent log) to a separate device, which should further improve throughput.
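
Attaching a fast log device (SLOG) to an existing pool is a one-line operation; a sketch, with the pool name and device as placeholders:

    # Add a dedicated low-latency device as a separate ZFS intent log (SLOG).
    # "tank" and /dev/nvme1n1 are placeholders for your pool and spare device.
    zpool add tank log /dev/nvme1n1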

As for BTRFS, the results are not great. I did a similar OLTP benchmark a couple of years ago, and this time BTRFS actually performed a bit better. However, the overall consensus seems to be that BTRFS is not particularly well suited for databases, and others have observed this too. Which is a bit unfortunate, as some of its features (higher resilience, easy snapshotting) are very useful for databases.

The last thing to keep in mind when reading the results is that these are stress tests, fully saturating the I/O subsystem. That’s not what happens on most production systems—once the I/O stays saturated for extended periods of time, you’d already be thinking about upgrading the system to reduce the storage load. This typically means the latencies on real systems should be better than what’s shown here.
