Postgres vs. File Systems: A Performance Comparison

October 10, 2022

One of the guiding Postgres design principles is a heavy reliance on features provided by the environment (particularly the operating system), and file systems are a prime example of this. Unlike some other databases, Postgres has never supported raw devices, i.e. the ability to store data on block devices without creating a regular file system first. That would require implementing a “custom” file system, which might be tailored to the database’s needs (and thus faster), but it would also require a significant investment of development time (e.g. to support different platforms, adjust to ever-evolving hardware, etc.).

As development time is an extremely precious resource (doubly so for a small development team in the early stages of an open-source project), the rational choice was to expect the operating system to provide a sufficiently good general-purpose file system, and to focus on implementing database-specific features with high added value for users.

So you can run Postgres on many file systems, which raises the question: Are there any significant differences between them? Does it even matter which file system you pick? Exploring these questions is the aim of this analysis.

I’ll look at a couple common file systems on Linux—both traditional (ext4/xfs) and modern (zfs/btrfs) ones, run an OLTP benchmark (pgbench) on SSD devices in different configurations, and present the results along with a basic analysis.

Note: For OLTP, it’s not really practical (or cost-effective) to use traditional disks. The results might be different and perhaps interesting, but ultimately useless.

Setup

I’ve used my two “usual” machines with different hardware configurations: a smaller one with SATA SSDs and a bigger one with an NVMe SSD.

i5

  • Intel i5-2500K (4 cores)
  • 8GB RAM
  • 6x Intel DC S3700 100GB (SATA SSD)
  • PG: shared_buffers = 1GB, checkpoint_timeout = 15m, max_wal_size = 64GB
  • PG (zfs/tuned runs): full_page_writes = off, wal_init_zero = off, wal_recycle = off

xeon

  • 2x Intel e5-2620v3 (16 cores / 32 threads)
  • 64GB RAM
  • 1x WD Gold SSD 960GB (NVMe)
  • PG: shared_buffers = 8GB, checkpoint_timeout = 15m, max_wal_size = 128GB
  • PG (zfs/tuned runs): full_page_writes = off, wal_init_zero = off, wal_recycle = off

Both machines are running kernel 5.17.11 and zfs 2.1.4 or 2.1.5.
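
To make the configuration explicit, here is a rough sketch of how these settings could be applied (an illustration, not the actual benchmark scripts; it assumes a superuser connection and that PGDATA points at the data directory). Values shown are the xeon ones; the i5 machine used shared_buffers = 1GB and max_wal_size = 64GB, and the last three settings were used only for the “zfs/tuned” runs:

    # Common settings (xeon values; adjust for the i5 machine).
    psql -c "ALTER SYSTEM SET shared_buffers = '8GB'"
    psql -c "ALTER SYSTEM SET checkpoint_timeout = '15min'"
    psql -c "ALTER SYSTEM SET max_wal_size = '128GB'"

    # Extra settings applied only for the "zfs/tuned" runs.
    psql -c "ALTER SYSTEM SET full_page_writes = off"
    psql -c "ALTER SYSTEM SET wal_init_zero = off"
    psql -c "ALTER SYSTEM SET wal_recycle = off"

    # shared_buffers only takes effect after a restart
    # (PGDATA is assumed to point at the data directory).
    pg_ctl -D "$PGDATA" restart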

On the i5 machine we can also test different RAID configurations. Both zfs and btrfs allow using multiple devices directly; for ext4/xfs we can use mdraid. The names (and implementations) of the RAID levels differ a bit, so here’s a rough mapping:

name (data disks)   mdraid   btrfs    zfs
striping (N)        raid0    raid0    striping
mirroring (1)       raid1    raid1*   mirroring
raid10 (N/2)        raid10   raid10   mirroring + striping
raid5 (N-1)         raid5    raid5    raidz1
raid6 (N-2)         raid6    raid6    raidz2
raid7 (N-3)         -        -        raidz3

Note: The number in parentheses in the first column is the number of “data-bearing” disks. For example, with striping all N disks are used to store data. With raid5, one of the drives stores parity information, so we only get (N-1) data disks. With raid10, every piece of data is stored twice, so the capacity is N/2 disks. With the 6 disks on the i5 machine, that means 6 data disks for striping, 5 for raid5, 4 for raid6 and 3 for raid10.

Note: btrfs defines raid1 a bit differently from mdraid, as it means “2 copies” and not “N copies”, which makes it more like raid10 than mirroring. Keep this in mind when interpreting the results/charts.

Note: btrfs implements raid5/6, but this support is considered experimental. It’s been like this for years, and since the recent changes seem aimed more at discouraging people from using btrfs RAID5/6 than at fixing the underlying issues, I’m not holding my breath. It’s available, though, so let’s test it and keep this caveat in mind.

For a more detailed description of the various RAID levels, see here.

Some basic file system tuning was performed to ensure proper alignment, correct stripe/stride values and suitable mount options (rough command sketches follow below). For ZFS, the tuning is a bit more extensive and sets the following for the whole pool:

  • recordsize=8K
  • compression=lz4
  • atime=off
  • relatime=on
  • logbias=latency
  • redundant_metadata=most
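
In case you want to replicate the pool setup, the list above roughly corresponds to these commands (a sketch; “tank” is a placeholder pool name, the actual scripts on GitHub are authoritative):

    # Apply the pool-wide properties listed above ("tank" is a placeholder pool name).
    zfs set recordsize=8K tank
    zfs set compression=lz4 tank
    zfs set atime=off tank
    zfs set relatime=on tank
    zfs set logbias=latency tank
    zfs set redundant_metadata=most tank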

All the scripts (and results) from this benchmark are available on my GitHub. This includes the setup of the RAID devices/pools, mounting, etc.
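
To give an idea what the ext4 part of that setup might look like, here is a rough sketch for the striping (RAID0) case on the i5 machine; the device names, chunk size and mount point are assumptions, so treat the scripts as the source of truth:

    # mdraid RAID0 over the six SATA SSDs (device names are placeholders).
    mdadm --create /dev/md0 --level=0 --raid-devices=6 --chunk=512 /dev/sd[b-g]

    # ext4 aligned to the array: stride = chunk / block size = 512KB / 4KB = 128,
    # stripe-width = stride * 6 data disks = 768.
    mkfs.ext4 -E stride=128,stripe-width=768 /dev/md0

    # Mount with noatime (one of the basic mount-option tweaks).
    mount -o noatime /dev/md0 /mnt/pgdata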

Note: If you have suggestions for other mount options (or any other optimization ideas), let me know. I picked the optimizations that I think matter the most, but maybe I was wrong and some of those options would make a big difference.


i5 (6 x 100GB SATA SSD)

Let’s look at the smaller machine first, comparing the file systems on multiple devices in various RAID configurations. These are simple “total throughput” numbers from sufficiently long pgbench runs (read-only: 15 minutes, read-write: 60 minutes) at different scales.
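
The pgbench invocations were roughly along these lines (the client/job counts and database name below are illustrative placeholders, not values taken from this benchmark):

    # Initialize the data set (scale 250 is ~4GB, scale 5000 is ~75GB).
    pgbench -i -s 5000 testdb

    # Read-only run: 15 minutes of SELECT-only transactions, per-second progress.
    pgbench -S -c 16 -j 8 -T 900 -P 1 testdb

    # Read-write run: 60 minutes of the default TPC-B-like transactions.
    pgbench -c 16 -j 8 -T 3600 -P 1 testdb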


read-only

For read-only runs, the results are pretty even, both when the data fits into RAM (scale 250 is ~4GB) and when it gets much larger (scale 5000 is ~75GB).

[Charts: read-only throughput on i5, scales 250 and 5000]

The main difference seems to be that zfs is consistently slower—I don’t know why exactly, but considering this affects even the case when everything fits into RAM (or ARC), my guess is this is due to using more CPU than the “native” Linux file systems. This machine only has 4 cores, so this may matter.

raid      zfs/default   zfs/tuned    btrfs     ext4      xfs
mirror          69531       70153   101083    99590    97654
raidz1          70339       70673   100078    98959    99992
raidz2          71367       70145    99138   100893   100689
raidz3          70126       70717        -        -        -
raid10          70747       71381    99963   101576   101205
stripe          70866       70383   101327   101024    97265

Table: Results for i5/scale 250/read-only (tps)

raid      zfs/default   zfs/tuned    btrfs     ext4      xfs
mirror          22983       23080    34806        -        -
raidz1          21617       21785    34889    39680    40401
raidz2          21645       21796    34729    39566    39979
raidz3          21682       21662        -        -        -
raid10          22647       22585    35005    41899    41638
stripe          22720       22628    34906    40145    40528

Table: Results for i5/scale 5000/read-only (tps)

read-write

For read-write tests, the results are much less even, for all dataset sizes.

[Charts: read-write throughput on i5, scales 250 and 5000]

There are a couple interesting observations. Firstly, ext4/xfs clearly win, at least when it comes to total throughput (we’ll look at other stuff in a minute). Secondly, for the small data set ZFS is almost as fast as ext4/xfs, which is great—this covers a lot of practical use cases, because even if you have a lot of data, chances are you only access a tiny subset.

The unfortunate observation is that btrfs clearly underperforms, for some reason. This is particularly visible on the smaller data set, where btrfs tops out at ~5000 tps while the other file systems achieve 2-3x higher throughput.

Note: This is the place to remember that btrfs defines raid1 differently, which means the “good” result for the larger data set is not that impressive: instead of 6 copies, btrfs only keeps 2 (and thus has 3x as many data disks as the other file systems).

raid      zfs/default   zfs/tuned    btrfs     ext4      xfs
mirror           6386        7381     4839     9596     9420
raidz1          10297       11088     3393    11650    11835
raidz2           9255       10076     2898    11035    11433
raidz3           8109        8995        -        -        -
raid10          12348       12381     4857    14914    14652
stripe          13618       13861     5020    17335    16881

Table: Results for i5/scale 250/read-write (tps)

raid      zfs/default   zfs/tuned    btrfs     ext4      xfs
mirror            889        1077     1982        -        -
raidz1           1967        2470     1564     3333     3487
raidz2           1760        2141     1436     2492     2683
raidz3           1519        1815        -        -        -
raid10           1933        2370     2013     4491     4502
stripe           3064        3917     2199     5826     5891

Table: Results for i5/scale 5000/read-write (tps)

xeon (1x NVMe SSD)

Now let’s look at the “bigger” machine, which however has only a single NVMe SSD device. That means we can’t test the various RAID levels, so the results fit into two simple charts.

[Charts: read-only and read-write throughput on xeon, by scale]

scale     zfs/default   zfs/tuned    btrfs     ext4      xfs
100            494854      488757   486214   488456   498001
1000           353343      356491   424452   425123   422971
10000           93711       92454   125503   139685   143471

Table: xeon read-only results (tps)

scale     zfs/default   zfs/tuned    btrfs     ext4      xfs
100             61028       64057    33352    85291    85765
1000            32568       36066    27952    58736    58991
10000            3822        6068     6541    30495    30977

Table: xeon read-write results (tps)

For the read-only tests, the results are (again) pretty even: ZFS is a bit slower than the other file systems, particularly at the larger scales, similar to what we saw on the smaller machine.

For read-write tests, the differences are much more dramatic. EXT4/XFS are the clear winners. ZFS keeps pace on the two smaller data sets (which fit into shared buffers / RAM), but at scale 10000 it drops to only ~6k tps (compared to ~30k tps for ext4/xfs). Btrfs performs quite poorly: on the medium and large data sets it’s somewhat competitive with ZFS, but on the small one it achieves only about half the ZFS throughput, and it never gets close to ext4/xfs.


Throughput over time

There’s one important thing the throughput results presented so far ignore—stability of the throughput, i.e. how it evolves over time. Imagine two systems running the same benchmark:

  • system A does 15k tps consistently, with minimal variation during the benchmark run
  • system B does 30k tps in the first half, and then gets stuck and does 0 tps

Both systems do 15k tps on average, so they would look identical in the charts presented so far. But I assume we’d agree that system A has much more consistent and predictable performance.

So let’s look not just at the total throughput, but also at how throughput develops over time. The charts in this section show throughput for each second during a 60-minute read-write benchmark (blue), with a running average over 15-second intervals (red).

This should give you some idea of how much the throughput fluctuates, and also reveal trends/patterns (degradation over time, impact of checkpoints, …).
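
If you want to produce similar charts for your own system, one simple approach (an assumption on my part, not necessarily how these charts were produced) is to capture pgbench’s per-second progress output and smooth it with a running average:

    # pgbench -P 1 prints per-second lines to stderr, e.g.:
    #   progress: 10.0 s, 10517.5 tps, lat 1.521 ms stddev 0.893
    pgbench -c 16 -j 8 -T 3600 -P 1 testdb 2> progress.log

    # Extract per-second tps and compute a 15-second running average.
    grep '^progress:' progress.log | awk '{
        win[NR % 15] = $4            # 4th field is the tps value
        n = (NR < 15) ? NR : 15      # window size so far
        sum = 0
        for (i in win) sum += win[i]
        print NR, $4, sum / n        # second, per-second tps, running average
    }'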


i5 / scale 1000

All results in this section are from the RAID0/striping setup. For scale 1000 (which exceeds RAM, but is small compared to max_wal_size), we get this:

[Charts: per-second throughput with 15-second running average, i5 / scale 1000, for each file system]

It’s immediately obvious that ZFS has exceptionally clean and consistent behavior: with the default config there’s some checkpoint impact, but disabling FPW makes that go away. As a DBA, this is what you want to see on your systems—minimal differences (jitter) during the whole benchmark run.

EXT4/XFS achieve higher throughput (~7.5k tps vs. ~6.5k tps, i.e. about 15% higher), but the jitter is clearly much higher. The per-second throughput varies roughly between 5k and 9k tps—not great, not terrible. There’s also a clear checkpoint impact, and we can’t disable FPW to eliminate it—we can only make checkpoints less frequent by increasing the timeout / WAL size.

BTRFS has more issues, though—the throughput is lower than with ZFS while the jitter is much worse than with EXT4/XFS. On average (the red line) it’s not that bad, but it regularly drops close to 0, which implies high latency for some transactions.


i5 / scale 5000

Now let’s look at the “large” data set (much larger than RAM).

[Charts: per-second throughput with 15-second running average, i5 / scale 5000, for each file system]

The first observation is that the impact of checkpoints is much less visible, both for ZFS with the default config and for EXT4/XFS. This is due to the data set size and the random access pattern: almost every modified page needs a full-page image, so the volume of FPWs never tapers off before the WAL limit is reached (which starts the next checkpoint).

For BTRFS, the situation is much worse than on the medium (scale 1000) data set: not only is the jitter even worse, but the throughput gradually drops over time—it starts close to 4000 tps and ends at only about 2000 tps.


xeon / scale 1000

Now let’s look at throughput on the larger machine, with a single NVMe device. On the medium scale (larger than shared buffers, fits into RAM) it looks like this:

[Charts: per-second throughput with 15-second running average, xeon / scale 1000, for each file system]

On ZFS with default configuration, the checkpoint pattern is clearly visible. With the tuned configuration, the pattern disappears and the behavior gets much more consistent. Not as smooth as on the smaller machine with multiple devices, but better than the other file systems.

There’s very little difference between EXT4 and XFS, both in total throughput and in behavior over time. The total throughput is higher than with ZFS (~60k vs. ~40k tps), but the jitter is more severe.

For BTRFS, the overall throughput is fairly low (~30k tps), while the jitter is in some ways better and in some ways worse than for EXT4/XFS: the per-second throughput mostly stays closer to the average, but it also regularly drops to 0.


xeon / scale 10000

On the large (~150GB) data set, the results look like this:

[Charts: per-second throughput with 15-second running average, xeon / scale 10000, for each file system]

This is similar to what we saw on the smaller machine—the checkpoint pattern disappears, due to the amount of FPWs that have to be written to WAL.

On ZFS with the tuned configuration, the jitter increases significantly. I’m not sure why, but I’d say it’s still fairly consistent (compared to the other file systems).

For EXT4/XFS, the throughput drops to ~50% (compared to the medium scale) while the jitter remains about the same—I’d argue this is a pretty good result.

With BTRFS we see the same unfortunate behavior as on the smaller machine: with the large data set the throughput gradually decreases, with significant jitter.


Conclusions

So, what have we learned? The first observation is that there are significant differences between the evaluated file systems—it’s hard to pick a clear winner, though.

The traditional file systems (EXT4/XFS) perform very well in OLTP workloads, at least in terms of total throughput. ZFS performs pretty well too—it’s a bit slower in throughput, but its behavior is very consistent, with minimal jitter (particularly with full-page writes disabled). It’s also quite flexible—for example, it allows moving the ZIL (intent log) to a separate device, which should further improve throughput.
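
Attaching a fast log device (SLOG) to an existing pool is a one-line operation; a sketch, with the pool name and device as placeholders:

    # Add a dedicated low-latency device as a separate ZFS intent log (SLOG).
    # "tank" and /dev/nvme1n1 are placeholders for your pool and spare device.
    zpool add tank log /dev/nvme1n1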

As for BTRFS, the results are not great. I did a similar OLTP benchmark a couple of years ago, and this time BTRFS actually performed a bit better. However, the overall consensus seems to be that BTRFS is not particularly well suited for databases, and others have observed this too. Which is a bit unfortunate, as some of its features (higher resilience, easy snapshotting) are very useful for databases.

The last thing to keep in mind when reading the results is that these are stress tests, fully saturating the I/O subsystem. That’s not what happens on most production systems—once the I/O stays saturated for extended periods of time, you’d already be thinking about upgrading the system to reduce the storage load. This typically means the latencies on real systems should be better than what’s shown here.
