One of the challenges when dealing with a new database design is that you don’t know how big the tables will end up being until they’re actually populated with a fair amount of data. But if the design has to factor in eventual scalability concerns, you can’t deploy it to obtain that data until the estimation is done. One way around this chicken/egg problem is to aggressively prototype things. Keep some staging hardware around that new applications can live on temporarily while details like this get sorted out. Then you just factor in that you’ll need to move the app, and possibly redesign it, after a few months, once you have a better idea what data is actually going to show up in it.
The other way to get around this chicken/egg problem is to write a data generator. Construct enough sample data by hand to see what it looks like, how dense it is, and how its values are distributed. Then write something that takes those statistics and produces a larger data set similar to it. You’ll never get that to be perfect, but it doesn’t have to be. Generating giant data sets, even with some flaws, is still the best way available to do database size estimation. There are so many hard-to-account-for sources of overhead that measuring the table and index sizes of something resembling your real data is going to be far more accurate than any other approach. There’s a reason I end up answering so many performance-related questions by using pgbench to create a database of the appropriate size first.
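To give a flavor of what I mean, here’s a minimal sketch of that kind of generator. The column layout and distributions here are invented for illustration, not the schema I was actually loading; the point is just to emit tab-delimited rows that COPY can read directly:

```python
#!/usr/bin/env python
# Minimal data generator sketch.  The columns and distributions are made up
# for illustration -- swap in the statistics measured from your hand-built
# sample data.
import random
import sys
from datetime import datetime, timedelta

ROWS = 1000000                   # crank this up once the output looks right
START = datetime(2010, 1, 1)     # earliest timestamp to generate
SPAN = timedelta(days=365)       # range the timestamps should cover

out = sys.stdout
for i in range(ROWS):
    # Skew timestamps toward the recent end of the range, the way activity
    # tables usually look in production.
    offset = SPAN.total_seconds() * (random.random() ** 0.5)
    ts = START + timedelta(seconds=offset)
    value = random.gauss(100.0, 15.0)  # whatever distribution matched your samples
    out.write("%d\t%s\t%.2f\n" % (i, ts.strftime("%Y-%m-%d %H:%M:%S"), value))
```

Redirect that into a file, or pipe it straight into a \copy ... from stdin, and you can measure real table and index sizes instead of guessing at them.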
Data generation isn’t easy, though. Generating realistic timestamps in particular is always annoying. And no matter how efficiently you think you’ve written it, a generator usually takes longer than you expect to run–and longer still to get the resulting data into a database and properly indexed.
And that continues to be the case no matter how many times you’ve done this, because even if you do everything right, Murphy’s Law will intrude to make the job painful regardless. My computers at home are all built from relatively cheap PC hardware. Not the cheapest stuff available–I do have standards–but certainly not the quality I would recommend people look for in a server. This week’s data generation problem reminded me of why better hardware is worth what it costs for business-critical work.
After generating a few billion rows of data and watching that import for 10 hours, I was not pleased to have the whole job abort like this:
psql:load.sql:10: ERROR: invalid input syntax for type timestamp: "201^Q-04-14 12:17:55"
CONTEXT: COPY testdata, line 181782001, column some_timestamp: "201^Q-04-14 12:17:55"
It turns out that somewhere in the middle of writing out the 250GB of test data I generated, one of the output lines was corrupted. Two bits flipped, and the data written out was wrong. I don’t know for sure where that happened.
Memory is the most likely suspect. Real servers use ECC RAM, and for good reason. If you monitor a system with a lot of RAM, the number of errors silently corrected by ECC can be shocking. The RAM I use at home is good, but error rates on any memory without error detection/correction capabilities are going to be higher than you might like–and never detected unless they happen in code that leads to a system crash. Data generation work is good at exposing these problems, because it typically puts at least one CPU on your server under heavy load for potentially days in a row. If there’s any instability in your system, getting it hot and keeping it busy for that long will aggravate it.
A second layer of protection against this sort of corruption is to put checksums on data as it’s written to disk, to catch errors when writing and then reading the data back again. The block checksumming done by the ZFS filesystem is one of the better implementations of that. In my case, though, using ZFS may not have made any difference. If the bits got flipped before the block checksum was computed, I’d have written junk data out–with a junk checksum to match it–regardless.
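If you’re writing your own generator, you can also bolt a cheap version of the same idea onto the data itself: checksum each line on the way out and verify it again just before loading. This is only a sketch of the idea with hypothetical helper functions, not something I actually did here, and it shares the same limitation–a bit flipped before the CRC is computed gets a matching checksum written right along with it:

```python
# Sketch of per-line checksumming for generated data.  Hypothetical helpers,
# not part of any real loader.
import zlib

def write_checksummed(out, fields):
    # Append a CRC32 of the line so corruption shows up when reading it back.
    line = "\t".join(fields)
    crc = zlib.crc32(line.encode("utf-8")) & 0xffffffff
    out.write("%s\t%08x\n" % (line, crc))

def verify_checksummed(raw):
    # True if the line still matches the CRC that was written with it.
    line, _, crc = raw.rstrip("\n").rpartition("\t")
    return zlib.crc32(line.encode("utf-8")) & 0xffffffff == int(crc, 16)
```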
My next step was to use the split utility to take my giant file and break it into more bite-size pieces–another couple of hours to wait for that to finish. Then I could kick off loading the good files while I fixed the bad one.
Given that the resulting files were 13GB each and this server has 16GB of RAM, I thought I could fix this one-character typo using vi. Theoretically that should be the case, but I’m starting to have my doubts given how long I’ve been waiting for the file to write back out again afterwards:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21495 gsmith 20 0 8542m 8.3g 1544 R 98 53.0 419:56.70 vi datafile-ag
That’s a solid 7 hours I’ve been waiting just for this one-character typo to get fixed, so I can finish loading the test data. Like I was saying, serious data generation is never easy.
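In hindsight, a streaming edit would likely have been far kinder than vi for a file this size: read it through once, patch the known-bad line, and write out a fixed copy, without ever holding 13GB in memory. Something along these lines is how I’d approach it next time; the file name and the corrected digit are only examples, since the error message alone doesn’t say what the corrupted byte started out as:

```python
# Hypothetical streaming fix for one corrupted line in a huge data file,
# instead of editing it in vi.  File names and replacement text are examples.
bad = "201\x11-04-14 12:17:55"   # the corrupted timestamp (^Q is byte 0x11)
good = "2011-04-14 12:17:55"     # whatever the digit was supposed to be

with open("datafile-ag", "rb") as src, open("datafile-ag.fixed", "wb") as dst:
    for line in src:
        dst.write(line.replace(bad.encode(), good.encode(), 1))
```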