Even with O_DIRECT and aligned blocks, I still don't understand how the storage engine can return a "successful commit" to the client without a sync at some point, because a sync (IIRC) is the only way to guarantee an ATA/NVMe FUA command is sent, and the device write cache/buffer is committed.
:-/ it’s a statistical guarantee in the first place. A successful commit in a durable storage engine just needs to achieve some finite level of durability, like “10^-7 probability of loss per year”. The durability is a property of the whole system, and it is possible to achieve durability without fsync, you just may have a hard time explaining what the durability is, how you calculated it, and what the evidence or justifications are for the numbers you give.
Even if you just look at hardware failure rates, you get unrecoverable I/O errors (data corruption) at about one in 10^15 bits, disk failures at a rate of about 1% per year, etc. People usually like to have better guarantees than those numbers give you with just a plain fsync anyway; so you are probably forced to do an analysis of the whole system if you want to provide good durability guarantees and be able to explain where the guarantees come from.
To step back a bit, the device still has a filesystem on it, and the structures described here are files within the filesystem? Just you're able to write directly into them, bypassing the filesystem layer, because you've constrained yourself to writes that don't require updating other parts of the filesystem structure?
Author here. This is not a general argument against fsync; the design depends on SSD-only deployment, preallocated files, O_DIRECT, single-key atomicity, and device write guarantees.
Your approach looks interesting but I was curious when you talk about path-based splitting for ART, do you literally mean always on "/"? I know S3 directory buckets always use /, but the classical S3 model had no natural separator character and I was wondering if supporting those styles of prefix or custom delimiter queries suffered any impediment in your approach.
Bookmarked your whole blog for later consumption, interesting stuff!
I wonder why this is not more common. LVM is easy to set up, and it's already common to allocate volumes for things like disk images for VMs, so why not databases?
Because the speed increase is - on modern, properly tuned filesystems - surprisingly small, due to how RDBMS's manage their pool; by working on large container files, they avoid most of the filesystem overhead.
> fsync doesn’t just sync the file’s data, it syncs every piece of metadata the file depends on: ... directory entry
Famously not, as the man page says.
It is also said later in the article:
> POSIX strictly requires a parent-directory fsync to make a newly created file’s existence durable.
So I'm not sure why the dirent sync is claimed earlier.
Even with O_DIRECT and aligned blocks, I still don't understand how the storage engine can return a "successful commit" to the client without a sync at some point, because a sync (IIRC) is the only way to guarantee an ATA/NVMe FUA command is sent, and the device write cache/buffer is committed.
:-/ it’s a statistical guarantee in the first place. A successful commit in a durable storage engine just needs to achieve some finite level of durability, like “10^-7 probability of loss per year”. The durability is a property of the whole system, and it is possible to achieve durability without fsync, you just may have a hard time explaining what the durability is, how you calculated it, and what the evidence or justifications are for the numbers you give.
Even if you just look at hardware failure rates, you get unrecoverable I/O errors (data corruption) at about one in 10^15 bits, disk failures at a rate of about 1% per year, etc. People usually like to have better guarantees than those numbers give you with just a plain fsync anyway; so you are probably forced to do an analysis of the whole system if you want to provide good durability guarantees and be able to explain where the guarantees come from.
To step back a bit, the device still has a filesystem on it, and the structures described here are files within the filesystem? Just you're able to write directly into them, bypassing the filesystem layer, because you've constrained yourself to writes that don't require updating other parts of the filesystem structure?
Author here. This is not a general argument against fsync; the design depends on SSD-only deployment, preallocated files, O_DIRECT, single-key atomicity, and device write guarantees.
Your approach looks interesting but I was curious when you talk about path-based splitting for ART, do you literally mean always on "/"? I know S3 directory buckets always use /, but the classical S3 model had no natural separator character and I was wondering if supporting those styles of prefix or custom delimiter queries suffered any impediment in your approach.
Bookmarked your whole blog for later consumption, interesting stuff!
Am i understanding correctly that you are just targeting consistency and not durability?
Working with files is hard [1], and most of the complicity is from the fsync API. I am glad it can be eliminated from a kv storage engine.
[1] https://news.ycombinator.com/item?id=42805425
This is really great work. Kudos to the team for such an elegant solution.
Almost full-circle back to when Oracle took over the entire volume and implemented its own filesystem.
I wonder why this is not more common. LVM is easy to set up, and it's already common to allocate volumes for things like disk images for VMs, so why not databases?
Because the speed increase is - on modern, properly tuned filesystems - surprisingly small, due to how RDBMS's manage their pool; by working on large container files, they avoid most of the filesystem overhead.