Understanding ZFS Scrubs and Data Integrity

(klarasystems.com)

53 points | by zdw 6 days ago

22 comments

thatcks 5 hours ago

The article is correct but it downplays an important limitation of ZFS scrubs when it talks about how they're different from fsck and chkdsk. As the article says (in different words), ZFS scrubs do not check filesystem objects for correctness and consistency; it only checks that they have the expected checksum and so have not become corrupted due to disk errors or other problems. Unfortunately it's possible for ZFS bugs and issues to give you filesystem objects that have problems, and as it stands today ZFS doesn't have anything that either checks or corrects these. Sometimes you find them through incorrect results; sometimes you discover they exist through ZFS assertion failures triggering kernel panics.

(We run ZFS in production and have not been hit by these issues, at least not that we know about. But I know of some historical ZFS bugs in this area and mysterious issues that AFAIK have never been fully diagnosed.)

wereHamster an hour ago

A loooong time ago (OpenSolaris days) I had a system that had corrupted its zfs. No fsck was available because the developers claimed (maybe still do) that it's unnecessary.

I had to poke around the raw device (with dd and such) to restore the primary superblock with one of the copies (that zfs keeps in different locations on the device). So clearly the zfs devs thought about the possibility of a corrupt superblock, but didn't feel the need to provide a tool to compare the superblocks and restore one from the other copies. That was the point when I stopped trusting zfs.

Such arrogance…

mustache_kimono 4 hours ago

    "Scrubs differ significantly from traditional filesystem checks. Tools such as fsck or chkdsk examine logical structures and attempt to repair inconsistencies related to directory trees, allocation maps, reference counts, and other metadata relationships. ZFS does not need to perform these operations during normal scrubs because its transactional design ensures metadata consistency. Every transaction group moves the filesystem from one valid state to another. The scrub verifies the correctness of the data and metadata at the block level, not logical relationships."

> ZFS scrubs do not check filesystem objects for correctness and consistency; it only checks that they have the expected checksum and so have not become corrupted due to disk errors or other problems

A scrub literally reads the object from disk. And, for each block, the checksums are read up the tree. The object is therefore guaranteed to be correct and consistent at least re: the tree of blocks written.

> Unfortunately it's possible for ZFS bugs and issues to give you filesystem objects that have problems

Can you give a more concrete example of what you mean? It sounds like you have some experience with ZFS, but "ZFS doesn't have an fsck" is also some truly ancient FUD, so you will forgive my skepticism.

I'm willing to believe that you request an object and ZFS cannot return that object because of ... a checksum error or a read error in a single disk configuration, but what I have never seen is a scrub that indicates everything is fine, and then reads which don't return an object (because scrubs are just reads themselves?).

Now, are things like pool metadata corruption possible in ZFS? Yes, certainly. I'm just not sure fsck would or could help you out of the same jam if you were using XFS or ext4. AFAIK fsck may repair inconsistencies but I'm not sure it can repair metadata any better than ZFS can?

agapon 3 hours ago

Generally, it's possible to have data which is not corrupted but which is logically inconsistent (incorrect).

Imagine that a directory ZAP has an entry that points to a bogus object ID. That would be an example. The ZAP block is intact but its content is inconsistent.

Such things can only happen through a logical bug in ZFS itself, not through some external force. But bugs do happen.

If you search through OpenZFS bugs you will find multiple instances. Things like leaking objects or space, etc. That's why zdb now has support for some consistency checking (but not for repairs).
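
To make the distinction concrete, here is a toy sketch in shell. It is not how ZFS stores directories: the file layout, the object numbers, and the `scrub_check` and `logical_check` helpers are all invented for illustration. It only shows that a block can pass its checksum while referring to an object that does not exist.

```shell
#!/bin/bash
# Toy model only: this is NOT the ZFS on-disk format.

workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT
mkdir "$workdir/objects"
echo "file contents" > "$workdir/objects/7"     # object 7 exists

# A "directory" block mapping names to object IDs. A hypothetical software
# bug records object 99, which was never created.
printf 'notes.txt 7\nreport.txt 99\n' > "$workdir/dir"

# Record the checksum at write time, as a checksumming filesystem would.
stored_sum="$(cksum < "$workdir/dir")"

# Scrub-style check: re-read the block and compare checksums.
scrub_check() {
    if [ "$(cksum < "$workdir/dir")" = "$stored_sum" ]; then
        echo "checksum ok"
    else
        echo "checksum mismatch"
    fi
}

# fsck-style check: follow each entry and confirm its target exists.
logical_check() {
    local name id bad=0
    while read -r name id; do
        [ -e "$workdir/objects/$id" ] || { echo "dangling: $name -> $id"; bad=1; }
    done < "$workdir/dir"
    [ "$bad" -eq 0 ] && echo "references ok"
    return 0
}

scrub_check      # the block is exactly what was written
logical_check    # but what was written is wrong
```

The first check passes because the bytes on disk match what was written; only the second check, which interprets the contents, notices the bad reference.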

mustache_kimono 3 hours ago

> Imagine that a directory ZAP has an entry that points to a bogus object ID. That would be an example. The ZAP block is intact but its content is inconsistent.

The above is interesting and fair enough, but a few points:

First, I'm not sure that makes what seems to be the parent's point -- that scrub is an inadequate replacement for an fsck.

Second, I'm really unsure if your case is the situation the parent is referring to. Parent seems to be indicating actual data loss is occurring. Not leaking objects or space or bogus object IDs. Parent seems to be saying she/he scrubs with no errors and then when she/he tries to read back a file, oops, ZFS can't.

ori_b 4 hours ago

Imagine a race condition that writes a file node where a directory node should be. You have a valid object with a valid checksum, but it's hooked into the wrong place in your data structure.

mustache_kimono 4 hours ago

> Imagine a race condition that writes a file node where a directory node should be. You have a valid object with a valid checksum, but it's hooked into the wrong place in your data structure.

A few things: 1) Is this an actual ZFS issue you encountered or is this a hypothetical? 2) And -- you don't imagine this would be discovered during a scrub? Why not? 3) But -- you do imagine it would be discovered and repaired by an fsck instead? Why so? 4) If so, wouldn't this just be a bug, as it would be in an fsck, not some fundamental limitation of the system?

FWIW I've never seen anything like this. I have seen Linux plus a flaky ALPM implementation drop reads and writes. I have seen ZFS notice, at the very moment the power dropped, via errors in `zpool status`. I do wonder if ext4's fsck or XFS's fsck does the same when someone who didn't know any better (like me!) sets the power management policy to "min_power" or "med_power_with_dipm".

klempner 5 hours ago

>HDDs typically have a BER (Bit Error Rate) of 1 in 10^15, meaning some incorrect data can be expected around every 100 TiB read. That used to be a lot, but now that is only 3 or 4 full drive reads on modern large-scale drives. Silent corruption is one of those problems you only notice after it has already done damage.

While the advice is sound, this number isn't the right number for this argument.

That 10^15 number is for UREs, which aren't going to cause silent data corruption -- simple naive RAID style mirroring/parity will easily recover from a known error of this sort without any filesystem layer checksumming. The rates for silent errors, where the disk returns the wrong data that benefit from checksumming, are a couple of orders of magnitude lower.
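
For scale, the quoted rate can be converted into data read per expected error. This is a quick sketch that takes the 1-in-10^15 figure at face value; the variable names are just for illustration.

```shell
#!/bin/bash
# One error per 10^15 bits read, expressed as an amount of data.
bits=$(( 10 ** 15 ))
bytes=$(( bits / 8 ))            # 125,000,000,000,000 bytes
tb=$(( bytes / 10 ** 12 ))       # decimal terabytes
tib=$(( bytes / 2 ** 40 ))       # binary tebibytes, rounded down
echo "${tb} TB (about ${tib} TiB) read per expected error"
```

So the article's "around every 100 TiB" is a rounded version of roughly 113 TiB.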

itchingsphynx 6 hours ago

>Most systems that include ZFS schedule scrubs once per month. This frequency is appropriate for many environments, but high churn systems may require more frequent scrubs.

Is there a more specific 'rule of thumb' for scrub frequency? What variables should one consider?

toast0 6 hours ago

Once a month seems like a reasonable rule of thumb.

But you're balancing the cost of the scrub vs the benefit of learning about a problem as soon as possible.

A scrub does a lot of I/O and a fair amount of computing. The scrub load competes with your application load and depending on the size of your disk(s) and their read bandwidth, it may take quite some time to do the scrub. There's even maybe some potential that the read load could push a weak drive over the edge to failure.

On my personal servers, application load is nearly meaningless, so I do a roughly monthly scrub from cron, which I think will only scrub one zpool at a time per machine; that seems reasonable enough to me. I run relatively large spinning disks, so if I scrubbed on a daily basis, the drives would spend most of the day scrubbing and that doesn't seem reasonable. I haven't run ZFS in a work environment... I'd have to really consider how the read load impacted the production load and if scrubbing with limits to reduce production impact would complete in a reasonable amount of time... I've run some systems that are essentially always busy, and if a scrub would take several months, I'd probably only scrub when other systems indicate a problem and I can take the machine out of rotation to examine it.

If I had very high reliability needs or a long time to get replacement drives, I might scrub more often?

If I was worried about power consumption, I might scrub less often (and also let my servers and drives go into standby). The article's recommendation to scan at least once every 4 months seems pretty reasonable, although if you have seriously offline disks, maybe once a year is more approachable. I don't think I'd push beyond that, lots of things don't like to sit for a year and then turn on correctly.

kanbankaren 3 hours ago

Once a month might be too high because HDDs are rated at ~ 180 TB workload/year. Remember, the workload/year limit includes read & writes and doesn't vary much by capacity, so a 10 TB HDD scrubbed monthly consumes 67% of the workload, let alone any other usage.

Scrubbing every quarter is usually sufficient without putting high wear on the HDD.

Hakkin 2 hours ago

A scrub only reads allocated space, so in your 10TB example, a scrub would only read whatever portion of that 10TB is actually occupied by data. It's also usually recommended to keep your usage below 80% of the total pool size to avoid performance issues, so the worst case in your scenario would be more like ~53% assuming you follow the 80% rule.
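
The percentages in this subthread can be reproduced with a small helper. This is only a sketch: `scrub_workload_pct` is a made-up name, and the 180 TB/year rating and 10 TB drive are the figures quoted above, not manufacturer data.

```shell
#!/bin/bash
# Share of a drive's annual workload rating consumed by scrubs alone.
# Arguments: allocated TB, scrubs per year, rated TB per year.
scrub_workload_pct() {
    local allocated_tb="$1" scrubs_per_year="$2" rated_tb_per_year="$3"
    awk -v a="$allocated_tb" -v n="$scrubs_per_year" -v r="$rated_tb_per_year" \
        'BEGIN { printf "%.0f\n", 100 * a * n / r }'
}

scrub_workload_pct 10 12 180   # completely full 10 TB drive, monthly
scrub_workload_pct 8 12 180    # 80% full, monthly
scrub_workload_pct 8 4 180     # 80% full, quarterly
```

With these inputs the three calls print 67, 53 and 18, which matches the numbers above and shows how much a quarterly schedule reduces the read load.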

formerly_proven an hour ago

Is the 80% rule real or just passed down across decades like other “x% free” rules? Those waste enormous amounts of resources on modern systems and I kind of doubt ZFS actually needs a dozen terabytes or more of free space in order to not shit the bed. Just like Linux doesn’t actually need >100 GB of free memory to work properly.

atmosx 5 hours ago

Once a month is fine ("/etc/cron.monthly/zfs-scrub"):

    #!/bin/bash
    #
    # ZFS scrub script for monthly maintenance
    # Place in /etc/cron.monthly/zfs-scrub
    
    POOL="storage"
    TAG="zfs-scrub"
    
    # Log start
    logger -t "$TAG" -p user.notice "Starting ZFS scrub on pool: $POOL"
    
    # Run the scrub
    if /sbin/zpool scrub "$POOL"; then
        logger -t "$TAG" -p user.notice "ZFS scrub initiated successfully on pool: $POOL"
    else
        logger -t "$TAG" -p user.err "Failed to start ZFS scrub on pool: $POOL"
        exit 1
    fi
    
    exit 0

k_bx 3 hours ago

Didn't know about the logger script, looks nice. Can it wrap the launch of the scrub itself so that it logs like logger too, or do you separately track its stdout/stderr when something happens?

update: figured how you can improve that call to add logs to logger

nubinetwork 24 minutes ago

Scrub doesn't log anything by default, you run it and it returns quickly... you have to get the results out of zpool status or through zed.
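
One way to get the result into syslog is to pull the "scan:" line out of `zpool status` once the scrub has finished. The sketch below assumes that line starts with the label `scan:`, as it does in the OpenZFS releases I am aware of; the wording after the label varies between versions. The `scan_line` helper is a made-up name and the sample text is illustrative, not taken from a real pool.

```shell
#!/bin/bash
# Extract the "scan:" summary from `zpool status` output read on stdin.
# Only the label is matched; the rest of the line is passed through as-is.
scan_line() {
    sed -n 's/^[[:space:]]*scan:[[:space:]]*//p' | head -n 1
}

# Illustrative sample (an assumption about the format, not real output).
sample='  pool: storage
 state: ONLINE
  scan: scrub repaired 0B in 00:31:12 with 0 errors on Sun Dec  7 00:55:13 2025
config:'

printf '%s\n' "$sample" | scan_line
```

On a real system the pipeline would look something like `/sbin/zpool status storage | scan_line | logger -t zfs-scrub`, run after the scrub has completed.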

chungy 4 hours ago

That script might do with the "-w" parameter passed to scrub. Then "zpool scrub" won't return until the scrub is finished.
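
A sketch of what that could look like, assuming an OpenZFS version whose `zpool scrub` accepts `-w` (2.0 or newer, as far as I know). The `scrub_and_wait` and `log` helpers and the overridable `ZPOOL` variable are additions for illustration. The check for "is healthy" assumes that is what `zpool status -x` prints for a pool without problems.

```shell
#!/bin/bash
# Variant of the monthly script that waits for the scrub to finish
# and then logs whether the pool looks healthy.

POOL="${POOL:-storage}"
TAG="zfs-scrub"
ZPOOL="${ZPOOL:-/sbin/zpool}"

log() {  # usage: log PRIORITY MESSAGE
    logger -t "$TAG" -p "$1" "$2"
}

scrub_and_wait() {
    local pool="$1" status
    log user.notice "Starting ZFS scrub on pool: $pool"
    # -w blocks until the scrub completes, so the status check below
    # sees the finished result rather than a scrub still in progress.
    if ! "$ZPOOL" scrub -w "$pool"; then
        log user.err "ZFS scrub did not complete on pool: $pool"
        return 1
    fi
    # Assumption: "zpool status -x POOL" reports "... is healthy" when OK.
    status="$("$ZPOOL" status -x "$pool")"
    case "$status" in
        *"is healthy"*) log user.notice "Scrub finished, pool healthy: $pool" ;;
        *)              log user.err "Scrub finished, pool needs attention: $pool" ;;
    esac
}

# Run only where the binary and the pool actually exist.
if [ -x "$ZPOOL" ] && "$ZPOOL" list "$POOL" >/dev/null 2>&1; then
    scrub_and_wait "$POOL" || exit 1
fi
```

The trade-off is that the cron job now stays alive for the whole scrub, which may be hours on large pools.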

ssl-3 6 hours ago

The cost of a scrub is just a flurry of disk reads and a reduction in performance during a scrub.

If this cost is affordable on a daily basis, then do a scrub daily. If it's only affordable less often, then do it less often.

(Whatever the case: It's not like a scrub causes any harm to the hardware or the data. It can run as frequently as you elect to tolerate.)

agapon 3 hours ago

With HDDs, it's also mechanical wear and increased chance of a failure. SSDs are not fully immune to increased load either.

ssl-3 3 hours ago

Is there any evidence that suggests that reading from a hard drive (instead of it just spinning idle) increases physical wear in any meaningful way? Likewise, is there any evidence of this for solid-state storage?

rcxdude an hour ago

Yes. Hard drives have published "Annualized Workload Rate" ratings, which are in TB/year, and the manufacturers state there is no difference between reads and writes for the purpose of this rating.

(https://www.toshiba-storage.com/trends-technology/mttf-what-...)

For SSDs, writes matter a lot more. Reads may increase the temperature of the drive, so they'll have some effect, but I don't think I've seen a read endurance rating for an SSD.

nubinetwork 6 hours ago

Total pool size and speed. Less data scrubs faster, as do faster disks or disk topology (a 3 way stripe of nvme will scrub faster than a single sata ssd)

For what it's worth, I scrub daily mostly because I can. It's completely overkill, but if it only takes half an hour, then it can run in the middle of the night while I'm sleeping.