Native ZFS VDEV for Object Storage (OpenZFS Summit)

(zettalane.com)

84 points | by suprasam 6 hours ago

18 comments

  • kev009 an hour ago

    I am curious about the inverse, using the dataset layer, to implement some higher level things like objects for an S3 compatible storage or pages directly for an RDBMS. I seem to remember hearing rumblings about that but it is hard to dredge up.

  • infogulch 2 hours ago

    How suitable would this be as a zfs send target to back up your local zfs datasets to object storage?

  • PunchyHamster 3 hours ago

    FS metrics without a random IO benchmark are nearly meaningless. Sequential read is the best case for basically every file system, and in this case it's essentially "how fast can you get things from S3".

    • gigatexal 2 hours ago

      Yup. IIRC low queue depth random reads are king for desktop usage.

  • yjftsjthsd-h 2 hours ago

    Could someone possibly compare this to https://www.zerofs.net/nbd-devices ("zpool create mypool /dev/nbd0 /dev/nbd1 /dev/nbd2")

    • 0x457 an hour ago

      I know I'm missing something, but I can't figure it out: why not just one device?

      • yjftsjthsd-h an hour ago

        IIRC the point is that each NBD device is backed by a different S3 endpoint, probably in different zones/regions/whatever for resiliency.

        Edit: Oops, "zpool create global-pool mirror /dev/nbd0 /dev/nbd1" is a better example for that. If it's not that, I'm not sure what that first example is doing.

        • 0x457 an hour ago

          In the context of real AWS S3, I can see RAID 0 being useful in this scenario, but a mirror seems like too much duplication, and cross-region replication like this is going to introduce significant latency[citation needed]. AWS already provides that for S3.

          I can see it for S3-compatible storage that isn't real AWS S3, though.

          • mgerdts 14 minutes ago

            Mirroring between s3 providers would seemingly give protection against your account being locked at one of them.

            I expect this becomes most interesting with L2ARC and log (ZIL) devices to hold the working set and hide write latency. It might require tuning or changes to allow 1M writes to use the log device.

  • curt15 3 hours ago

    How does this relate to the work presented a few years ago by the ZFS devs using S3 as object storage? https://youtu.be/opW9KhjOQ3Q?si=CgrYi0P4q9gz-2Mq

    • magicalhippo 2 hours ago

      Just going by the submitted article, it seems very similar in what it achieves, but seems to be implemented slightly differently. As I recall, the Delphix solution did not use a character device to communicate with the user-space S3 service, and it relied on a local NVMe-backed write cache to make 16 kB blocks performant by coalescing them into large objects (10 MB IIRC).

      This solution instead seems to rely on using 1 MB blocks and storing those directly as objects, eliminating the intermediate caching and indirection layer. That means a larger number of objects but less local overhead.

      Delphix's rationale for 16 kB blocks was that their primary use case was PostgreSQL database storage. I presume this is geared toward other workloads.

      And, importantly since we're on HN, Delphix's user-space service was written in Rust as I recall; this one uses Go.
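
      A rough sketch of the object-count trade-off described above. The 1 TiB pool size, binary units, fully packed objects, and the absence of metadata or compression overhead are all assumptions for illustration, not figures from the article or from either implementation.

```python
# Back-of-the-envelope object counts for the two layouts discussed above.
# Assumptions (hypothetical, not from the article): a 1 TiB pool, binary
# units, fully packed objects, no metadata or compression overhead.

KIB = 1024
MIB = 1024 * KIB
TIB = 1024 * 1024 * MIB


def object_count(pool_bytes: int, object_bytes: int) -> int:
    """Number of objects needed to hold pool_bytes, rounding up."""
    return -(-pool_bytes // object_bytes)


def blocks_per_object(object_bytes: int, block_bytes: int) -> int:
    """How many fixed-size blocks fit into one object."""
    return object_bytes // block_bytes


pool = 1 * TIB
coalesced = object_count(pool, 10 * MIB)  # 16 kB blocks packed into ~10 MB objects
direct = object_count(pool, 1 * MIB)      # one 1 MB block stored per object

print("coalesced objects:", coalesced)
print("direct objects:   ", direct)
print("16 kB blocks per 10 MB object:", blocks_per_object(10 * MIB, 16 * KIB))
```

      Under these assumptions the direct layout needs roughly ten times as many objects, which is the "larger number of objects" side of the trade-off; what it avoids is the local coalescing cache.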

    • tw04 3 hours ago

      AFAIK it was never released, and it used FUSE, it wasn’t native.

  • doktor2u 4 hours ago

    That’s brilliant! Always amazed at how zfs keeps morphing and stays relevant!

  • glemion43 3 hours ago

    I do not get it.

    Why would I use zfs for this? Isn't the power of zfs that it's a filesystem with checksum and stuff like encryption?

    Why would I use it for s3?

    • mustache_kimono 3 hours ago

      > Why would I use it for s3?

      You have it the wrong way around. Here, ZFS uses many small S3 objects as the storage substrate, rather than physical disks. The value proposition is that this should definitely be cheaper, and perhaps more durable, than EBS.

      See s3backer, a FUSE implementation of something similar: https://github.com/archiecobbs/s3backer

      See the prior in-kernel ZFS work by Delphix, which AFAIK was closed by Delphix management: https://www.youtube.com/watch?v=opW9KhjOQ3Q

      BTW this appears to be closed too!

    • bakies 3 hours ago

      I've got a massive storage server built that I want to run the S3 protocol on. It's already running ZFS. This is exactly what I want.

      zfs-share already implements SMB and NFS.

      • 0x457 2 hours ago

        This is not what it is. This is building a zpool on top of an S3 backend (vdev).

        Out of ignorance, I'm not sure what the use case is, but I guess one can use it to `zfs send` backups to S3 in a very neat manner.

        • lkjdsklf an hour ago

          One use case that comes to mind is backups. I can create a zpool backed by an S3 vdev and then use zfs send | zfs recv to back up datasets to S3 (or the billion other S3-like providers).

          Saves me the step of creating an instance with EBS volumes and snapshotting those to S3 or whatever.

          I haven't done the math at all on whether that's cost effective, but that's the use case that comes to mind immediately.