Amazon S3 Adds Put-If-Match (Compare-and-Swap)

(aws.amazon.com)

168 points | by Sirupsen 3 hours ago

36 comments

  • Sirupsen an hour ago

    To avoid any dependencies other than object storage, we've been making use of this in our database (turbopuffer.com) for consensus and concurrency control since day one. Been waiting for this since the day we launched on Google Cloud Storage ~1 year ago. Our bet that S3 would get it in a reasonable time-frame worked out!

    https://turbopuffer.com/blog/turbopuffer

  • JoshTriplett an hour ago

    It's also possible to enforce the use of conditional writes: https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3...
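
    For reference, a rough sketch of what that enforcement could look like as a bucket policy, in boto3 (the s3:if-none-match condition key is my reading of that announcement, and "my-bucket" is a placeholder; check the current docs before relying on it):

        # Sketch: deny any PutObject that doesn't carry an If-None-Match
        # header, forcing every writer to use conditional puts.
        import json
        import boto3

        policy = {
            "Version": "2012-10-17",
            "Statement": [{
                "Sid": "RequireConditionalWrites",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": "arn:aws:s3:::my-bucket/*",
                # Deny when the condition key is absent from the request
                "Condition": {"Null": {"s3:if-none-match": "true"}},
            }],
        }

        boto3.client("s3").put_bucket_policy(
            Bucket="my-bucket", Policy=json.dumps(policy)
        )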

    My biggest wishlist item for S3 is the ability to enforce that an object's name matches the hash of its content. (With a modern hash considered secure, not MD5 or SHA-1, though S3 doesn't support this for those either.) That would make it much easier to build content-addressable storage.

    • anotheraccount9 9 minutes ago

      Could you store the hash in an object metadata field and compare against that?

    • cmeacham98 38 minutes ago

      Is there any reason you can't enforce that restriction on your side? Or are you saying you want S3 to automatically set the name for you based on the hash?

      • JoshTriplett 9 minutes ago

        > Is there any reason you can't enforce that restriction on your side?

        I'd like to set IAM permissions for a role so that the role can add objects to the content-addressable store, but only if each object's name matches the hash of its content.

        > Or are you saying you want S3 to automatically set the name for you based on the hash?

        I'm happy to name the files myself, if I can get S3 to enforce that. But sure, if it were easier, I'd be thrilled to have S3 name the files by hash, and/or support retrieving files by hash.
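
        A minimal client-side sketch of that naming convention (SHA-256 as the "modern hash"; the bucket name is a placeholder). It works, but nothing stops another writer with PutObject permission from breaking the convention, which is exactly the gap:

            # Content-addressed put: the object's key is the SHA-256 of
            # its bytes. Enforced only by this code, not by S3 itself.
            import hashlib
            import boto3

            def put_by_hash(bucket: str, data: bytes) -> str:
                key = hashlib.sha256(data).hexdigest()
                boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=data)
                return key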

    • jiggawatts 39 minutes ago

      That will probably never happen because of the fundamental nature of blob storage.

      Individual objects are split into multiple blocks, each of which can be stored independently on a different underlying server. Each server can see its own block, but not any of the others.

      Calculating a hash like SHA-256 would require a sequential scan through all the blocks. It could be done with minimal network traffic if, instead of streaming the bytes to a central server for hashing, the hash state were forwarded from block server to block server in sequence. Even so, it would be a very slow serial operation, and a fairly chatty one if there are many tiny blocks.

      What could work is a Merkle tree hash construction in which some of the subdivision boundaries match the block boundaries.
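
      A rough sketch of that construction (fixed 4 MiB blocks and SHA-256 are both arbitrary choices; purely illustrative):

          # Merkle root over fixed-size blocks: each server can hash its
          # own block independently, and only the small per-block digests
          # have to travel to wherever the root is combined.
          import hashlib

          BLOCK_SIZE = 4 * 1024 * 1024

          def merkle_root(data: bytes) -> bytes:
              level = [hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
                       for i in range(0, max(len(data), 1), BLOCK_SIZE)]
              while len(level) > 1:
                  if len(level) % 2:            # duplicate an odd leftover node
                      level.append(level[-1])
                  level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                           for i in range(0, len(level), 2)]
              return level[0]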

      • losteric 25 minutes ago

        Why does the architecture of the blob storage matter? The hash can be calculated as the data streams in on the first write, before it gets dispersed into multiple physically stored blocks.

  • 1a527dd5 2 hours ago

    Be still my beating heart. I have lived to see this day.

    Genuinely, we've wanted this for ages, and strong consistency got us halfway there.

    • ncruces 2 hours ago

      Might finally be possible to do this on S3: https://pkg.go.dev/github.com/ncruces/go-gcp/gmutex

    • paulddraper 2 hours ago

      So... given CAP, which one did they give up?

      • johnrob 2 hours ago

        I’d wager that the algorithm is slightly eager to throw a consistency error if it’s unable to verify across partitions. Since the caller is naturally ready for this error, it’s likely not a problem. So in short it’s the P :)

        • alanyilunli 2 hours ago

          Shouldn't that be the A then? Since the network partition is still there but availability is non-guaranteed.

          • johnrob an hour ago

            Yes, definitely. Good point. (I was knee-jerk assuming that A is always chosen and the real “choice” is between C and P.)

            • rhaen 4 minutes ago

              Well, P isn't really much of a choice; I don't think you can opt out of acts of God.

      • moralestapia an hour ago

        A tiny bit of availability, unnoticeable at web scale.

  • vlovich123 2 minutes ago

    I implemented that extension in R2 at launch, IIRC. Thanks for catching up and helping move distributed storage applications a meaningful step forward. Intended sincerely. I'm sure adding this was non-trivial in a complex legacy codebase like that.

  • koolba 3 hours ago

    This combined with the read-after-write consistency guarantee is a perfect building block (pun intended) for incremental append only storage atop an object store. It solves the biggest problem with coordinating multiple writers to a WAL.
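
    For example, multiple writers can race to claim the next WAL entry with a create-if-absent put. A sketch (assumes a boto3 recent enough to expose the IfNoneMatch parameter; the key layout is made up):

        # Each writer tries to create the next sequential WAL key;
        # IfNoneMatch="*" makes the put fail with 412 if the key already
        # exists, so exactly one writer wins each slot and losers retry.
        import boto3
        from botocore.exceptions import ClientError

        s3 = boto3.client("s3")

        def append_wal(bucket: str, seq: int, record: bytes) -> int:
            while True:
                try:
                    s3.put_object(Bucket=bucket, Key=f"wal/{seq:020d}",
                                  Body=record, IfNoneMatch="*")
                    return seq                # we own this slot
                except ClientError as e:
                    if e.response["Error"]["Code"] != "PreconditionFailed":
                        raise
                    seq += 1                  # lost the race; try the next slot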

    • IgorPartola 3 hours ago

      Atomic rename for objects and “directories”, too.

  • CubsFan1060 an hour ago

    I feel dumb for asking this, but can someone explain why this is such a big deal? I’m not quite sure I am grokking it yet.

    • lxgr an hour ago

      If my memory of parallel algorithms class serves me right, you can build any synchronization algorithm on top of compare-and-swap as an atomic primitive.

      As a toy example (horribly inefficient under non-trivial write contention), you could use S3 as a lock-free concurrent SQLite storage backend. Reads work as expected, fetching the entire database and satisfying the operation locally; writes work like this (a sketch in code follows the list):

      - Download the current database copy

      - Perform your write locally

      - Upload it back using "Put-If-Match", with the pre-edit copy's ETag as the expected value.

      - If you get success, consider the transaction successful.

      - If you get failure, go back to step 1 and try again.
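
      In code, the write path might look something like this (a sketch assuming a boto3 recent enough to expose IfMatch on put_object; the update function stands in for "perform your write locally"):

          # Optimistic-concurrency loop: read, modify, and put back with
          # If-Match; a 412 means someone else won, so re-read and retry.
          import boto3
          from botocore.exceptions import ClientError

          s3 = boto3.client("s3")

          def cas_update(bucket: str, key: str, update) -> None:
              while True:
                  obj = s3.get_object(Bucket=bucket, Key=key)
                  etag, body = obj["ETag"], obj["Body"].read()
                  try:
                      s3.put_object(Bucket=bucket, Key=key,
                                    Body=update(body), IfMatch=etag)
                      return                  # transaction succeeded
                  except ClientError as e:
                      if e.response["Error"]["Code"] != "PreconditionFailed":
                          raise               # a real error, not a lost race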

    • Sirupsen an hour ago

      The short of it is that building a database on top of object storage has generally required a complicated distributed system for consensus/metadata. CAS makes it possible to build these big-data systems without any other dependencies. This is a win for simplicity and reliability.

      • CubsFan1060 an hour ago

        Thanks! Do they mention when the comparison is done: before, during, or after the upload? (For instance, if I have a 4 TB file in a multipart upload, would I only find out it failed once the whole file had been uploaded?)

        • poincaredisk 18 minutes ago

          I imagine, for it to make sense, that the comparison is done at the last possible moment, before atomically swapping the file contents.

  • offmycloud 2 hours ago

    If the default ETag algorithm for non-encrypted, non-multipart uploads in AWS is a plain MD5 hash, is this subject to failure for object data with MD5 collisions?

    I'm thinking of a situation in which an application assumes that different (possibly adversarial) user-provided data will always generate a different ETag.

    • revnode an hour ago

      MD5 collisions are unlikely to happen at random. The defect is that they can be produced deliberately, which makes MD5 useless for security.

  • wanderingmind 34 minutes ago

    Does this mean that, in theory, we will be able to manage multiple concurrent writes/updates to S3 without having to use new solutions like Regatta [1], which launched recently?

    https://news.ycombinator.com/item?id=42174204

  • dvektor 25 minutes ago

    [rejected] error: failed to push some refs to remote repository

    Finally we can have this with S3 :)

  • sillysaurusx 3 hours ago

    Finally. GCP has had this for a long time. Years ago I was surprised S3 didn’t.

    • mannyv 11 minutes ago

      GCP still doesn't have triggers out of beta, last time I checked (which was a while ago).

      • fragmede 9 minutes ago

        Gmail was in beta for five years; I don't think that label really means anything.

    • ncruces 2 hours ago

      GCS is just missing x-amz-copy-source-range in my book.

      Can we have this Google?

      Please?
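
      (For context: that header lets you assemble a new object server-side out of byte ranges of existing objects, with no client download. Roughly, in boto3 terms, with bucket and keys as placeholders:)

          # Stitch the first 5 MiB of "source" into a new object entirely
          # server-side, via multipart upload + UploadPartCopy with a range.
          import boto3

          s3 = boto3.client("s3")
          mpu = s3.create_multipart_upload(Bucket="my-bucket", Key="stitched")
          part = s3.upload_part_copy(
              Bucket="my-bucket", Key="stitched",
              UploadId=mpu["UploadId"], PartNumber=1,
              CopySource={"Bucket": "my-bucket", "Key": "source"},
              CopySourceRange="bytes=0-5242879",
          )
          s3.complete_multipart_upload(
              Bucket="my-bucket", Key="stitched", UploadId=mpu["UploadId"],
              MultipartUpload={"Parts": [
                  {"ETag": part["CopyPartResult"]["ETag"], "PartNumber": 1}
              ]},
          )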

  • gravitronic 2 hours ago

    First thing I thought when I saw the headline was "oh! I should tell Sirupsen"

  • rrr_oh_man an hour ago

    Could anybody explain for the uninitiated?

    • msoad 42 minutes ago

      It ensures that when you try to upload (or “put”) a new version of a file, the operation only succeeds if the file on the server still has the exact version (ETag) you specify. If someone else has updated the file in the meantime, your upload is blocked to prevent overwriting their changes.

      This is especially useful in scenarios where multiple users or processes are working on the same data, as it helps maintain consistency and avoids accidental overwrites.

      This uses the same mechanism as HTTP's `If-Match` precondition header, so it's easy to implement and learn.
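
      Concretely, a single conditional put looks something like this (a sketch; assumes a boto3 recent enough to pass IfMatch through as the If-Match header, with bucket and key as placeholders):

          # Succeeds only if the object's ETag still matches what we read;
          # otherwise S3 rejects the put with 412 Precondition Failed.
          import boto3

          s3 = boto3.client("s3")
          head = s3.head_object(Bucket="my-bucket", Key="config.json")
          s3.put_object(Bucket="my-bucket", Key="config.json",
                        Body=b'{"version": 2}', IfMatch=head["ETag"])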

  • tonymet 2 hours ago

    Good example of how a feature that looks simple on the surface (a header comparison) requires tremendous complexity and capacity on the backend.

    • akira2501 2 hours ago

      S3 is rated as "durable" as opposed to "best effort." It has lots of interesting guarantees as a result.