To avoid any dependencies other than object storage, we've been making use of this in our database (turbopuffer.com) for consensus and concurrency control since day one. Been waiting for this since the day we launched on Google Cloud Storage ~1 year ago. Our bet that S3 would get it in a reasonable time-frame worked out!
https://turbopuffer.com/blog/turbopuffer
It's also possible to enforce the use of conditional writes: https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3...
My biggest wishlist item for S3 is the ability to enforce that an object is stored under a name that matches the hash of its content. (With a modern hash that's still considered secure, not MD5 or SHA-1, though it isn't supported for those either.) That would make it much easier to build content-addressable storage.
Could you use a meta field from the object and save the hash in it, running a compare from it?
Is there any reason you can't enforce that restriction on your side? Or are you saying you want S3 to automatically set the name for you based on the hash?
> Is there any reason you can't enforce that restriction on your side?
I'd like to set IAM permissions for a role so that the role can add objects to the content-addressable store, but only if their names match the hashes of their content.
> Or are you saying you want S3 to automatically set the name for you based on the hash?
I'm happy to name the files myself, if I can get S3 to enforce that. But sure, if it were easier, I'd be thrilled to have S3 name the files by hash, and/or support retrieving files by hash.
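For what it's worth, the client-side version of this, which is all you can do today as far as I know, is just hashing before you upload. A minimal sketch (the bucket name and helper are hypothetical, and nothing server-side verifies the invariant, which is exactly the gap being described):

```python
import hashlib

import boto3

s3 = boto3.client("s3")

def put_content_addressed(bucket: str, data: bytes) -> str:
    """Store `data` under its own SHA-256 hex digest and return that key."""
    key = hashlib.sha256(data).hexdigest()
    # Nothing on the S3 side checks that key == sha256(data); the invariant
    # only holds as long as every writer goes through a helper like this one.
    s3.put_object(Bucket=bucket, Key=key, Body=data)
    return key
```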
That will probably never happen because of the fundamental nature of blob storage.
Individual objects are split into multiple blocks, each of which can be stored independently on different underlying servers. Each server can see its own block, but not any of the others.
Calculating a hash like SHA-256 would require a sequential scan through all the blocks. It could be done with a minimum of network traffic if, instead of streaming the bytes to a central server to hash, the hash state were forwarded from block server to block server in sequence. Even then, it would be a slow, serial operation, and a fairly chatty one if there are many tiny blocks.
What could work would be a Merkle-tree hash construction where some of the subdivision boundaries align with the block boundaries.
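A minimal sketch of that idea, assuming fixed-size blocks purely for illustration: hash each block independently, then hash the concatenated block digests to get the object's identifier.

```python
import hashlib

BLOCK_SIZE = 8 * 1024 * 1024  # illustrative only; S3's real block layout is internal

def merkle_root(data: bytes, block_size: int = BLOCK_SIZE) -> str:
    """Two-level Merkle hash: leaves = SHA-256 of each block, root = SHA-256 over the leaves."""
    leaves = [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]
    # Each leaf could be computed by whichever server holds that block, in parallel;
    # only the fixed-size digests have to travel to whoever computes the root.
    return hashlib.sha256(b"".join(leaves)).hexdigest()
```

This is similar in spirit to how S3 already computes multipart ETags: the MD5 of the parts' MD5s, suffixed with the part count.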
Why does the architecture of blob storage matter? The hash can be calculated as the data streams in on the first write, before it gets dispersed into multiple physically stored blocks.
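In other words, the usual incremental-hash pattern on the ingest path; purely a sketch, where the chunk source stands in for whatever the upload handler hands you:

```python
import hashlib

def sha256_of_stream(chunks) -> str:
    """Fold the hash state over the incoming byte stream; no need to buffer the whole object."""
    h = hashlib.sha256()
    for chunk in chunks:   # e.g. chunks arriving from the client's upload
        h.update(chunk)
    return h.hexdigest()
```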
Be still my beating heart. I have lived to see this day.
Genuinely, we've wanted this for ages, and we got halfway there with strong consistency.
Might finally be possible to do this on S3: https://pkg.go.dev/github.com/ncruces/go-gcp/gmutex
So... given CAP, which one did they give up?
I’d wager that the algorithm is slightly eager to throw a consistency error if it’s unable to verify across partitions. Since the caller is naturally ready for this error, it’s likely not a problem. So in short it’s the P :)
Shouldn't that be the A then? Since the network partition is still there but availability is non-guaranteed.
Yes, definitely. Good point (I was knee jerk assuming the A is always chosen and the real “choice” is between C and P).
Well, P isn't really much of a choice, I don't think you can opt out of acts of god.
A tiny bit of availability, unnoticeable at web scale.
I implemented that extension in R2 at launch IIRC. Thanks for catching up & helping move distributed storage applications a meaningful step forward. Intended sincerely. I'm sure adding this was non-trivial for a complex legacy codebase like that.
This, combined with the read-after-write consistency guarantee, is a perfect building block (pun intended) for incremental, append-only storage atop an object store. It solves the biggest problem with coordinating multiple writers to a WAL.
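A rough sketch of what that coordination can look like (bucket and key names are made up, and it assumes your boto3 version exposes the If-Match header as the IfMatch parameter on put_object): each writer uploads its segment under a unique, uncontended key, then tries to advance a small manifest object with a conditional put. Only one writer's CAS succeeds per round; the losers re-read the manifest and retry.

```python
import json
import uuid

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "my-wal-bucket"          # hypothetical bucket
MANIFEST = "wal/MANIFEST.json"    # small object listing the committed segments, in order

def try_append_segment(payload: bytes) -> bool:
    """Attempt to commit one WAL segment; returns False if another writer won the race."""
    seg_key = f"wal/segments/{uuid.uuid4().hex}"
    s3.put_object(Bucket=BUCKET, Key=seg_key, Body=payload)   # data write, never contended

    current = s3.get_object(Bucket=BUCKET, Key=MANIFEST)
    manifest = json.loads(current["Body"].read())
    manifest["segments"].append(seg_key)
    try:
        # CAS on the manifest: only succeeds if nobody advanced it since we read it.
        s3.put_object(
            Bucket=BUCKET,
            Key=MANIFEST,
            Body=json.dumps(manifest).encode(),
            IfMatch=current["ETag"],   # assumption: boto3 maps IfMatch -> If-Match
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "PreconditionFailed":
            return False               # lost the race; caller re-reads and retries
        raise
```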
Rename for objects and “directories” also. Atomic.
I feel dumb for asking this, but can someone explain why this is such a big deal? I’m not quite sure I am grokking it yet.
If my memory of parallel algorithms class serves me right, you can build any synchronization algorithm on top of compare-and-swap as an atomic primitive.
As a (horribly inefficient, in case of non-trivial write contention) toy example, you could use S3 as a lock-free concurrent SQLite storage backend: Reads work as expected by fetching the entire database and satisfying the operation locally; writes work like this (rough sketch after the steps):
- Download the current database copy
- Perform your write locally
- Upload it back using "Put-If-Match" with the ETag of the pre-edit copy as the expected value.
- If you get success, consider the transaction successful.
- If you get failure, go back to step 1 and try again.
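That loop, sketched against boto3 (hypothetical bucket/key names; assumes the If-Match header is exposed as the IfMatch parameter on put_object):

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "app.sqlite"   # hypothetical names

def transact(mutate):
    """Optimistic write loop: read, mutate locally, compare-and-swap the result back."""
    while True:
        obj = s3.get_object(Bucket=BUCKET, Key=KEY)
        db_bytes, etag = obj["Body"].read(), obj["ETag"]
        new_bytes = mutate(db_bytes)              # perform the write locally
        try:
            s3.put_object(
                Bucket=BUCKET, Key=KEY, Body=new_bytes,
                IfMatch=etag,                     # assumption: maps to the If-Match header
            )
            return                                # CAS succeeded: transaction committed
        except ClientError as e:
            if e.response["Error"]["Code"] != "PreconditionFailed":
                raise
            # Someone else won the race; go back to step 1 against the fresh copy.
```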
The short of it is that building a database on top of object storage has generally required a complicated, distributed system for consensus/metadata. CAS makes it possible to build these big data systems without any other dependencies. This is a win for simplicity and reliability.
Thanks! Do they mention when the comparison is done? Is it before, after, or during the upload? (For instance, with a 4 TB multipart upload, would I only find out that it failed once the whole file had been uploaded?)
I imagine, for it to make sense, that the comparison is done at the last possible moment, before atomically swapping the file contents.
If the default ETag algorithm for non-encrypted, non-multipart uploads in AWS is a plain MD5 hash, is this subject to failure for object data with MD5 collisions?
I'm thinking of a situation in which an application assumes that different (possibly adversarial) user-provided data will always generate a different ETag.
MD5 collisions are extremely unlikely to happen at random. The defect is that you can construct them on purpose, which is what makes MD5 useless for security.
Does this mean, in theory we will be able to manage multiple concurrent writes/updates to s3 without having to use new solutions like Regatta[1] that was recently launched?
https://news.ycombinator.com/item?id=42174204
[rejected] error: failed to push some refs to remote repository
Finally we can have this with s3 :)
Finally. GCP has had this for a long time. Years ago I was surprised S3 didn’t.
GCP still didn't have triggers out of beta the last time I checked (which was a while ago).
Gmail was in beta for five years, I don't think that label really means anything.
GCS is just missing x-amz-copy-source-range in my book.
Can we have this Google?
…
Please?
First thing I thought when I saw the headline was "oh! I should tell Sirupsen"
Could anybody explain for the uninitiated?
It ensures that when you try to upload (or “put”) a new version of a file, the operation only succeeds if the file on the server still has the exact version (ETag) you specify. If someone else has updated the file in the meantime, your upload is blocked to prevent overwriting their changes.
This is especially useful in scenarios where multiple users or processes are working on the same data, as it helps maintain consistency and avoids accidental overwrites.
This uses the same mechanism as HTTP's conditional request headers (`If-Match` / `If-None-Match`), so it's easy to implement and learn.
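The create-only-if-absent flavour of the same mechanism (`If-None-Match: *`) looks like this in practice; a minimal sketch, assuming boto3 exposes the header as the IfNoneMatch parameter on put_object:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def create_if_absent(bucket: str, key: str, body: bytes) -> bool:
    """PUT the object only if nothing exists under `key`; False if another writer got there first."""
    try:
        # If-None-Match: * means "only succeed if there is no current object".
        s3.put_object(Bucket=bucket, Key=key, Body=body, IfNoneMatch="*")
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "PreconditionFailed":
            return False
        raise
```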
Good example of how a feature that looks simple on the surface (a header comparison) requires tremendous complexity and capacity on the backend.
S3 is rated as "durable" as opposed to "best effort." It has lots of interesting guarantees as a result.