There is also SlateDB, another work in progress take on this. HN link: https://news.ycombinator.com/item?id=41714858
Pretty cool, and it could be useful for content that isn't updated very often, like a CMS.
Pretty cool! Do you have any ideas already about how to make it work with S3, considering it doesn't support If- headers?
S3 recently added basic matching support (https://aws.amazon.com/about-aws/whats-new/2024/08/amazon-s3..., https://docs.aws.amazon.com/AmazonS3/latest/userguide/condit...).
They don't have the full suite of GCS's capabilities (https://cloud.google.com/storage/docs/request-preconditions#...) but it's something.
I think it's now much easier to achieve than a year ago. The critical one is conditional writes on new objects, because otherwise you can't safely create transaction logs in the presence of timeouts. This is not enough though.
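To make the timeout concern concrete, here is a minimal in-memory sketch of a create-only write. `ToyStore`, `put_if_absent`, and `commit_log_entry` are hypothetical names that only imitate the semantics of an `If-None-Match: *` precondition; this is not the S3 API and not necessarily how the project does it. It also assumes each log entry body is unique to its transaction, so a writer can recognize its own entry.

```python
class PreconditionFailed(Exception):
    """Raised when the key already exists (imitates an HTTP 412 response)."""


class ToyStore:
    """Hypothetical in-memory stand-in for an object store; not a real client."""

    def __init__(self):
        self._objects = {}  # key -> bytes

    def put_if_absent(self, key, body):
        # Imitates a PUT carrying `If-None-Match: *`: succeed only for new keys.
        if key in self._objects:
            raise PreconditionFailed(key)
        self._objects[key] = body

    def get(self, key):
        return self._objects[key]


def commit_log_entry(store, key, body):
    """Create a log entry; safe to call again after an ambiguous timeout.

    Returns True if `body` is what ended up stored under `key`.
    Assumes `body` is unique per transaction (e.g. it embeds a transaction id).
    """
    try:
        store.put_if_absent(key, body)
        return True
    except PreconditionFailed:
        # Either our earlier attempt landed, or another writer won the slot.
        return store.get(key) == body


store = ToyStore()
# First attempt: pretend the response was lost after the write was applied.
store.put_if_absent("txlog/000001", b"tx-A")
# Retry by the same writer: it detects that its own entry is already there.
retry_ok = commit_log_entry(store, "txlog/000001", b"tx-A")
# A competing writer cannot overwrite the slot.
rival_ok = commit_log_entry(store, "txlog/000001", b"tx-B")
```

Without the create-only condition, the retry after a timeout could silently overwrite an entry that another transaction wrote in the meantime.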
My approach on S3 would be to make sure an object's ETag changes whenever other transactions looking at it must be blocked. This makes it easier to use conditional reads (https://docs.aws.amazon.com/AmazonS3/latest/userguide/condit...) on COPY or GET operations.
For writes, I would PUT to a temporary staging area and then do a conditional COPY + DELETE afterward. This is certainly slower than on GCS, but I think it should work.
Locking without modifying the object is the part that needs some optimization though.
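One possible reading of the staging idea, as a rough in-memory sketch. Everything here is an assumption for illustration: `ToyStore`, `copy_if_match`, and `staged_write` are hypothetical, the toy store hands out a fresh ETag on every write, and the precondition is checked against the destination object. It does not claim that S3's COPY offers exactly this precondition.

```python
import itertools


class PreconditionFailed(Exception):
    """Raised when the expected ETag does not match (imitates HTTP 412)."""


class ToyStore:
    """Hypothetical in-memory object store; every write yields a fresh ETag."""

    def __init__(self):
        self._objects = {}  # key -> (etag, body)
        self._counter = itertools.count(1)

    def put(self, key, body):
        etag = f"v{next(self._counter)}"
        self._objects[key] = (etag, body)
        return etag

    def get(self, key):
        return self._objects[key]  # (etag, body)

    def copy_if_match(self, src, dst, expected_dst_etag):
        # Assumption: the precondition is evaluated against the destination.
        current_etag, _ = self._objects[dst]
        if current_etag != expected_dst_etag:
            raise PreconditionFailed(dst)
        return self.put(dst, self._objects[src][1])

    def delete(self, key):
        self._objects.pop(key, None)


def staged_write(store, key, seen_etag, new_body, staging_key):
    """PUT to staging, conditionally COPY over `key`, then DELETE staging."""
    store.put(staging_key, new_body)
    try:
        return store.copy_if_match(staging_key, key, seen_etag)
    finally:
        store.delete(staging_key)  # clean up whether or not the copy succeeded


store = ToyStore()
store.put("data/k", b"old")
etag_seen_by_t1, _ = store.get("data/k")
etag_seen_by_t2, _ = store.get("data/k")

# T1 commits first, which changes the ETag of data/k.
staged_write(store, "data/k", etag_seen_by_t1, b"from-t1", "staging/t1")

# T2 still holds the old ETag, so its conditional copy is rejected.
try:
    staged_write(store, "data/k", etag_seen_by_t2, b"from-t2", "staging/t2")
    t2_committed = True
except PreconditionFailed:
    t2_committed = False
```

The extra PUT and DELETE per write are where the added latency relative to a single conditional write would come from.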
And I see more possibilities now that https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3... is available. It will get easier and easier to build serverless data lakes, streaming systems, and queues.
Not a full solution, but seeing the OP seeks to be a key-value store (versus full RDBMS? despite the comparisons with Spanner and Postgres?), important to weigh how Rockset (also mainly KV store) dealt with S3-backed caching at scale:
Keep in mind Rockset is definitely a bit biased towards vector search use cases.
Nice, thanks for the reference!
BTW, the comparison was only to give an idea about isolation levels, it wasn't meant to be a feature-to-feature comparison.
Perhaps I didn't make it prominent enough, but at some point I say that many SQL databases have key-value stores at their core, and implement a SQL layer on top (e.g. https://www.cockroachlabs.com/docs/v22.1/architecture/overvi...).
Basically SQL can be a feature added later to a solid KV store as a base.
so... Delta Lake?
Congrats on reinventing the data lake? This is actually how most of the newer generations of "cloud native" databases work, where they separate compute and storage. The key is that they have a more sophisticated caching layer so that the latency cost of a query can be amortized across requests.
It's my understanding that the newer generation of data lakes still makes use of a tiny, strongly consistent metadata database to keep track of what is where. This is orders of magnitude smaller than what you'd have by putting everything in the same database, but it's still there. This is also the case in newer data streaming platforms (e.g. https://www.warpstream.com/blog/kafka-is-dead-long-live-kafk...).
I'm curious to hear if you have examples of any database using only object storage as a backend, because back when I started, I couldn't find any.
> I'm curious to hear if you have examples of any database using only object storage as a backend, because back when I started, I couldn't find any.
Take a look at Delta Lake
https://notes.eatonphil.com/2024-09-29-build-a-serverless-ac...
Love your article by the way. Not an expert but off the top of my head:
https://docs.datomic.com/operation/architecture.html
(However they cheat with dynamo lol)
There's also some listed here
https://davidgomes.com/separation-of-storage-and-compute-and...
OK, thanks for the reference. Yeah, so indeed separating storage and compute is nothing new. Definitely not claiming I invented that :)
And as you mention, Datomic uses DynamoDB as well (so, not a pure S3 solution). What I'm proposing is to use only object storage for everything and pay the price in latency, without giving up on throughput, cost, or consistency. The differentiator is that this comes with strict serializability guarantees, so this is not an eventually consistent system (https://jepsen.io/consistency/models/strong-serializable).
No matter how sophisticated the caching is, if you want to retain strict serializability, writes must be confirmed by S3 and reads must be validated against S3 before returning, which puts a lower bound on latency.
I focused a lot on throughput, which is the one we can really optimize.
Hopefully that's clear from the blog, though.
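As an illustration of the latency lower bound mentioned above, here is a small sketch of a cache that validates every read with an ETag check before returning. `ToyStore`, `get_if_none_match`, and `ValidatingCache` are hypothetical names that only imitate a GET with `If-None-Match`; this is not taken from the blog post or from any real client library.

```python
class ToyStore:
    """Hypothetical in-memory object store that counts round trips."""

    def __init__(self):
        self._objects = {}  # key -> (etag, body)
        self._version = 0
        self.round_trips = 0

    def put(self, key, body):
        self.round_trips += 1
        self._version += 1
        etag = f"v{self._version}"
        self._objects[key] = (etag, body)
        return etag

    def get_if_none_match(self, key, etag):
        # Imitates a GET with `If-None-Match`: None stands for "304 Not Modified".
        self.round_trips += 1
        current_etag, body = self._objects[key]
        if current_etag == etag:
            return None
        return current_etag, body


class ValidatingCache:
    """Serves cached bodies only after the store confirms they are current."""

    def __init__(self, store):
        self._store = store
        self._cache = {}  # key -> (etag, body)

    def read(self, key):
        cached_etag, cached_body = self._cache.get(key, (None, None))
        fresh = self._store.get_if_none_match(key, cached_etag)
        if fresh is None:
            return cached_body  # validated hit: no body transfer, still one round trip
        self._cache[key] = fresh
        return fresh[1]


store = ToyStore()
store.put("k", b"one")
cache = ValidatingCache(store)

first = cache.read("k")   # miss: fetches the body
second = cache.read("k")  # hit: validated against the store
store.put("k", b"two")    # another node writes behind the cache's back
third = cache.read("k")   # stale entry detected and refreshed
```

Even the cache hit costs one round trip to the store, which is the floor on read latency; the cache only saves transferring the body.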
Have you seen this? https://news.ycombinator.com/item?id=42174204
I just saw it! I asked a question (https://news.ycombinator.com/item?id=42180611) and it seems that durability and consistency are implemented at the caching layer.
Basically an in-memory database which uses S3 as cold storage. Definitely an interesting approach, but no transactions AFAICT.