Garage – An S3 object store so reliable you can run it outside datacenters

(garagehq.deuxfleurs.fr)

495 points | by ibobev 13 hours ago

97 comments

  • adamcharnock 8 hours ago

    Copy/paste from a previous thread [0]:

    We’ve done some fairly extensive testing internally recently and found that Garage is somewhat easier to deploy than our existing MinIO setup, but not as performant at high throughput. IIRC we could push about 5 gigabits of (not small) GET requests out of it, but something blocked it from reaching the 20-25 gigabits (on a 25G NIC) that MinIO could reach (also 50k STAT requests/s, over 10 nodes).

    I don’t begrudge it that. I get the impression that Garage isn’t necessarily focussed on this kind of use case.

    ---

    In addition:

    Next time we come to this we are going to look at RustFS [1], as well as Ceph/Rook [2].

    We can see we're going to have to move away from MinIO in the foreseeable future. My hope is that the alternatives get a boost of interest given the direction MinIO is now taking.

    [0]: https://news.ycombinator.com/item?id=46140342

    [1]: https://rustfs.com/

    [2]: https://rook.io/

    • nine_k 6 hours ago

      They explicitly say that top performance is not a goal: «high performances constrain a lot the design and the infrastructure; we seek performances through minimalism only» (https://garagehq.deuxfleurs.fr/documentation/design/goals/)

      But it might be interesting to see where the time is spent. I suspect they may be doing fewer things in parallel than MinIO, but maybe it's something entirely different.

    • __turbobrew__ 7 hours ago

      I wouldn’t use Rook if you solely want S3. It is a massively complex system which you really need to invest in understanding, or else your cluster will croak at some point and you will have no idea how to fix it.

    • hardwaresofton 7 hours ago

      Please also consider including SeaweedFS in the testing.

    • NL807 2 hours ago

      >I get the impression that Garage isn’t necessarily focussed on this kind of use case.

      I wouldn't be surprised if this gets fixed sometime in the future.

  • fabian2k 12 hours ago

    Looks interesting for something like local development. I don't intend to run production object storage myself, but some of the stuff in the guide to the production setup (https://garagehq.deuxfleurs.fr/documentation/cookbook/real-w...) would scare me a bit:

    > For the metadata storage, Garage does not do checksumming and integrity verification on its own, so it is better to use a robust filesystem such as BTRFS or ZFS. Users have reported that when using the LMDB database engine (the default), database files have a tendency of becoming corrupted after an unclean shutdown (e.g. a power outage), so you should take regular snapshots to be able to recover from such a situation.

    It seems like you can also use SQLite, but a default database that isn't robust against power failure or crashes seems surprising to me.

    • lxpz 9 hours ago

      If you know of an embedded key-value store that supports transactions, is fast, has good Rust bindings, and does checksumming/integrity verification by default such that it almost never corrupts upon power loss (or at least, is always able to recover to a valid state), please tell me, and we will integrate it into Garage immediately.

      • agavra 8 hours ago

        Sounds like a perfect fit for https://slatedb.io/ -- it's just that (an embedded, Rust, KV store that supports transactions).

        It's built specifically to run on object storage. It currently relies on the `object_store` crate, but we're considering OpenDAL instead, so if Garage works with those crates (I assume it does if it's S3-compatible) it should just work OOTB.

      • fabian2k 8 hours ago

        I don't really know enough about the specifics here. But my main point isn't about checksums; it's more about something like the WAL in Postgres. For an embedded KV store this is probably not the solution, but my understanding is that there are data structures like LSM trees that would provide similar robustness. But I don't actually understand this topic well enough.

        Checksumming detects corruption after it happened. A database like Postgres will simply notice it was not cleanly shut down and put the DB into a consistent state by replaying the write ahead log on startup. So that is kind of my default expectation for any DB that handles data that isn't ephemeral or easily regenerated.
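
        Roughly the idea, as a toy sketch (not Postgres's actual mechanism, just the append-then-replay shape of a WAL; names are made up):

          import json, os

          LOG = "toy.wal"
          state = {}

          def commit(key, value):
              # Append the intent to the log and make it durable first,
              # only then touch the in-memory/primary state.
              with open(LOG, "a") as f:
                  f.write(json.dumps({"k": key, "v": value}) + "\n")
                  f.flush()
                  os.fsync(f.fileno())
              state[key] = value

          def recover():
              # After a crash, replaying the log restores a consistent state;
              # replay is idempotent, so repeating it is harmless.
              if os.path.exists(LOG):
                  with open(LOG) as f:
                      for line in f:
                          try:
                              rec = json.loads(line)
                          except ValueError:
                              break  # torn final record from a crash: stop here
                          state[rec["k"]] = rec["v"]

          recover()
          commit("color", "blue")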

        But I also likely have the wrong mental model of what Garage does with the metadata, as I wouldn't have expected that to ever be limited by SQLite.

        • lxpz 7 hours ago

          So the thing is, different KV stores have different trade-offs, and for now we haven't yet found one that has the best of all worlds.

          We do recommend SQLite in our quick-start guide to set up a single-node deployment for small/moderate workloads, and it works fine. The "real world deployment" guide recommends LMDB because it gives much better performance (with the current state of Garage, not to say that this couldn't be improved), and the risk of critical data loss is mitigated by the fact that such a deployment would use multi-node replication, meaning that the data can always be recovered from another replica if one node is corrupted and no snapshot is available. Maybe this should be worded better; I can see that the alarmist wording of the deployment guide is creating quite a debate, so we probably need to make these facts clearer.

          We are also experimenting with Fjall as an alternate KV engine based on LSM trees, as it theoretically has good speed and crash resilience, which would make it the best option. We are just not recommending it by default yet, as we don't have much data to confirm that it lives up to these expectations.

      • BeefySwain 9 hours ago

        (genuinely asking) why not SQLite by default?

        • lxpz 8 hours ago

          We were not able to get good enough performance compared to LMDB. We will keep working on this, though; there are probably many ways performance can be increased by reducing load on the KV store.

          • srcreigh 7 hours ago

            Did you try WITHOUT ROWID? Your sqlite implementation[1] uses a BLOB primary key. In SQLite, this means each operation requires 2 b-tree traversals: The BLOB->rowid tree and the rowid->data tree.

            If you use WITHOUT ROWID, you traverse only the BLOB->data tree.

            Looking up lexicographically similar keys gets a huge performance boost since sqlite can scan a B-Tree node and the data is contiguous. Your current implementation is chasing pointers to random locations in a different b-tree.
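
            As a rough sketch of the difference (hypothetical table names, not Garage's actual schema):

              import sqlite3

              db = sqlite3.connect(":memory:")

              # Rowid table: the BLOB primary key gets its own index b-tree
              # (key -> rowid) while the row data lives in the rowid b-tree,
              # so a point lookup walks two trees.
              db.execute("CREATE TABLE kv_rowid (k BLOB PRIMARY KEY, v BLOB)")

              # WITHOUT ROWID: the row is stored directly in the primary-key
              # b-tree, so lookups and scans over nearby keys touch one tree.
              db.execute("CREATE TABLE kv_clustered (k BLOB PRIMARY KEY, v BLOB) WITHOUT ROWID")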

            I'm not sure whether the on-disk size would get smaller or larger. It probably depends on the key and value sizes compared to the 64-bit rowids. This is probably a well-studied question you could find the answer to.

            [1]: https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/4efc8...

          • rapnie 4 hours ago

            I learned that Turso apparently has plans to rewrite libsql [0] in Rust and create a more 'hackable' SQLite alternative altogether. It was discussed in this Developer Voices [1] video, which I haven't yet watched.

            [0] https://github.com/tursodatabase/libsql

            [1] https://www.youtube.com/watch?v=1JHOY0zqNBY

          • tensor 8 hours ago

            Keep in mind that write safety comes with performance penalties. You can turn off write protections and many databases will be super fast, but easily corrupted.

          • skrtskrt 8 hours ago

            Could you use something like Fly's Corrosion to shard and distribute the SQLite data? It uses CRDT reconciliation, which is familiar territory for Garage.

      • __turbobrew__ 7 hours ago

        RocksDB, possibly. It's used in high-throughput systems like Ceph OSDs.

      • patmorgan23 7 hours ago

        Valkey?

    • yupyupyups an hour ago

      Depending on the underlying storage being reliable is far from unique to Garage. This is what most other services do too, unless we're talking about something like Ceph, which manages the physical storage itself.

      Standard filesystems such as ext4 and XFS don't have data checksumming, so you'll have to rely on another layer to provide integrity. Regardless, that's not Garage's job IMO. It's good that they're keeping their design simple and focusing their resources on implementing the S3 spec.

    • nijave 41 minutes ago

      The assumption is that nodes are in different fault domains, so a single incident is highly unlikely to ruin the whole cluster.

      LMDB mode also runs with flushing/syncing disabled.

    • igor47 11 hours ago

      I've been using MinIO for local dev but that version is unmaintained now. However, I was put off by the minimum requirements for Garage listed on the page -- does it really need a gig of RAM?

      • dsvf 8 hours ago

        I always understood this requirement as "Garage will run fine on hardware with 1GB RAM total" - meaning the 1GB includes the RAM used by the OS and other processes. I think that most current consumer hardware that is a potential Garage host, even on the low end, has at least 1GB total RAM.

      • archon810 11 hours ago

        The latest MinIO release that works for us for local development is now almost a year old, and soon enough we will have to upgrade. Curious what others have replaced it with that is as easy to set up and has a management UI.

        • mbreese 8 hours ago

          I think that's part of the pitch here... swapping out MinIO for Garage. Both scale well beyond local development, but local dev certainly seems like a good use case here.

      • lxpz 9 hours ago

        It does not, at least not for a small local dev server. I believe RAM usage should be around 50-100MB, increasing if you have many requests with large objects.

    • moffkalast 11 hours ago

      That's not something you can do reliably in software; datacenter-grade NVMe drives come with power-loss protection and additional capacitors to handle it gracefully. Otherwise, if power is cut at the wrong moment, the partition may not even be mountable afterwards.

      If you really live somewhere with frequent outages, buy an industrial drive with a PLP rating. Or get a UPS; they tend to be cheaper.

      • crote 10 hours ago

        Isn't that the entire point of write-ahead logs, journaling file systems, and fsync in general? A roll-back or roll-forward due to a power loss causing a partial write is completely expected, but surely consumer SSDs wouldn't just completely ignore fsync and blatantly lie that the data has been persisted?

        As I understood it, the capacitors on datacenter-grade drives are there to give the drive more flexibility: they allow it to acknowledge a write for cached data, because the capacitor guarantees that even on power loss the write will still finish, so for all intents and purposes it has been persisted. An fsync can then return without having to wait on the actual flash itself, which greatly increases performance. Have I just completely misunderstood this?
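
        For reference, this is the contract application code assumes a drive honours (a minimal sketch; the filename is arbitrary):

          import os

          # Database-style commit path: the record only counts as durable once
          # fsync has returned -- which assumes the drive doesn't acknowledge
          # the flush from a volatile cache it can lose on power-off.
          with open("wal.log", "ab") as f:
              f.write(b"commit record\n")
              f.flush()              # push the userspace buffer to the kernel
              os.fsync(f.fileno())   # ask the kernel to flush to stable storage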

        • unsnap_biceps 10 hours ago

          You actually don't need capacitors for rotating media; Western Digital has a feature called "ArmorCache" that uses the rotational energy of the platters to power the drive long enough to flush the volatile cache to non-volatile storage.

          https://documents.westerndigital.com/content/dam/doc-library...

          • toomuchtodo 10 hours ago

            Very cool, like the ram air turbine that deploys on aircraft in the event of a power loss.

          • patmorgan23 7 hours ago

            God, I love engineers.

        • Aerolfos 5 hours ago

          > but surely consumer SSDs wouldn't just completely ignore fsync and blatantly lie that the data has been persisted?

          That doesn't even help if fsync() doesn't do what developers expect: https://danluu.com/fsyncgate/

          I think this was the blog post that had a bunch more stuff that can go wrong too: https://danluu.com/deconstruct-files/

          But basically, fsync itself (sometimes) has dubious behaviour, then the OS on top of the kernel handles it dubiously, and then even on top of that most databases can ignore fsync errors (and lie that the data was written properly).

          So... yes.

        • Nextgrid 10 hours ago

          > ignore fsync and blatantly lie that the data has been persisted

          Unfortunately they do: https://news.ycombinator.com/item?id=38371307

          • btown 10 hours ago

            If the drives continue to have power, but the OS has crashed, will the drives persist the data once a certain amount of time has passed? Are datacenters set up to take advantage of this?

  • SomaticPirate 12 hours ago

    Seeing a ton of adoption of this after the MinIO debacle.

    https://www.repoflow.io/blog/benchmarking-self-hosted-s3-com... was useful.

    RustFS also looks interesting but for entirely non-technical reasons we had to exclude it.

    Anyone have any advice for swapping this in for Minio?

    • chrislusf an hour ago

      Disclaimer: I work on SeaweedFS.

      Why skip SeaweedFS? It ranks #1 on all benchmarks and has a lot of features.

      • meotimdihia an hour ago

        I can confirm this. I used SeaweedFS to serve 1M daily users with 56 million images / ~100TB on 2 servers with HDDs only, which MinIO couldn't do. SeaweedFS's performance is much better than MinIO's. The only problem is that the SeaweedFS documentation is hard to understand.

      • dionian an hour ago

        Can you link the benchmarks?

    • dpedu 12 hours ago

      I have not tried either myself, but I wanted to mention that Versity S3 Gateway looks good too.

      https://github.com/versity/versitygw

      I am also curious how Ceph S3 gateway compares to all of these.

      • skrtskrt 8 hours ago

        When I was there, DigitalOcean was writing a complete replacement for the Ceph S3 gateway because its performance under high concurrency was awful.

        They just completely swapped the whole service out of the stack and wrote one in Go because of how much better the concurrency management was, and because Ceph's team and C++ codebase were too resistant to change.

        • jiqiren 5 hours ago

          Unrelated, but one of the more annoying aspects of whatever software they use now is the lack of IPv6 for the CDN layer of DigitalOcean Spaces. It means I need to proxy requests myself. :(

      • zipzad 9 hours ago

        I'd be curious to know how versitygw compares to rclone serve S3.

    • Implicated 12 hours ago

      > but for entirely non-technical reasons we had to exclude it

      Able/willing to expand on this at all? Just curious.

      • NitpickLawyer 11 hours ago

        Not the same person you asked, but my guess would be that it is seen as a Chinese product.

        • lima 11 hours ago

          RustFS appears to be very early-stage with no real distributed systems architecture: https://github.com/rustfs/rustfs/pull/884

          I'm not sure if it even has any sort of cluster consensus algorithm? I can't imagine it not eating committed writes in a multi-node deployment.

          Garage and Ceph (well, radosgw) are the only open-source S3-compatible object stores which have undergone serious durability/correctness testing. Anything else will most likely eat your data.

        • dewey 11 hours ago

          What is this based on? Honest question, as I don't get that impression from the landing page. Are many committers China-based?

          • NitpickLawyer 11 hours ago

            https://rustfs.com.cn/

            > Beijing Address: Area C, North Territory, Zhongguancun Dongsheng Science Park, No. 66 Xixiaokou Road, Haidian District, Beijing

            > Beijing ICP Registration No. 2024061305-1

    • scottydelta 10 hours ago

      From what I have seen in previous discussions here (before and since the MinIO debacle) and at work, Garage is a solid replacement.

    • klooney 10 hours ago

      Seaweed looks good in those benchmarks; I haven't heard much about it for a while.

  • thhck 11 hours ago

    BTW, https://deuxfleurs.fr/ is one of the most beautiful websites I have ever seen.

    • codethief 9 hours ago

      It's beautiful from an artistic point of view but also rather hard to read and probably not very accessible (haven't checked it, though, since I'm on my phone).

      • isoprophlex 8 hours ago

        Works perfectly on an iPhone. I can't attest to the accessibility features, but the aesthetic is absolutely wonderful. Something I love, and went for on my own portfolio/company website... this is executed 100x better though, clearly a labor of love and not 30 minutes of shitting around in vi.

  • topspin 10 hours ago

    No tags on objects.

    Garage looks really nice: I've evaluated it with test code and benchmarks and it looks like a winner. Also, very straightforward deployment (self-contained executable) and good docs.

    But no tags on objects is a pretty big gap, and I had to shelve it. If Garage folk see this: please think on this. You obviously have the talent to make a killer application, but tags are table stakes in the "cloud" API world.

    • lxpz 9 hours ago

      Thank you for your feedback, we will take it into account.

      • topspin 7 hours ago

        Great, and thank you.

        I really, really appreciate that Garage accommodates running as a single node without workarounds or special configuration that yields some kind of degraded state. Despite the single-minded focus on distributed operation you no doubt hear about endlessly (as seen in some comments here), there are, in fact, traditional use cases where someone will be attracted to Garage only for the API compatibility, and where they will achieve production availability sufficient to their needs by means other than clustering.

  • ai-christianson 12 hours ago

    I love Garage. I think it has applications beyond the standard self-hosted S3 alternative.

    It's a really cool system for hyper-converged architectures, where storage requests can pull data from the local machine and only hit the network when needed.

    • singpolyma3 5 hours ago

      I'd love to hear what configuration you are using for this

  • Powdering7082 12 hours ago

    No erasure coding seems like a pretty big loss in terms of how many resources you need to get good resiliency & efficiency.

    • munro 10 hours ago

      I was looking at using this on an LTO tape library. It seems the only resiliency is through replication, and this was my main concern with this project: what happens when hardware goes bad?

      • lxpz 9 hours ago

        If you have replication, you can lose one of the replicas; that's the point. This is what Garage was designed for, and it works.

        Erasure coding is another debate. For now we have chosen not to implement it, but I would personally be open to having it supported in Garage if someone codes it up.

        • hathawsh 7 hours ago

          Erasure coding is an interesting topic for me. I've run some calculations on the theoretical longevity of digital storage. If you assume that today's technology is close to what we'll be using for a long time, then cross-device erasure coding wins, statistically. However, if you factor in the current exponential rate of technological development, simply making lots of copies and hoping for price reductions over the next few years turns out to be a winning strategy, as long as you don't have vendor lock-in. In other words, I think you're making great choices.

          • Dylan16807 5 hours ago

            I question that math. Erasure coding needs less than half as much space as replication, and imposes pretty small costs itself. Maybe we can say the difference is irrelevant if storage prices will drop 4x over the next five years? But looking at pricing trends right now... that's not likely. Hard drives and SSDs are about the same price they were 5 years ago. The 5 years before that SSDs were seeing good advancements, but hard drive prices only advanced 2x.
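
            To put rough numbers on the space argument (an illustrative sketch; the 4+2 Reed-Solomon layout is just one common choice, not something Garage implements):

              # Raw bytes stored per logical byte, for two layouts that both
              # survive the loss of any 2 copies/shards.
              data_shards, parity_shards = 4, 2

              replication_overhead = 3.0                                  # 3-way replication
              ec_overhead = (data_shards + parity_shards) / data_shards   # 4+2 erasure coding -> 1.5

              print(f"replication: {replication_overhead}x, erasure coding: {ec_overhead}x raw storage")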

  • eduardogarza 4 hours ago

    I use this for booting up S3-compatible buckets for local development and testing -- paired with s5cmd, I can seed 15GB and over 60,000 items (seed/mock data) in < 60s... and have a perfect replica of a staging environment with Docker containers (API, DB, cache, objects) all up in less than 2 minutes. Super simple to set up for my case and it's been working great.

    Previously I used LocalStack S3 but ultimately didn't like that persistence isn't available in the OSS version. MinIO OSS is apparently no longer maintained? I also looked at SeaweedFS and RustFS, but from a quick reading into them, this one was the easiest to set up.

    • chrislusf an hour ago

      I work on SeaweedFS. So very biased. :)

      Just run "weed server -s3 -dir=..." to have an object store.

  • faizshah 11 hours ago

    One really useful use case for Garage for me has been data engineering scripts. I can just use the S3 integration that every tool has to dump to Garage, and then I can more easily scale up to the cloud later.
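
    A minimal sketch of what that looks like (endpoint, credentials, bucket and file names are placeholders for a local Garage instance; 3900 is the S3 API port used in Garage's quick-start examples, as far as I recall):

      import boto3

      # Placeholder client for a local Garage node.
      s3 = boto3.client(
          "s3",
          endpoint_url="http://localhost:3900",
          aws_access_key_id="GK_PLACEHOLDER",
          aws_secret_access_key="SECRET_PLACEHOLDER",
          region_name="garage",
      )

      # Dump a local artifact into a bucket; dropping endpoint_url (and using
      # real AWS credentials) points the same script at S3 proper later.
      s3.upload_file("events.parquet", "etl-scratch", "raw/events.parquet")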

  • yupyupyups an hour ago

    Garage is amazing! But it would be even more amazing if it had immutable object support. :)

    This is used for ransomware-resistant backups.

  • supernes 8 hours ago

    I tried it recently. Uploaded around 300 documents (1GB) and then went to delete them. Maybe my client was buggy, because the S3 service inside the container crashed and couldn't recover - I had to restart it. It's a really cool project, but I wouldn't really call it "reliable" from my experience.

  • JonChesterfield 10 hours ago

    Corrupts data on power loss according to their own docs. Like what you get outside of data centers. Not reliable then.

    • lxpz 9 hours ago

      Losing a node is a regular occurrence, and a scenario for which Garage has been designed.

      The assumption Garage makes, which is well-documented, is that of 3 replica nodes, only 1 will be in a crash-like situation at any time. With 1 crashed node, the cluster is still fully functional. With 2 crashed nodes, the cluster is unavailable until at least one additional node is recovered, but no data is lost.

      In other words, Garage makes a very precise promise to its users, which is fully respected. Database corruption upon power loss falls under the definition of a "crash state", similarly to a node just being offline due to a loss of internet connectivity. We recommend making metadata snapshots so that recovery of a crashed node is faster and simpler, but it's not strictly required: Garage can always start over from an empty database and recover the data from the remaining copies in the cluster.

      To talk about more concrete scenarios: if you have 3 replicas in 3 different physical locations, the assumption of at most one crashed node is pretty reasonable; it's quite unlikely that 2 of the 3 locations will be offline at the same time. Concerning data corruption on power loss, the probability of losing power at 3 distant sites at the exact same time with the same data in the write buffers is extremely low, so I'd say in practice it's not a problem.

      Of course, this all implies a Garage cluster running with 3-way replication, which everyone should do.

      • JonChesterfield 7 hours ago

        That is a much stronger guarantee than your documentation currently claims. One site falling over and being rebuilt without loss is great. One site losing power, corrupting the local state, then propagating that corruption to the rest of the cluster would not be fine. Different behaviours.

        • lxpz 7 hours ago

          Fair enough, we will work on making the documentation clearer.

      • jiggawatts 8 hours ago

        So if you put a 3-way cluster in the same building and they lose power together, then what? Is your data toast?

        • InitialBP 7 hours ago

          It sounds like that's a possibility, but why on earth would you take the time to set up a 3-node object storage cluster for reliability and ignore one of the key tenets of what makes it reliable?

        • lxpz 8 hours ago

          If I make certain assumptions and you respect them, I will give you certain guarantees. If you don't respect them, I won't guarantee anything. I won't guarantee that your data will be toast either.

          • Dylan16807 4 hours ago

            If you can't guarantee anything for all the nodes losing power at the same time, that's really bad.

            If it's just the write buffer at risk, that's fine. But the chance of overlapping power loss across multiple sites isn't low enough to risk all the existing data.

  • awoimbee 8 hours ago

    How is Garage for a simple local dev env? I recently used SeaweedFS since it has a super simple minimal setup compared to Garage, which seemed to require a config file just to get started.

  • k__ 3 hours ago

    Half-OT:

    Does anyone know a good open source S3 alternative that's easily extendable with custom storage backends?

    For example, AWS offers IA and Glacier in addition to the defaults.

    • onionjake 3 hours ago

      Storj supports arbitrarily configured backends, each with different erasure coding, node placement, etc.

  • apawloski 10 hours ago

    Is it the same consistency model as S3? I couldn't see anything about it in their docs.

    • lxpz 9 hours ago

      Read-after-write consistency: yes (after PutObject has finished, the object will be immediately visible in all subsequent requests, including GetObject and ListObjects).

      Conditional writes: no, we can't do it with CRDTs, which are the core of Garage's design.
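
      To make concrete what this rules out, here is the kind of create-if-absent request in question (a sketch; endpoint, credentials, bucket and key are placeholders, and the IfNoneMatch parameter needs a reasonably recent boto3):

        import boto3
        from botocore.exceptions import ClientError

        # Placeholder client for a local Garage endpoint.
        s3 = boto3.client(
            "s3",
            endpoint_url="http://localhost:3900",
            aws_access_key_id="GK_PLACEHOLDER",
            aws_secret_access_key="SECRET_PLACEHOLDER",
            region_name="garage",
        )

        # Compare-and-swap style upload: only succeed if the key does not
        # already exist. AWS S3 honours this; Garage does not currently
        # support conditional writes.
        try:
            s3.put_object(Bucket="demo", Key="lock", Body=b"v1", IfNoneMatch="*")
        except ClientError as e:
            print("conditional PUT refused:", e.response["Error"].get("Code"))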

      • skrtskrt 8 hours ago

        Does RAMP or CURE offer any possibility of conditional writes with CRDTs? I have had these papers on my list to read for months, specifically wondering if they could be applied to Garage.

        https://dd.thekkedam.org/assets/documents/publications/Repor...

        http://www.bailis.org/papers/ramp-sigmod2014.pdf

        • lxpz 7 hours ago

          I had a very quick look at these two papers; it looks like neither of them allows the implementation of compare-and-swap, which is required for if-match / if-none-match support. They have a weaker definition of a "transaction", which is to be expected, as they implement causal consistency at best and not consensus, whereas consensus is required for compare-and-swap.

          • skrtskrt 4 hours ago

            ack - makes sense, thank you for looking!

  • wyattjoh 11 hours ago

    Wasn't expecting to see it hosted on Forgejo. Kind of a breath of fresh air, to be honest.

  • agwa 11 hours ago

    Does this support conditional PUT (If-Match / If-None-Match)?

  • allanrbo 9 hours ago

    I use Syncthing a lot. Is Garage only really useful if you specifically want to expose an S3 drop-in compatible API, or does it also provide other benefits over Syncthing?

    • lxpz 8 hours ago

      They are not solving the same problem.

      Syncthing will synchronize a full folder between an arbitrary number of machines, but you still have to access this folder one way or another.

      Garage provides an HTTP API for your data, and handles internally the placement of this data among a set of possible replica nodes. But the data is not in the form of files on disk like the ones you upload to the API.

      Syncthing is good for, e.g., synchronizing your documents or music collection between computers. Garage is good as a storage service for back-ups with e.g. Restic, for media files stored by a web application, for serving personal (static) web sites to the Internet. Of course, you can always run something like Nextcloud in front of Garage and get folder synchronization between computers somewhat like what you would get with Syncthing.

      But to answer your question, yes, Garage specifically provides only an S3-compatible API.

    • sippeangelo 8 hours ago

      You use Syncthing for object storage?

  • Eikon 11 hours ago

    Unfortunately, this doesn’t support conditional writes through if-match and if-none-match [0] and thus is not compatible with ZeroFS [1].

    [0] https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/1052

    [1] https://github.com/Barre/ZeroFS

    • chrislusf 2 hours ago

      I work on SeaweedFS. It supports these conditional headers, and a lot more.

  • ekjhgkejhgk 9 hours ago

    Anybody understand how this compares with Vast?

  • doctorpangloss 12 hours ago