When etcd crashes, check your disks first

(nubificus.co.uk)

11 points | by _ananos_ 4 hours ago

3 comments

  • kg an hour ago

    > etcd is a strongly consistent, distributed key-value store, and that consistency comes at a cost: it is extraordinarily sensitive to I/O latency. etcd uses a write-ahead log and relies on fsync calls completing within tight time windows. When storage is slow, even intermittently, etcd starts missing its internal heartbeat and election deadlines. Leader elections fail. The cluster loses quorum. Pods that depend on the API server start dying.

    This seems REALLY bad for reliability? I guess the idea is that it's better to have things not respond to requests than to lose data, but the outcome described in the article is pretty nasty.

    It seems like the solution they arrived at was to "fix" this at the filesystem level by making fsync no longer deliver durability, which seems like a pretty clumsy solution. I'm surprised they didn't find some way to make etcd more tolerant of slow storage. I'd be wary of turning off filesystem-level reliability: if I later ran postgres or something on the same system, I could hit data loss, when all I wanted was for kubernetes or whatever to stop falling over.
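The mechanism the quoted passage describes, each WAL append followed by an fsync that must complete within etcd's heartbeat/election windows, can be probed directly. The following is a minimal sketch (not etcd's actual code; the file path, write count, and 4 KiB write size are illustrative assumptions) that times fsync the way etcd's WAL depends on it:

```python
import os
import time

def fsync_latencies(path="wal-probe.bin", writes=50, size=4096):
    """Append small blocks and time each fsync, mimicking a WAL workload."""
    buf = os.urandom(size)
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for _ in range(writes):
            os.write(fd, buf)
            start = time.perf_counter()
            os.fsync(fd)  # the durability barrier etcd issues per WAL append
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies

lat = sorted(fsync_latencies())
print(f"p99 fsync: {lat[int(0.99 * len(lat))] * 1000:.2f} ms")
```

If the slowest of these calls approaches etcd's heartbeat interval (100 ms by default), the symptoms in the article, missed heartbeats and spurious leader elections, become plausible on that disk.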

    • denysvitali an hour ago

      Yes, wouldn't their fix likely make etcd not consistent anymore since there's no guarantee that the data was persisted on disk?

      • justincormack 37 minutes ago

        Yes, they totally missed the point of the fsync...