Distributed Erlang

(vereis.com)

191 points | by todsacerdoti a day ago

59 comments

  • toast0 a day ago

    > This can lead to scalability issues in large clusters, as the number of connections that each node needs to maintain grows quadratically with the number of nodes in the cluster.

    No, the total number of dist connections grows quadratically with the number of nodes, but the number of dist connections each node makes grows linearly.

    > Not only that, in order to keep the cluster connected, each node periodically sends heartbeat messages to every other node in the cluster.

    IIRC, heartbeats are once every 30 seconds by default.
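
    (Going from memory on the exact knob: the interval is derived from the kernel net_ticktime setting, which defaults to 60 seconds, with ticks going out roughly every TickTime/4 on otherwise idle connections. A rough sketch of poking at it from a dist-connected shell, output illustrative:)

        %% inspect / adjust the dist tick time at runtime
        1> net_kernel:get_net_ticktime().
        60
        2> net_kernel:set_net_ticktime(120).
        change_initiated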

    > This can lead to a lot of network traffic in large clusters, which can put a strain on the network.

    Let's say I'm right about 30 seconds between heartbeats, and you've got 1000 nodes. Every 30 seconds each node sends out 999 heartbeats (which almost certainly fit in a single tcp packet each, maybe less if they're piggybacking on real data exchanges). That's 999,000 packets every second, or 33k pps across your whole cluster. For reference, GigE line rate with full 1500 mtu packets is 80k pps. If you actually have 1000 nodes worth of work, the heartbeats are not at all a big deal.

    > Historically, a "large" cluster in Erlang was considered to be around 50-100 nodes. This may have changed in recent years, but it's still something to be aware of when designing distributed Erlang systems.

    I don't have recent numbers, but Rick Reed's presentation at Erlang Factory in 2014 shows a dist cluster with 400 nodes. I'm pretty sure I saw 1000+ node clusters too. I left WhatsApp in 2019, and any public presentations from WA are less about raw scale, because it's passe.

    Really, 1000 dist connections is nothing when you're managing 500k client connections. Dist connections weren't even a big deal when we went to smaller nodes in FB.

    It's good to have a solid backend network, and to try to bias towards fewer, larger nodes rather than more, smaller nodes. If you want to play with large scale dist and spin up 1000 low-cpu, low-memory VMs, you might have some trouble. It makes sense to start with small nodes and whatever number makes you comfortable for availability, and then when you run into limits, reach for bigger nodes until you get to the point where adding nodes is more cost effective: WA ran dual xeon 2690 servers before the move to FB infra; Facebook had better economics with smaller single Xeon D nodes; I dunno what makes sense today, maybe a single socket Epyc?

    • sausagefeet a day ago

      > That's 999,000 packets every second, or 33k pps across your whole cluster. For reference, GigE line rate with full 1500 mtu packets is 80k pps. If you actually have 1000 nodes worth of work, the heartbeats are not at all a big deal.

      Using up almost half of your pps every 30 seconds for cluster maintenance certainly seems like it's more than "not a big deal", no?

      • toast0 17 hours ago

        (Oops, I meant to say 999k packets every 30 seconds. Thanks everyone for running with the pps number)
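
        (Spelling out the corrected arithmetic in a shell, just so the 33k figure is traceable:)

            %% 1000 nodes, each ticking its 999 peers once per 30 second window
            1> (1000 * 999) / 30.
            33300.0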

        If your switching fabric can only deal with 1Gbps, yes, you've used it halfway up with heartbeats. But if your network is 1x 48 port 1G switch and 44x 24 port 1G switches, you won't bottleneck on heartbeats, because that spine switch should be able to simultaneously send and receive at line rate on all ports, which is plenty of bandwidth. You might well bottleneck on other transmissions, but the nice thing about dist heartbeats is that on a connection, each node is sending heartbeats on a timer and will close the connection if it doesn't see a heartbeat in some timeframe; it's a requirement for progress, not a requirement for a timely response, so you can end up with epic round trip times for net_adm:ping ... I've seen on the order of an hour once over a long distance dist connection with an unexpected bandwidth constraint.

        It would probably be a lot more comfortable if your spine switch was 10g and your node switches had a 10g uplink, and you may want to consider LACP and double up all the connections. You might also want to consider other topologies, but this is just an illustration.

      • davisp 19 hours ago

        > If you actually have 1000 nodes worth of work, the heartbeats are not at all a big deal.

        I think you’re missing the fact that the heartbeats will be combined with existing packets. Hence the quoted bit. If you’ve got 1000 nodes, they should be doing something with that network such that an extra 50 bytes (or so) every 30s would not be an issue.

        • gpderetta 18 hours ago

          They would be combined if each node was sending messages each second to every other node. Is that realistic?

          • davisp 10 hours ago

            That’ll just depend on whatever code was deployed to the cluster. For the clusters I used to operate, the answer would be absolutely all nodes talk to all nodes all the time.

            I personally never operated anything above roughly 250 nodes, but that limit was mostly due to following the OP’s advice about paying attention to the configuration of each node in the cluster. In my case, fewer nodes with fancier and larger raid arrays ended up being a better scaling strategy.

      • desdenova a day ago

        If you're at the point where you decided the 1000 nodes are required, you should probably have already considered the sensible alternatives first, and concluded this was somehow better.

        The network saturation is just a necessary cost of running such a massive cluster.

        I really have no idea what kind of system would require 1000 nodes, that couldn't be replaced by 100, 10x larger, nodes instead. And at that point, you should probably be thinking of ways to scale the network itself as well.

        • sausagefeet 20 hours ago

          I don't really understand what this comment is responding to. The comment I responded to hand waved away consuming almost 50% of your pps on heartbeats every 30 seconds as "no big deal".

          • ricketycricket 17 hours ago

            > The comment I responded to hand waved away consuming almost 50% of your pps on heartbeats every 30 seconds as "no big deal".

            > The network saturation is just a necessary cost of running such a massive cluster.

            I think this actually answers it perfectly.

            1. If you are running 1K distributed nodes, you have to understand that means you have some overhead for running such a large cluster. No one is hand waving this away, it's just being acknowledged that this level of complexity has a cost.

            2. If heartbeats are almost 50% of your pps, you are trying to use 1Gbe to run a 1K-node cluster. No one would do this in production and no one is claiming you should.

            3. If your system can tolerate it, change the heartbeat interval to whatever you want (see the config sketch at the end of this comment).

            4. Don't use distributed Erlang if you don't have to. Erlang/Elixir/Gleam work perfectly fine for non-distributed workloads as do most languages that can't distribute in the first place. But if you do need a distributed system, you are unlikely to find a better way to do it than the BEAM.

            Basically, it seems you are taking issue with something that 1) is that way because that's how things work, and 2) is not how anyone would actually use it.
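
            (Re point 3, a minimal sketch of where that knob lives, assuming you set it statically via sys.config; net_ticktime is the kernel parameter behind dist heartbeats, and every node in the cluster should agree on it:)

                %% sys.config -- raise the dist tick window cluster-wide
                [
                  {kernel, [
                    {net_ticktime, 120}   %% seconds; the default is 60
                  ]}
                ].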

        • toast0 15 hours ago

          > I really have no idea what kind of system would require 1000 nodes, that couldn't be replaced by 100, 10x larger, nodes instead.

          Nodes only get so big. Way back when, a quad xeon 4650 v2 was definitely not 2x the throughput of a dual xeon 2690 v2; so you end up with two dual socket systems instead of one quad socket. A quad socket server often costs significantly more than two dual socket servers, and is likely to take longer between order and delivery. There's usually a point where you can still scale up, but scaling out is a better use of resources.

        • throwawaymaths 16 hours ago

          IIRC from a preso by them, WA runs an eye-popping number of nodes in distribution and they are indeed fully connected, and it's fine (though they do use a modified BEAM)

          • anonymousDan 14 hours ago

            Interesting, do you have a link by any chance?

            • throwawaymaths 12 hours ago

              I remember seeing this when it hit the internet:

              https://www.youtube.com/watch?v=A5bLRH-PoMY

              40,000 erlang nodes in a cluster

              lots of specifics about what tweaks they use. Rewatching, it seems like you don't have to really use the modified BEAM except in a few small soft-code places (replacing OTP/stdlib functionality) where they provide enough information that you could probably write it yourself if you had to, and a lot of the optimizations have been upstreamed into core BEAM -- WA is a pretty good citizen of the ecosystem.

            • toast0 13 hours ago

              From 2014 there's a video link and slides here: https://www.erlang-factory.com/sfbay2014/rick-reed

              FYI: I've seen comments from after I left that wandist isn't used anymore; I think a lot of what we gained with that was working around issues in pg2 that stem from global locks not scaling... but the new pg doesn't need global locks at all. There were also some things in wandist to work around difficulties communicating between SoftLayer nodes and Facebook nodes, but that was a transitory need. See the 2024 presentation, 40k nodes!
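
              (For anyone who hasn't looked since pg2 went away, the replacement pg API is roughly this; the group name and pids are just illustrative:)

                  %% OTP 23+ pg: process groups without global locks
                  1> pg:start_link().                 %% starts the default scope
                  {ok,<0.90.0>}
                  2> pg:join(chat_servers, self()).
                  ok
                  3> pg:get_members(chat_servers).
                  [<0.88.0>]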

              Fairly similar, but smaller numbers in 2012 http://www.erlang-factory.com/conference/SFBay2012/speakers/...

              The 2013 presentation is focused on MMS which I don't remember if it was as impressive: http://www.erlang-factory.com/conference/SFBay2013/speakers/... (note that server side transcoding is from before end to end encryption)

              I don't think there were similar presentations on Erlang in the large at WhatsApp after that. Big changes between 2014 and 2019 (when I left) were

              a) chat servers started doing a lot more, and clients per server went down on the big SoftLayer boxes

              b) hosting moved from SoftLayer to Facebook and much smaller nodes --- also chat servers at SoftLayer were individually addressable, using (augmented) round robin DNS to select, at Facebook the chat servers did not have public addresses, instead everything comes in through load balancers

              c) MMS was pretty much offloaded into a Facebook storage service (c++); not because the Erlang wasn't sufficient, but because MMS was loosely coupled with the rest of the service, Facebook had a nice enough storage service, a lot of storage, an awful lot of bandwidth, and it wasn't a lot of work for that team to also support WhatsApp's needs; also our Erlang MMS (and the PHP version before it) was built around storing files on specific, addressable nodes, but nodes at Facebook are much more ephemeral and not easy to directly address by clients

              d) some amount of data storage moved off of mnesia into other Facebook data storage technology; again, not because mnesia wasn't sufficient, but more ephemeral nodes makes it cumbersome (addressable) and the available hardware nodes at FB didn't really match --- there's a very firm bias at FB towards using standard node sizes and the available standard nodes were like a web machine with not much ram or a big database machine with more ram and tons of fast storage; WA mnesia wants lots of ram but doesn't need a lot of storage (all data is in ram, and dumped + logged to disk) so there was a mismatch there --- things that stayed in mnesia needed much larger clusters to manage data size

              Presentations became less common because of more layers to get approval, and also because it's less fun to share how we built something on top of proprietary layers that others don't really have access to. Anybody could have gotten dual 2690 servers at SoftLayer and run a nice Erlang cluster. Only a few people could run an even bigger chat cluster in a Facebook like hosting environment.

      • sgarland 19 hours ago

        If you’re running 1Gbe in prod at any kind of scale, something has gone horribly awry.

      • anoother 19 hours ago

        On a gigabit connection. It's hard to imagine clusters running on anything below 10Gb as an absolute minimum.

      • immibis 19 hours ago

        That's for 1000 nodes all sharing a single 1Gbps connection and sending 1500-byte heartbeats for some reason. Make them 150 bytes (more realistic) and now it's almost half the throughput of the 100Mbps router that for some reason you are sending every single packet through, for 1000 nodes.

        • gatnoodle 19 hours ago

          I see, it makes more sense now.

      • crabbone 18 hours ago

        I doubt these are even sent if there's actual traffic going through the network. I mean, why do you need to ping nodes, if they are already talking to you?

      • fwip 15 hours ago

        Wouldn't you have a lot more than one gigabit connection? I'm struggling to imagine the 1000-node cluster layout that sends all intra-cluster traffic through a single cable.

      • worthless-trash 20 hours ago

        It is possible to have a heartbeat network separate from the data network if so required.

    • vereis 20 hours ago

      ahoy!! thanks for spotting the issue! I wrote the post in a stream of consciousness after a long day! I'll make that edit and call it out!

      the statement about what historically constitutes a large erlang cluster was an anecdote told to me by Francesco Cesarini during lunch a few years ago — I'm not actually sure of the time frame (or my memory!)

      likewise I'll update the post to reflect that! thanks ( ◜ ‿ ◝ ) ♡

  • pstoll 7 hours ago

    Ah the "erlang = distributed" thing.

    The primitives of sending messages across a "cluster" are built into the language, yep. And lightweight processes with per-process garbage collection are magic for minimizing global gc pauses.

    But all the hard work to make a distributed system work is not "batteries included" for Erlang. Back pressure on messages, etc., you end up having to build yourself.

    We hit a limit at around 200 physical servers in a cluster back in 2015. Maybe it's gotten better since then. shrug.

    As the author calls out, it was built for the 80s, when a million dollar switch shouldn't fail. It isn't built with https://en.wikipedia.org/wiki/STONITH in mind, i.e. any node is trash: assume it will go away and maybe new ones will come back.

    Rock on, Erlang dude, rock on.

  • qart a day ago

    What's with the font? It seems to be flickering. Especially the headings. Is there a way to turn it off?

    • dan353hehe a day ago

      I think it’s meant to replicate a CRT display? I had a real hard time with those headings buzzing slightly.

      • rbanffy 14 hours ago

        Speaking as someone who spent a lot of time in front of CRTs, this is NOT what an average CRT looks like. This would be reason to send it to maintenance back then - it looks like breaking contacts or problems with the analog board.

        A good CRT would show lines, but they'd be steady and not flicker (unless you self-inflict an interlaced display on a short phosphor screen). It might also show some color convergence issues, but that, again, is adjustable (or you'd send it to maintenance).

        This looks like the kind of TV you wouldn't plug a VIC-20 into.

      • emmanueloga_ a day ago

        Uses @keyframes and text-shadow to try and mimic a CRT effect but makes the text unreadable (for me at least). The browser readability mode does work on the page though.

        • dkersten a day ago

          Makes it unreadable for me too.

      • johnisgood a day ago

        I think so too, and it works on my old LCD monitor. If I focus on it, I can see the subtle changes, but other than that, it does not make it any less readable for me.

    • amw 13 hours ago

      OP, if you're reading this, the animation pegs my CPU. Relying on your readers to engage reader mode is going to turn away a lot of people who might otherwise enjoy your content.

      • SrslyJosh 13 hours ago

        Unfortunately, scrolling is choppy for me even with reader mode turned on. Great article, though.

    • leoff a day ago

      type this in your console `document.body.style.cssText = 'animation: none !important'`

    • dtquad 17 hours ago

      I think it looks cool. Of course there should be a reader/accessibility mode but otherwise we need more creativity on the web.

      • MisterTea 14 hours ago

        Maybe you need creativity, but I just need information, and presentation isn't nearly as important as legibility.

    • funkydata a day ago

      If one of your browsers has this feature, just toggle reader view.

    • penguin_booze 14 hours ago

      On Firefox, Ctrl+Alt+R - also known as Reader View.

    • desdenova a day ago

      If a person considers putting this type of effect on text reasonable, I really don't think they have anything of value to say in the text content anyways.

  • openrisk a day ago

    Quite readable even for those not familiar with Erlang.

    Is there a list of projects that shows distributed erlang in action (taking advantage of its strengths and avoiding the pitfalls)?

    • vereis 20 hours ago

      I recommend taking a look at the various open source Riak applications too! Might not be updated to any sort of recent version of erlang but was a great resource to me early on.

    • zaik a day ago

      ejabberd and RabbitMQ are written in Erlang

      • brabel 15 hours ago

        And CouchDB

  • anacrolix 17 hours ago

    awful font.

    distributed Erlang is like saying ATM machine

    • vereis 15 hours ago

      honest q: how do you distinguish between single node erlang applications vs clustered erlang applications?

      • toast0 15 hours ago

        This returns true if you're on a single node erlang application :P

            erlang:node() == nonode@nohost andalso erlang:nodes(known) == [nonode@nohost]
        
        
        A single node erlang application would be one that doesn't use dist at all. Although, if it includes anything like gen_event or similar, and it happens to be dist connected, unless it specifically checks, it will happily reply to remote Erlang processes.

    • hsavit1 17 hours ago

      yeah, isolated + distributed processes is THE THING about Erlang, OTP

      • vereis 15 hours ago

        my personal reality is that the majority of projects I've consulted on have seldom actually leveraged distributed erlang for anything. the concurrency part yes, clustering for the sake of availability or spreading load yes, but actually doing anything more complex than that has been the exception! ymmv tho!

      • throwawaymaths 16 hours ago

        I mean you can certainly run one-off scripts in elixir and Erlang.

  • behnamoh a day ago

    fn/n where n is parity of fn was confusing for me at first, but then I started to like the idea, esp. in dynamic languages it's helpful to have at least some sense of how things work by just looking at them.

    this made me think about a universal function definition syntax that captures everything except implementation details of the function. something like:

        fn (a: int, b: float, c: T) {foo: callable, bar: callable, d: int} -> maybe(bool | [str] | [T])
    
    
    which shows that the function fn receives a (integer), b (float), c (generic type T) and maybe returns something that's either a boolean or a list of strings or a list of T, or it throws and doesn't return anything. the function fn also makes use of foo and bar (other functions) and variable d which are not given to it as arguments but they exist in the "context" of fn.

    • wbadart 16 hours ago

      As I understand it, Erlang inherited its arity notation from Prolog (which early versions of Erlang were implemented in).

      https://en.m.wikipedia.org/wiki/Erlang_(programming_language...

    • elcritch a day ago

      Well Erlang and Elixir both have a type checker, dialyzer, which has a type specification syntax not too far from what you proposed. Well, excluding the variable or function captures.

      Elixir’s syntax for declaring type-specs can be found at https://hexdocs.pm/elixir/1.12/typespecs.html
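
      For a rough flavor in Erlang terms (a hypothetical fn/3, leaving out the captured-context part of the proposal), a dialyzer spec might look like:

          %% takes an integer, a float, and any T; returns a bool,
          %% a list of strings, or a list of T (throws aren't expressed)
          -spec fn(integer(), float(), T) -> boolean() | [string()] | [T].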

      • the_duke a day ago

        Elixir is also working on a proper type system with compile time + runtime type checks.

    • derefr 15 hours ago

      > fn/n where n is parity of fn was confusing for me at first, but then I started to like the idea, esp. in dynamic languages it's helpful to have at least some sense of how things work by just looking at them.

      To be clear, this syntax is not just "helpful", it's semantically necessary given Erlang's compilation model.

      In Erlang, foo/1 and foo/2 are separate functions; foo/2 could be just as well named bar/1 and it wouldn't make a difference to the compiler.

      But on the other hand, `foo(X) when is_integer(X)` and `foo(X) when is_binary(X)` aren't separate functions — they're two clause-heads of the same function foo/1, and they get unified into a single function body during compilation.

      So there's no way in Erlang for a (runtime) variable [i.e. a BEAM VM register] to hold a handle/reference to "the `foo(X) when is_integer(X)` part of foo/1" — as that isn't a thing that exists any more at runtime.

      When you interrogate an Erlang module at runtime in the erl REPL for the functions it exports (`Module:module_info(exports)`), it gives you a list of {FunctionName, Arity} tuples — because those are the real "names" of the functions in the module; you need both the FunctionName and the Arity to uniquely reference/call the function. But you don't need any kind of type information; all the type information is lost from the "structure" of the function, becoming just validation logic inside the function body.
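
      A quick sketch of the above (hypothetical module, shell output approximate):

          %% foo.erl -- two clause heads of foo/1, plus a separate function foo/2
          -module(foo).
          -export([foo/1, foo/2]).
          foo(X) when is_integer(X) -> {int, X};
          foo(X) when is_binary(X)  -> {bin, X}.
          foo(X, Y) -> {X, Y}.

      Compiling it and asking for the exports gives back only {Name, Arity} pairs; the two clause heads of foo/1 have collapsed into a single entry:

          1> c(foo).
          {ok,foo}
          2> foo:module_info(exports).
          [{foo,1},{foo,2},{module_info,0},{module_info,1}]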

      ---

      In theory, you could have compile-time type safety for function handles, with type erasure at runtime ala Java's runtime erasure of compile-time type parameters. I feel like Erlang is not the type of language to ever bother to add support for this — as it doesn't even accept or return function handles from most system functions, instead accepting or returning {FunctionName, Arity} tuples — but in theory it could.

      • toast0 14 hours ago

        > But Erlang itself, as a language, doesn't even have function-reference literals; you usually just pass {FunctionName, Arity} tuples around.

        I don't really understand what a literal is, but isn't fun erlang:node/0 a literal? It operates the same way as 1, which I'm quite sure is a numeric literal:

            Eshell V12.3.2.17  (abort with ^G)
            1> F = fun erlang:node/0.
            fun erlang:node/0
            2> X = 1.
            1
            3> F.
            fun erlang:node/0
            4> X.
            1
        
        I set the variable to the thing, and then when I output the variable, I see the same value I set. AFAIK, 1 and fun erlang:node/0 must both be literals, as they behave the same; but I'm happy to learn otherwise.

        • derefr 14 hours ago

          You're right! Edited. (I think I never wrote Erlang for long enough before switching to Elixir to find this syntax. Also, I don't think it's ever come up in any Erlang code I've read. It's kind of obscure!)

    • arlort a day ago

      minor nitpick, I think you meant arity. If it's a typo, sorry for the nitpick; if you had only ever heard it spoken out loud before: arity is the number of parameters (in the context of functions), parity is whether a number is even or odd

      • johnisgood a day ago

        The distinction is important, I would not consider it a minor nitpick. :P

    • itishappy 19 hours ago

      Looks similar to how languages with effect systems work. (Or appear to work, I haven't actually sat down with one yet.) They don't capture the exact functions, rather capturing the capabilities of said functions, which are the important bit anyway. The distinction between `print` vs `printline` doesn't matter too much, the fact you're doing `IO` does. It seems to be largely new languages trying this, but I believe there are a few add-ons for languages such as Haskell as well.

      https://koka-lang.github.io/koka/doc/index.html

      https://www.unison-lang.org/docs/fundamentals/abilities/

      https://dev.epicgames.com/documentation/en-us/uefn/verse-lan...

    • jlarocco a day ago

      Meh, if I need that much detail I'll just read the function.

      And a function returning one of three types sounds like a TERRIBLE idea.

      • c0balt a day ago

        > And a function returning one of three types sounds like a TERRIBLE idea.

        Should have been an enum :)