It's Time to Replace TCP in the Datacenter

(arxiv.org)

96 points | by ilove_banh_mi 9 hours ago

58 comments

  • UltraSane 5 hours ago

    I wonder why Fibre Channel isn't used as a replacement for TCP in the datacenter. It is a very robust L3 protocol. It was designed to connect block storage devices to servers while making the OS think they are directly connected. OSs do NOT tolerate dropped data when reading and writing to block devices, so Fibre Channel has an extremely robust credit-based flow control scheme (buffer-to-buffer credits). It prevents congestion by allowing receivers to control how much data senders can send. I have worked with a lot of VMware clusters that use FC to connect servers to storage arrays and it has ALWAYS worked perfectly.
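
    A minimal sketch of receiver-controlled flow control of the kind described above, assuming a simple credit counter. The class and method names are hypothetical and this is a toy model, not Fibre Channel's actual implementation: the sender may transmit only while it holds credits, and the receiver hands a credit back each time it frees a buffer.

```python
from collections import deque


class CreditedLink:
    """Toy model of receiver-granted credits (all names are illustrative)."""

    def __init__(self, receiver_buffers):
        self.credits = receiver_buffers   # sender's view of free receive buffers
        self.in_flight = deque()          # frames currently occupying receiver buffers

    def send(self, frame):
        """Send only if a credit is available; refuse instead of dropping."""
        if self.credits == 0:
            return False                  # sender must wait, nothing is lost
        self.credits -= 1
        self.in_flight.append(frame)
        return True

    def receiver_drain(self):
        """Receiver consumes one frame and returns a credit to the sender."""
        frame = self.in_flight.popleft()
        self.credits += 1
        return frame


link = CreditedLink(receiver_buffers=2)
results = [link.send(f"frame-{i}") for i in range(3)]  # third send is refused
drained = link.receiver_drain()                        # frees one buffer
retry = link.send("frame-2")                           # now accepted
```

    The point of the design is that back-pressure replaces loss: a sender without credits stalls rather than overrunning the receiver.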

    • Sebb767 4 hours ago

      > I wonder why Fibre Channel isn't used as a replacement for TCP in the datacenter

      But it is often used for block storage in datacenters. Using it for anything else is going to be hard, as it is incompatible with TCP.

      The problem with not using TCP is the same thing Homa will face: everything already speaks TCP, nearly all potential hires know TCP, and most problems you have with TCP have already been solved by smart engineers. Hardware is also easily available. Once you drop all those advantages, either your scale or your gains need to be massive to make that investment worth it, which is why TCP replacements are so rare outside of FAANG.

      • ksec an hour ago

        I wonder if there is any work on making something conceptually similar to TCP, a superset or subset of TCP, while offering 50-80% of the benefits of Homa.

        I guess I am old. Every time I see new tech that wants to be hyped, that completely throws out everything that is widely supported and working for 80-90% of use cases, that is not battle-tested and may be conceptually complex, I simply pass.

        • Sebb767 2 minutes ago

          If you have a sufficiently stable network and/or known failure cases, you can already tune TCP quite a bit with nodelay, large congestion windows, etc. There's also QUIC, which is basically a modern implementation of TCP on top of UDP (with some trade-offs chosen with HTTP in mind). Once you stray too far, though, you'll lose the ability to use off-the-shelf hardware, at which point you'll quickly hit the point of diminishing returns, especially when simply upgrading the speed of the network hardware is usually a cheap alternative.
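
          As a rough illustration of the per-socket side of that tuning, here is a sketch that disables Nagle's algorithm and requests larger buffers. The helper name and the 4 MiB buffer sizes are assumptions for illustration; the initial congestion window is not a socket option and is typically set elsewhere (for example per route with `initcwnd` on Linux), so it is not shown.

```python
import socket


def make_tuned_socket(sndbuf=4 * 1024 * 1024, rcvbuf=4 * 1024 * 1024):
    """Create a TCP socket with Nagle disabled and larger buffers requested."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY: send small writes immediately instead of coalescing them.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # Buffer sizes are requests; the kernel may clamp or adjust them.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, sndbuf)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
    return s


sock = make_tuned_socket()
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
sock.close()
```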

    • slt2021 an hour ago

      my take is that within-datacenter traffic is best served by Ethernet.

      Anything on top of Ethernet, and we no longer know where this host is located (because of software defined networking). Could be next rack server, or could be something in the cloud, could be third party service.

      And that's a feature, not a bug: because everything speaks TCP: we can arbitrarily cut and slice network just by changing packet forwarding rules. We can partition network however we want.

      We could have a single global IP space shared by cloud, dc, campus networks, or could have Christmas Tree of NATs.

      as soon as you introduce something other than TCP to the mix, now you will have gateways: chokepoints where traffic will have to be translated TCP<->Homa and I don't want to be a person troubleshooting a problem at the intersection of TCP and Homa.

      in my opinion, the lowest level, Ethernet, should try its best to mirror the actual physical signal flow. Anything on top becomes a software-defined network.

    • YZF 3 hours ago

      Are you suggesting some protocol layer of Fibre Channel to be used over IP over Ethernet?

      TCP (in practice) runs on top of (mostly) routed IP networks and network architectures. E.g. a spine/leaf network with BGP. Fibre Channel as I understand it is mostly used in more or less point to point connections? I do see some mention of "Switched Fabric" but is that very common?

      • UltraSane 38 minutes ago

        Fibre Channel is a routed L3 protocol that can support loop-free multi-path topologies.

    • wejick 2 hours ago

      I'm imagining having shared memory mounted as block storage and then doing the RPC through this block device. Some synchronization and polling/notification work would need to be done.

  • pif an hour ago

    > Although Homa is not API-compatible with TCP,

    IPv6 anyone? People must start to understand that "Because this is the way it is" is a valid, actually extremely valid, answer to any question like "Why don't we just switch technology A with technology B?"

    Despite all the shortcomings of the old technology, and the advantages of the new one, inertia _is_ a factor, and you must accept that most users will simply even refuse to acknowledge the problem you want to describe.

    For your solution to get any traction, it must deliver value right now, in the current ecosystem. Otherwise, it's doomed to fail by being ignored over and over.

  • rwmj 16 minutes ago

    On a related topic, has anyone had luck deploying TCP fastopen in a data center? Did it make any difference?

    In theory for shortlived TCP connections, fastopen ought to be a win. It's very easy to implement in Linux (just a couple of lines of code in each client & server, and a sysctl knob). And the main concern about fastopen is middleboxes, but in a data center you can control what middleboxes are used.

    In practice I found in my testing that it caused strange issues, especially where one side was using older Linux kernels. The issues included not being able to make a TCP connection, and hangs. And when I got it working and benchmarked it, I didn't notice any performance difference at all.
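
    For reference, the "couple of lines" look roughly like the sketch below on Linux. It assumes Python's `socket.TCP_FASTOPEN` and `socket.MSG_FASTOPEN` constants are available and that the `net.ipv4.tcp_fastopen` sysctl permits both client and server use; the helper names are made up, and the code falls back to a normal connect whenever the option is refused, so it demonstrates the API rather than proving that Fast Open was actually used.

```python
import socket


def make_listener(host="127.0.0.1", queue_len=16):
    """Create a listening socket and try to enable TCP Fast Open on it."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.settimeout(5)
    srv.bind((host, 0))                      # port 0: let the OS pick a free port
    srv.listen(queue_len)
    tfo_enabled = False
    opt = getattr(socket, "TCP_FASTOPEN", None)  # absent on some platforms
    if opt is not None:
        try:
            srv.setsockopt(socket.IPPROTO_TCP, opt, queue_len)
            tfo_enabled = True
        except OSError:
            pass                             # kernel or sandbox refused the option
    return srv, tfo_enabled


def send_request(addr, payload):
    """Send payload with Fast Open if possible, else with a normal connect."""
    flag = getattr(socket, "MSG_FASTOPEN", None)
    if flag is not None:
        cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            cli.sendto(payload, flag, addr)  # connect and send in a single call
            return cli
        except OSError:
            cli.close()                      # fall through to the classic path
    cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    cli.connect(addr)
    cli.sendall(payload)
    return cli


server, tfo_enabled = make_listener()
client = send_request(server.getsockname(), b"ping")
conn, _ = server.accept()
conn.settimeout(5)
received = conn.recv(4)
for s in (conn, client, server):
    s.close()
```

    Note that the first connection to a given server normally only obtains a cookie; data in the SYN is only possible on later connections, which may be one reason a benchmark shows little difference.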

  • parasubvert 3 hours ago

    This has already been done at scale with HTTP/3 (QUIC), it's just not widely distributed beyond the largest sites & most popular web browsers. gRPC for example is still on multiplexed TCP via HTTP/2, which is "good enough" for many.

    Though it doesn't really replace TCP, it's just that the predominant requirements have changed (as Ousterhout points out). Bruce Davie has a series of articles on this: https://systemsapproach.substack.com/p/quic-is-not-a-tcp-rep...

    Also see Ivan Pepelnjak's commentary (he disagrees with Ousterhout): https://blog.ipspace.net/2023/01/data-center-tcp-replacement...

  • akira2501 6 hours ago

    > If Homa becomes widely deployed, I hypothesize that core congestion will cease to exist as a significant networking problem, as long as the core is not systemically overloaded.

    Yep. Sure; but, what happens when it becomes overloaded?

    > Homa manages congestion from the receiver, not the sender. [...] but the remaining scheduled packets may only be sent in response to grants from the receiver

    I hypothesize it will not be a great day when you do become "systemically" overloaded.
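
    To make the quoted mechanism concrete, here is a toy simulation of receiver-driven grants. It is a sketch under simplifying assumptions (one sender, one message, fixed-size grants, hypothetical names) and not Homa's actual algorithm: the first bytes go out unscheduled, and everything after that waits for the receiver to extend the granted offset.

```python
class GrantingReceiver:
    """Receiver that decides how far the sender may proceed (illustrative only)."""

    def __init__(self, unscheduled_bytes, grant_size):
        self.unscheduled_bytes = unscheduled_bytes  # bytes a sender may send blindly
        self.grant_size = grant_size                # bytes added per grant (must be > 0)

    def next_grant(self, message_len, received):
        """Return the new byte offset the sender may send up to."""
        return min(message_len, received + self.grant_size)


def transfer(message_len, receiver):
    """Simulate one message; return the (start, end) byte ranges sent in order."""
    sent = []
    limit = min(message_len, receiver.unscheduled_bytes)  # unscheduled part
    offset = 0
    while offset < message_len:
        if offset == limit:
            # Sender has used up its allowance and must wait for a grant.
            limit = receiver.next_grant(message_len, offset)
        sent.append((offset, limit))
        offset = limit
    return sent


chunks = transfer(10_000, GrantingReceiver(unscheduled_bytes=4_000, grant_size=3_000))
short = transfer(1_000, GrantingReceiver(unscheduled_bytes=4_000, grant_size=3_000))
```

    In this model an overloaded receiver would simply issue grants more slowly, so scheduled traffic queues at the senders rather than in the network; what happens to the unscheduled bytes under sustained overload is exactly the open question raised above.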

    • andrewflnr 5 hours ago

      Will it be a worse day than it would be with TCP? Either way, the only solution is to add more hardware, unless I'm misunderstanding the term "systemically overloaded".

  • wmf 6 hours ago

    Previous discussions:

    Homa, a transport protocol to replace TCP for low-latency RPC in data centers https://news.ycombinator.com/item?id=28204808

    Linux implementation of Homa https://news.ycombinator.com/item?id=28440542

  • Woodi 2 hours ago

    You want to replace TCP because it is bad? Then give us a better "connected" protocol over raw IP and other raw network topologies. Use it. Done.

    Don't mess with another IP -> UDP -> something

  • slt2021 5 hours ago

    the problem with trying to replace TCP only inside the DC is that TCP will still be used outside the DC.

    Networking Engineering is already convoluted and troublesome as it is right now, using only tcp stack.

    When you start using Homa inside but TCP from outside, things will break, because a lot of DC requests are created in response to an inbound request from outside the DC (like a client trying to send an RPC request).

    I cannot imagine trying to troubleshoot hybrid problems at the intersection of TCP and Homa; it's gonna be a nightmare.

    Plus I don't understand why you would create a new L4 transport protocol for a specific L7 application (RPC). This seems like a suboptimal choice, because the RPC of today could be replaced with something completely different, like RDMA over Ethernet for AI workloads or transfer of large streams like training data/AI model state.

    I think tuning TCP stack in the kernel, adding more configuration knobs for TCP, switching from stream(tcp) to packet (udp) protocols where it is warranted, will give more incremental benefits.

    One major thing the author missed is security, where these are considered table stakes:

    1. encryption in transit: handshake/negotiation

    2. ability to intercept and do traffic inspection for enterprise security purposes

    3. resistance to attacks like floods

    4. security of sockets in containerized Linux environments

    • jayd16 4 hours ago

      Are you imagining external TCP traffic will be translated at the load balancer or are you actually worried that requests out of an API Gateway need to be identical to what goes in?

      I could see the former being an issue (if that's even implied by "inside the data center") and I just don't see how it's a problem for the latter.

      • slt2021 an hour ago

        A typical software L7 load balancer (like nginx) will parse the entire TCP stream and HTTP headers and apply a bunch of logic based on the URL and various HTTP headers.

        There is a lot of work going on in the userland, like filling up TCP buffer, parsing HTTP stream, applying bunch of business logic, creating a downstream connection, sending data, getting response, etc.

        This is a lot of work in the userland and because of that a default nginx config is like 1024 concurrent connections per core, so not a lot.

        An L4 load balancer, on the other hand, works purely in packet-switching or NAT mode. The work consists of just rewriting the 5-tuple fields in the IP and TCP headers (src.ip, src.port, dst.ip, dst.port, proto), and it can use frameworks like VPP (vector packet processing) or DPDK for accelerated packet switching.

        Because of that, an L4 load balancer can perform very close to line rate, meaning it can load balance connections as fast as packets arrive at the network interface card. Line rate is the theoretical maximum of packet processing.

        In the case of stateless L4 load balancing there is no upper bound on the number of concurrent sessions to balance; it will be almost as fast as the core router that feeds the data.

        As you can see, L4 is clearly superior in performance, but the reason an L4 LB is possible is that it has TCP inbound and TCP outbound, so the only work required is to rewrite the headers and recalculate the checksums.
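
        A small sketch of that "rewrite and re-checksum" step, assuming a bare 20-byte IPv4 header with no options. The addresses and function names are made up for illustration; a real NAT-mode balancer would also rewrite ports and fix the TCP checksum, which covers the IP addresses through the pseudo-header.

```python
import socket
import struct


def internet_checksum(data: bytes) -> int:
    """Ones'-complement sum over 16-bit words, as used by IPv4 (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_ipv4_header(src: str, dst: str, payload_len: int = 0) -> bytes:
    """Build a minimal 20-byte IPv4 header carrying TCP, with a valid checksum."""
    header = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20 + payload_len,           # version/IHL, TOS, total length
        0, 0,                                # identification, flags/fragment offset
        64, socket.IPPROTO_TCP, 0,           # TTL, protocol, checksum placeholder
        socket.inet_aton(src), socket.inet_aton(dst),
    )
    csum = internet_checksum(header)
    return header[:10] + struct.pack("!H", csum) + header[12:]


def rewrite_destination(header: bytes, new_dst: str) -> bytes:
    """Swap the destination address and recompute the header checksum."""
    zeroed = header[:10] + b"\x00\x00" + header[12:16] + socket.inet_aton(new_dst)
    csum = internet_checksum(zeroed)
    return zeroed[:10] + struct.pack("!H", csum) + zeroed[12:]


original = build_ipv4_header("203.0.113.7", "198.51.100.10")
forwarded = rewrite_destination(original, "10.0.0.42")
```

        A header with a correct checksum sums to zero under `internet_checksum`, which is how the result can be verified. No stream reassembly or buffering is involved, which is why this path is so cheap.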

        With Homa, you would need to fully process TCP stream, before you initiate Homa connection, meaning you will waste a lot of RAM on keeping TCP buffers and rebuilding the stream according to the TCP sequence. Homa will lose all its benefits in the load balancing scenario.

        Author pitches only one use case for homa: East-West traffic, but again - these days the software is really agnostic of this East-West direction. What your software thinks is running in the server in next rack, could as well be a server in a different Availability Zone or read replica in different geo region.

        And that's the beauty of modern infra: everything is software, everything is ephemeral, and we don't really care if we are running this in a single DC or multiple DCs.

        Because of that, I think we will still stick to TCP as a proven protocol that will seamlessly interop when crossing different WAN/LAN/VPN networks

        I am not even talking about software-defined networks, like SD-WAN, where transport and signaling are done by the vendor-specific underlay network, and the overlay network is really just an abstraction for users that hides a lot of network discovery and network management underneath.

    • nicman23 4 hours ago

      the only place Homa makes sense is where there is no external TCP to the peers, or at least not in the same context, i.e. for RoCE

      • slt2021 an hour ago

        1. add software defined network, where transport and signaling is done by vendor-specific underlay, possibly across multiple redundant uplinks

        2. term "external" is really vague as modern networks have blended boundaries. Things like availability zone, region make dc-dc connection irrelevant, because at any point of time you will be required to failover to another AZ/DC/region.

        3. when I think of inter-Datacenter, I can only think of Ethernet. That's really it. Even in Ethernet, what you think of as a peer in your own subnet could be in a different DC, again due to software-defined networking.

  • unsnap_biceps 6 hours ago

    The original paper was discussed previously at https://news.ycombinator.com/item?id=33401480

  • mhandley 37 minutes ago

    It's already happening. For the more demanding workloads such as AI training, RDMA has been the norm for a while, either over Infiniband or Ethernet, with Ethernet gaining ground more recently. RoCE is pretty flawed though for reasons Ousterhout mentions, plus others, so a lot of work has been happening on new protocols to be implemented in hardware in next-gen high performance NICs.

    The Ultra Ethernet Transport specs aren't public yet so I can only quote the public whitepaper [0]:

    "The UEC transport protocol advances beyond the status quo by providing the following:

    ● An open protocol specification designed from the start to run over IP and Ethernet

    ● Multipath, packet-spraying delivery that fully utilizes the AI network without causing congestion or head-of-line blocking, eliminating the need for centralized load-balancing algorithms and route controllers

    ● Incast management mechanisms that control fan-in on the final link to the destination host with minimal drop

    ● Efficient rate control algorithms that allow the transport to quickly ramp to wire-rate while not causing performance loss for competing flows

    ● APIs for out-of-order packet delivery with optional in-order completion of messages, maximizing concurrency in the network and application, and minimizing message latency

    ● Scale for networks of the future, with support for 1,000,000 endpoints

    ● Performance and optimal network utilization without requiring congestion algorithm parameter tuning specific to the network and workloads

    ● Designed to achieve wire-rate performance on commodity hardware at 800G, 1.6T and faster Ethernet networks of the future"

    You can think of it as the love-child of NDP [2] (including support for packet trimming in Ethernet switches [1]) and something similar to Swift [3] (also see [1]).

    I don't know if UET itself will be what wins, but my point is the industry is taking the problems seriously and innovating pretty rapidly right now.

    Disclaimer: in a previous life I was the editor of the UEC Congestion Control spec.

    [0] https://ultraethernet.org/wp-content/uploads/sites/20/2023/1...

    [1] https://ultraethernet.org/ultra-ethernet-specification-updat...

    [2] https://ccronline.sigcomm.org/wp-content/uploads/2019/10/acm...

    [3] https://research.google/pubs/swift-delay-is-simple-and-effec...

  • runlaszlorun 5 hours ago

    For those who might not have noticed, the author is John Ousterhout, best known for Tcl/Tk as well as the Raft consensus protocol, among others.

    • signa11 4 hours ago

      and more recently (?) the book "A Philosophy of Software Design", highly recommended!

  • kmeisthax an hour ago

    Dumb question: why was it decided to only provide an unreliable datagram protocol in standard IP transit?

    • michaelt an hour ago

      Because when you're sending a signal down a wire or through the air, fundamentally the communication medium only provides "Send it, maybe it arrives"

      At any time, the receiver could lose power. Or a burst of interference could disrupt the radio link. Or a backhoe could slice through the cable. Or many other things.

      IP merely reflects this physical reality.

  • GoblinSlayer 3 hours ago

    > For many years, RDMA NICs could cache the state for only a few hundred connections; if the number of active connections exceeded the cache size, information had to be shuffled between host memory and the NIC, with a considerable loss in performance.

    A massively parallel task? Sounds like something doable with GPGPU.

  • ksec 6 hours ago

    Homa: A Receiver-Driven Low-Latency Transport Protocol Using Network Priorities

    https://people.csail.mit.edu/alizadeh/papers/homa-sigcomm18....

  • indolering 2 hours ago

    So token ring?

  • dveeden2 3 hours ago

    Wasn't something like HOMA already tried with SCTP?

    • iforgotpassword 3 hours ago

      And QUIC. And that thing tesla presented recently, with custom silicon even.

      And as usual, hardware gets faster, better and cheaper over the next years and suddenly the problem isn't a problem anymore - if it even ever was for the vast majority of applications. We only recently got a new fleet of compute nodes with 100gbit NICs. The previous one only had 10, plus omnipath. We're going ethernet only this time.

      I remember when saturating 10gbit/s was a challenge. This time around, reaching line speed with TCP, the server didn't even break a sweat. No jumbo frames, no fiddling with tunables. And that was actually while testing with 4-year-old Xeon boxes, not even the final hw.

      Again, I can see how there are use cases that benefit from even lower latency, but that's a niche compared to all DC business, and I'd assume you might just want RDMA in that case, instead of optimizing on top of Ethernet or IP.

      • silisili 2 hours ago

        This is a solid answer, as someone on the ground. TCP is not the bogeyman people point it out to be. It's the poison apple where some folks are looking for low hanging fruit.

  • 7e 6 hours ago

    TCP was replaced in the data centers of certain FAANG companies years before this paper.

    • wmf 6 hours ago

      If they keep it secret they don't get credit for it.

      • andrewflnr 5 hours ago

        How do you figure? The right decision is the right decision, even if you don't tell people. (granting, for the sake of argument, that it is the right decision)

        • wmf 4 hours ago

          Yeah, you get the benefit of secret tech (in this case faster networking) but people shouldn't give social credit for it because that creates incentives to lie. And, sadly, tech adoption runs entirely on social proof.

    • albert_e 5 hours ago

      Curious ... replaced with what, I would like to know.

    • bushbaba 6 hours ago

      *minority of the fangs.

      • avardaro 5 hours ago

        A minority? What large tech company has not prioritized this?

        • cdchn 5 hours ago

          Which have and with what?

  • yesbut 5 hours ago

    Another thing not worth investing time into for the rest of our careers. TCP will be around for decades to come.

    • t-writescode 3 hours ago

      True! And chances are, if you're developing website software or video game software, you'll never think about these sorts of things, it'll just be a dumb pipe for you, still.

      And that's okay!

      But there are other sorts of computer people than website writers and business application devs, and they're some of the people this would be interesting for!

  • stiray 5 hours ago

    How long did we need to support ipv6? Is it supported yet and more widely in use than the ipv4, like in mobile networks where everything is stashed behind NAT and ipv4 kept?

    Another protocol, something completely new? Good luck with that, I would rather bet on global warming to put us out of our misery (/s)...

    https://imgs.xkcd.com/comics/standards.png

    • detaro 4 hours ago

      Mobile networks especially are widely IPv6, with IPv4 being translated/tunneled where still needed. (End-user connections in general skew IPv6 in many places - it's observable how traffic patterns shift with people being at work vs at home. Corporate networks without IPv6 leading to more IPv4 traffic during the day, in the evening IPv6 from consumer connections takes over)

      • stiray 3 hours ago

        Android: Settings -> About (just checked mine, 10...*), check your IP. We have 3 providers in our country, all 3 are using ipv4 "lan" for phone connectivity, behind NAT and I am observing this situation around most of EU (Germany, Austria, Portugal, Italy, Spain, France, various providers).

  • bmitc 5 hours ago

    Unrelated to this article, are there any reasons to use TCP/IP over WebSockets? The latter is such a clean, message-based interface that I don't see a reason to use TCP/IP.

    • tacitusarc 5 hours ago

      Websockets is a layer on top of TCP/IP.

      • bmitc 4 hours ago

        Yes, I know that WebSockets layer over TCP/IP. But that both misses the point and is part of the point. The reason that I ask is that WebSockets seem to almost always be used in the context of web applications. TCP/IP still seems to dominate control communications between hardware. But why not WebSockets? Almost everyone ends up building a message framing protocol on top of TCP/IP, so why not just use WebSockets which has bi-directional message framing built-in? I'm just not seeing why WebSockets aren't as ubiquitous as TCP/IP and only seem to be relegated to web applications.
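
        For context, the hand-rolled framing that people end up building on a TCP stream often looks like the sketch below: a 4-byte big-endian length prefix followed by the payload. The function names are hypothetical, and `socket.socketpair()` stands in for a real TCP connection so the example is self-contained.

```python
import socket
import struct


def send_message(sock, payload: bytes) -> None:
    """Prefix the payload with its length so the peer can find the boundary."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n: int) -> bytes:
    """Read exactly n bytes; a stream socket may return fewer per recv()."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-message")
        buf.extend(chunk)
    return bytes(buf)


def recv_message(sock) -> bytes:
    """Read one length-prefixed message."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


a, b = socket.socketpair()          # stand-in for a connected TCP socket pair
send_message(a, b"hello")
send_message(a, b"")                # empty messages keep their boundary too
send_message(a, b"world!")
received = [recv_message(b) for _ in range(3)]
a.close()
b.close()
```

        This is about twenty lines with no handshake or masking, which is part of why many hardware-control protocols never reach for WebSockets.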

        • dataviz1000 4 hours ago

          There isn't much of a difference between a router between two machines physically next to each other and a router in Kansas connecting a machine in California with a machine in Miami. The packets of data are wrapped with an address of where they are going in the header.

          WebSockets are long-lived socket connections designed specifically for use on the 'web'. TCP sends data in packets with ordered, guaranteed delivery, which comes at an overhead cost. This is different from UDP, which guarantees neither order nor delivery: a UDP packet may arrive late, out of order, or not at all.

          With fetch() or XMLHttpRequest, the client has to spend energy and time opening a new HTTP connection, while a WebSocket opens a long-lived connection. When sending lots of bidirectional messages it makes sense to have a WebSocket. However, a simple fetch() request is easier to develop. A developer needs a good reason to use the more complicated WebSocket.

          Regardless, they both send messages using TCP, which ensures packet ordering and guaranteed delivery; those features have a lot to do with why TCP is the first choice.

          There is UDP which is used by WebRTC which is good for information like voice or video which can have missing and unordered packets occasionally.

          If two different processes on the same machine want to communicate, they can use a Unix socket. A Unix socket creates a special file (socket file) in the filesystem, which both processes can use to establish a connection and exchange data directly through the socket, not by reading and writing to the file itself. But the Unix Socket doesn't have to deal with routing data packets.
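
          A minimal sketch of that Unix socket exchange, with both ends in one process for brevity (in practice they would be separate processes). The function name and socket path are made up for illustration.

```python
import os
import socket
import tempfile


def unix_echo_once(message: bytes) -> bytes:
    """Send message over a Unix domain socket and return the upper-cased reply."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sock")

        server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        server.bind(path)               # creates the socket file at `path`
        server.listen(1)

        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        client.connect(path)            # rendezvous via the filesystem name
        conn, _ = server.accept()

        client.sendall(message)         # data flows through the kernel, not the file
        data = conn.recv(len(message))
        conn.sendall(data.upper())
        reply = client.recv(len(message))

        for s in (conn, client, server):
            s.close()
    return reply


reply = unix_echo_once(b"hello")
```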

          (ChatGPT says "Overall, you have a solid grasp of the concepts, and your statements are largely accurate with some minor clarifications needed.")

        • j16sdiz 4 hours ago

          WebSocket is a fairly inefficient protocol, and it needs to deal with the upgrade from HTTP. You also still need to implement your app-specific protocol. This adds complexity without additional benefit.

          It makes sense only if you have a WebSocket-based stack and don't want to maintain a second protocol.

        • wmf 4 hours ago

          Interesting point. For example, Web apps cannot speak BitTorrent (because Web apps are not allowed to use TCP) but they can speak WebTorrent over WebRTC and native apps can also speak WebTorrent. So in some sense a protocol that runs over WebSockets/WebRTC is superior because Web apps and native apps can speak it.

    • znpy 8 minutes ago

      > Unrelated to this article, are there any reasons to use TCP/IP over WebSockets?

      Performance. TCP over TCP is pretty bad.

      OpenVPN can do that (TCP-based VPN session over a TCP connection) and the documentation strongly advises against it.

    • tkin1980 5 hours ago

      Well, Websocket is over TCP, so you already need it for that.

  • freetanga 4 hours ago

    So, back to the mainframe and SNA in the data centers?

    • wmf 4 hours ago

      If Rosenblum can get an award for rediscovering mainframe virtualization, why not give Ousterhout an award for rediscovering SNA?

      (SNA was before my time so I have no idea if Homa is similar or not.)