An eBPF Loophole: Using XDP for Egress Traffic

(loopholelabs.io)

93 points | by loopholelabs a day ago

28 comments

  • tptacek 2 hours ago

    From 2022: https://www.samd.is/2022/06/13/egress-XDP.html

    You can also use XDP for outgoing packets on tap interfaces.

  • loopholelabs a day ago

    XDP (eXpress Data Path) is the fastest packet processing framework in Linux - but it only works for incoming (ingress) traffic. We discovered how to use it for outgoing (egress) traffic by exploiting a loophole in how the Linux kernel determines packet direction. Our technique delivers 10x better performance than current solutions, works with existing Docker/Kubernetes containers, and requires zero kernel modifications.

    This post not only expands on the overall implementation but also outlines how existing container and VM workloads can immediately take advantage of this with minimal effort and zero infrastructure changes.
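
    If it helps to see the shape of it, here is a deliberately stripped-down sketch (not the real implementation - the interface index is hardcoded as a placeholder, and the NAT logic, checksum updates, and error handling are all elided): attach an XDP program to the host-side peer of the container's veth pair, where the container's outgoing traffic shows up as ingress, and redirect it straight to the physical NIC.

      // Stripped-down sketch only. OUT_IFINDEX is a placeholder for the
      // physical NIC's ifindex; a real program would look it up via a map
      // and do its header rewrites before redirecting.
      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>

      #define OUT_IFINDEX 2 /* placeholder */

      SEC("xdp")
      int veth_egress(struct xdp_md *ctx)
      {
          /* This program is attached to the host-side veth peer, so the
           * container's outgoing packets arrive here as ingress frames. */

          /* ...NAT / header rewrites / checksum updates would go here... */

          /* Hand the frame to the physical NIC's TX path, skipping the
           * rest of the host stack (iptables, routing, etc.). */
          return bpf_redirect(OUT_IFINDEX, 0);
      }

      char LICENSE[] SEC("license") = "GPL";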

    • rtkaratekid 2 hours ago

      Forgive my ignorance, but is XDP faster than DPDK for packet processing? It seems like DPDK has had a lot of work done on hardware optimizations that allow speeds I can't recall XDP being able to reach. I have not looked too deeply into this though, so I'm very open to being wrong!

      • toprerules 29 minutes ago

        DPDK is a framework with multiple backends; on the receive side it can use XDP to intercept packets.

        You can't compare the efficiency of the frameworks without talking about the specific setups on the host. The major advantage of XDP is that it is completely baked into the kernel. All you need to do is bring your eBPF program and attach it. DPDK requires a great deal of setup and user space libraries to work.
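
        To show what "bring your eBPF program and attach it" amounts to, a minimal libbpf loader is roughly the following (sketch only; "prog.o", the program name, and "eth0" are all placeholders). Contrast that with the hugepage and driver-rebinding setup DPDK needs before it can see a packet.

          // Rough sketch of loading and attaching a compiled XDP object
          // with libbpf.
          #include <net/if.h>
          #include <linux/if_link.h>
          #include <bpf/libbpf.h>
          #include <bpf/bpf.h>

          int main(void)
          {
              struct bpf_object *obj = bpf_object__open_file("prog.o", NULL);
              if (!obj || bpf_object__load(obj))
                  return 1;

              struct bpf_program *prog =
                  bpf_object__find_program_by_name(obj, "veth_egress");
              int ifindex = if_nametoindex("eth0");
              if (!prog || !ifindex)
                  return 1;

              /* attach in native (driver) mode */
              return bpf_xdp_attach(ifindex, bpf_program__fd(prog),
                                    XDP_FLAGS_DRV_MODE, NULL);
          }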

  • AlexB138 37 minutes ago

    Really good, and glad that you're taking this technique further into a Docker network plugin. I wouldn't be surprised to see a Kubernetes CNI appear using this approach; it seems entirely viable unless I am missing something.

    I'll definitely be coming to check you all out at KubeCon.

    • shivanshvij 27 minutes ago

      Awesome, we'll be looking forward to it!

  • notherhack 42 minutes ago

    "For NAT (Network Address Translation) or any other packet header modifications, you need to recalculate checksums manually."

    Why doesn’t checksum offload in the NIC take care of that?
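
    For context, my (possibly wrong) understanding of the manual recalculation being described is something like this in XDP:

      // Rough idea of an incremental IPv4 header checksum update
      // (RFC 1624) after rewriting the source address for NAT; the
      // TCP/UDP checksum needs the same treatment because of the
      // pseudo-header. Bounds checks and error handling omitted.
      #include <linux/bpf.h>
      #include <linux/ip.h>
      #include <bpf/bpf_helpers.h>

      static __always_inline __u16 csum_fold(__u32 csum)
      {
          csum = (csum & 0xffff) + (csum >> 16);
          csum = (csum & 0xffff) + (csum >> 16);
          return (__u16)~csum;
      }

      static __always_inline void nat_rewrite_saddr(struct iphdr *iph,
                                                    __be32 new_saddr)
      {
          __be32 old_saddr = iph->saddr;

          iph->saddr = new_saddr;

          /* start from the complement of the old checksum and add the
           * delta between the old and new address words */
          __u32 csum = (__u32)bpf_csum_diff(&old_saddr, sizeof(old_saddr),
                                            &new_saddr, sizeof(new_saddr),
                                            ~((__u32)iph->check));
          iph->check = csum_fold(csum);
      }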

  • shivanshvij 21 hours ago

    Hi HN, Shivansh (founder) here, happy to answer any questions folks might have about the implementation and the benchmarks!

    • drewg123 an hour ago

      I come from a very different world (optimizing the FreeBSD kernel for the Netflix CDN, running on bare metal) but performance leaps like this are fascinating to me.

      One of the things that struck me when reading this with only general knowledge of the Linux kernel is: What makes things so terrible? Is iptables really that bad? Is something serialized to a single core somewhere in the other 3 scenarios? Is the CPU at 100% in all cases? Is this TCP or UDP traffic? How many threads is iperf using? It would be cool to see the CPU utilization of all 4 scenarios, along with CPU flamegraphs.

      • toprerules 21 minutes ago

        In the case of XDP, the reason it's so much faster is that it requires zero allocations in the most common case. The DMA buffers are recycled in a page pool that has already allocated and mapped at least queue-depth buffers for each hardware queue. XDP is simply running on the raw buffer data, then telling the driver what the user wants to do with the buffer. If all you are doing is rewriting an IP address, this is incredibly fast.

        In the non-XDP case (eBPF on TC) you have to allocate an sk_buff and initialize it. This is very expensive: there's tons of accounting in the struct itself, and components that track every sk_buff. Then there are the various CPU-bound routing layers.

        Overall the network core of Linux is very efficient. The actual page pool buffer isn't copied until the user reads data. But there are a million features the stack needs to support, and all of these cost efficiency.
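
        To make the "raw buffer data plus a verdict" point concrete, a minimal XDP program is just bounds-checked pointer walks over the frame followed by a return code (illustrative sketch, unrelated to the post's actual program):

          // The program gets pointers into the driver's DMA buffer (no
          // sk_buff), validates bounds for the verifier, and returns a
          // verdict telling the driver what to do with the frame.
          #include <linux/bpf.h>
          #include <linux/if_ether.h>
          #include <linux/ip.h>
          #include <bpf/bpf_helpers.h>
          #include <bpf/bpf_endian.h>

          SEC("xdp")
          int verdict_demo(struct xdp_md *ctx)
          {
              void *data     = (void *)(long)ctx->data;
              void *data_end = (void *)(long)ctx->data_end;

              struct ethhdr *eth = data;
              if ((void *)(eth + 1) > data_end)
                  return XDP_PASS;        /* truncated frame, let the stack decide */

              if (eth->h_proto != bpf_htons(ETH_P_IP))
                  return XDP_PASS;        /* only looking at IPv4 here */

              struct iphdr *iph = (void *)(eth + 1);
              if ((void *)(iph + 1) > data_end)
                  return XDP_PASS;

              /* Headers can be rewritten in place here; the verdict then
               * tells the driver what to do with the same buffer:
               * XDP_DROP recycles it, XDP_PASS builds an sk_buff for the
               * stack, XDP_TX bounces it out the same NIC, and
               * XDP_REDIRECT sends it to another interface or an AF_XDP
               * socket. */
              return XDP_PASS;
          }

          char LICENSE[] SEC("license") = "GPL";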

      • shivanshvij 28 minutes ago

        As far as we can tell, it's a mixture of a lot of things. One of the questions I got asked was how useful this is if you have a smaller performance requirement than 200Gbps (or, maybe a better way to put it, what if your host is small and can only do 10Gbps anyway).

        You'll have to wait for the follow-up post with the CNI plugin for the full self-reproducible benchmark, but on a 16-core EC2 instance with a 10Gbps connection, iptables couldn't do more than 5Gbps of throughput (TCP!), whereas again XDP was able to do 9.84Gbps on average.

        Furthermore, running bidirectional iPerf3 tests on the larger hosts shows us that both ingress and egress throughput increase when we swap out iptables on just the egress path.

        This is all to say, our current assumption is that when the CPU is thrashed by iPerf3, the RSS queues, the Linux kernel's ksoftirqd threads, etc. all at once, performance collapses. XDP moves some of the work out of the usual kernel stack path, while at the same time the packet is only processed through the kernel stack half as much as without XDP (only on the path before or after the veth).

        It really is all CPU usage in the end as far as I can tell. It’s not like our checksumming approach is any better than what the kernel already does.

      • tux1968 an hour ago

        It's also a bit depressing that everyone is still using the slower iptables, when nftables has been in the kernel for over a decade.

        • billfor a few seconds ago

          Iptables uses nftables under the hood.

        • shivanshvij 28 minutes ago

          Actually, the latest benchmarks were run on a Fedora 43 host, which as far as I can tell uses the nftables backend for iptables!

  • ZiiS an hour ago

    They say "By the time a packet reaches the TC hook, the kernel has already processed it through various subsystems for routing, firewalling, and even connection tracking." but surely this is also true before it reaches the VETH?

  • docapotamus 2 hours ago

    Great post.

    In some scenarios veth is being replaced with netkit for a similar reason. Does this impact how you're going to manage this?

  • ZiiS an hour ago

    I understand they're attached to the phrase "loophole", but it feels to me like they're using it as designed?

    • seneca 40 minutes ago

      XDP is intended only for inbound traffic. They are exploiting veth pairs to make outbound traffic "look like" inbound traffic. That's the "loophole".

      • tptacek 20 minutes ago

        It's really not a loophole. I think this might literally be in the xdp-tutorials repo.

  • iSloth an hour ago

    Also wondering, why not just use DPDK?

    • tptacek an hour ago

      First, "just use" is doing a lot of work in that sentence, because DPDK is much harder to use than XDP. The authors of this blog were surprised they had to do their own checksumming, for instance.

      Maybe more importantly: they're not building a middlebox. DPDK's ultra-high performance comes in part from polling. It's always running. XDP is just an extension to the existing network driver.

    • toprerules 19 minutes ago

      XDP is built into the kernel. DPDK is a huge framework that invasively bypasses the kernel and has to remain compatible as an external project.

    • ZiiS an hour ago

      I think they are getting a lot of value from the rest of the Kernel's networking (VETH/namespaces etc talking to containers).

  • sim7c00 23 minutes ago

    I really love this one. It's a really elegant and well-informed solution. One of the nicest finds I've seen in a while; it was a pleasure reading how it works! Thanks a lot.

  • toprerules 32 minutes ago

    I think the title is a little disingenuous, and the idea of using a redirect is certainly not novel. A solution for XDP egress should be able to handle all host egress, including SR-IOV traffic; this one only works with a very specific, namespace-driven topology.

  • kosolam 2 hours ago

    Hey, I can't browse - the link crashes on iOS.

  • betaby 21 minutes ago

    As I understand it, they implemented NAT using eBPF?