CDC File Transfer

(github.com)

361 points | by GalaxySnail 18 hours ago

94 comments

  • EdSchouten 15 hours ago

    I’ve also been doing lots of experimenting with Content Defined Chunking since last year (for https://bonanza.build/). One of the things I discovered is that the most commonly used algorithm FastCDC (also used by this project) can be improved significantly by looking ahead. An implementation of that can be found here:

    https://github.com/buildbarn/go-cdc

    • Scaevolus 15 hours ago

      This lookahead is very similar to the "lazy matching" used in Lempel-Ziv compressors! https://fastcompression.blogspot.com/2010/12/parsing-level-1...

      Did you compare it to Buzhash? I assume gearhash is faster given the simpler per iteration structure. (also, rand/v2's seeded generators might be better for gear init than mt19937)

      • EdSchouten 14 hours ago

        Yeah, GEAR hashing is simple enough that I haven't considered using anything else.

        Regarding the RNG used to seed the GEAR table: I don't think it actually makes that much of a difference. You only use it once to generate 2 KB of data (256 64-bit constants). My suspicion is that using some nothing-up-my-sleeve numbers (e.g., the first 2048 binary digits of π) would work as well.
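
        For illustration, here's a minimal sketch of that table setup in Go, using the stdlib seeded PCG from math/rand/v2 mentioned above. The function name and seed value are just illustrative, not taken from any particular implementation:

          package main

          import (
              "fmt"
              "math/rand/v2"
          )

          // buildGearTable fills the 256-entry GEAR table (256 x 64 bits = 2 KB)
          // with pseudo-random constants from a seeded generator. The exact seed
          // matters far less than every producer using the same table.
          func buildGearTable(seed uint64) [256]uint64 {
              var table [256]uint64
              rng := rand.New(rand.NewPCG(seed, 0))
              for i := range table {
                  table[i] = rng.Uint64()
              }
              return table
          }

          func main() {
              table := buildGearTable(42) // arbitrary, illustrative seed
              fmt.Printf("%#016x\n", table[0])
          }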

        • Scaevolus 4 hours ago

          Right, just one fewer module dependency using the stdlib RNG.

        • pbhjpbhj 13 hours ago

          The random number generation could happen to match the first 2048 digits of pi, so if it works with _any_ random number...

          If it doesn't work with any random number, then some seeds work better than others, and intuitively you could find a best seed (or a set of best seeds).

    • xyzzy_plugh 4 hours ago

      > https://bonanza.build

      I just wanted to let you know, this is really cool. Makes me wish I still used Bazel.

    • rokkamokka 12 hours ago

      What would you estimate the performance implications would be of using go-cdc instead of FastCDC in their cdc_rsync?

      • EdSchouten 11 hours ago

        In my case I observed a ~2% reduction in data storage when attempting to store and deduplicate various versions of the Linux kernel source tree (see link above). But that also includes the space needed to store the original version.

        If we take that out of the equation and only measure the size of the additional chunks being transferred, it's a reduction of about 3.4%. So it's not an order of magnitude difference, but not bad for a relatively small change.

    • quotemstr 9 hours ago

      I wonder whether there's a role for AI here.

      (Please don't hurt me.)

      AI turns out to be useful for data compression (https://statusneo.com/creating-lossless-compression-algorith...) and RF modulation optimization (https://www.arxiv.org/abs/2509.04805).

      Maybe it'd be useful to train a small model (probably of the SSM variety) to find optimal chunking boundaries.

      • EdSchouten 3 hours ago

        Yeah, that's true. Having some kind of chunking algorithm that's content/file format aware could make it work even better. For example, it makes a lot of sense to chunk source files at function/scope boundaries.

        In my case I need to ensure that all producers of data use exactly the same algorithm, as I need to look up build cache results based on Merkle tree hashes. That's why I'm intentionally focusing on having algorithms that are not only easy to implement, but also easy to implement consistently. I think that MaxCDC implementation that I shared strikes a good balance in that regard.

  • MayeulC 12 hours ago

    I am quite confused; doesn't rsync already use content-defined chunk boundaries, with a condition on the rolling hash to define boundaries?

    https://en.wikipedia.org/wiki/Rolling_hash#Content-based_sli...

    The speed improvements over rsync seem related to a more efficient rolling hash algorithm, and possibly to using native Windows executables instead of Cygwin (Windows file systems are notoriously slow, so maybe that plays a role here).

    Or am I missing something?

    In any case, the performance boost is interesting. Glad the source was opened, and I hope it finds its way into rsync.

    • re 11 hours ago

      > doesn't rsync already use content-defined chunk boundaries, with a condition on the rolling hash to define boundaries?

      No, it operates on fixed-size blocks of the destination file. However, by using a rolling hash, it can detect those blocks at any offset within the source file to avoid re-transferring them.
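
      To make the rolling part concrete, here is a minimal sketch in Go of an Adler-style checksum that can slide one byte at a time; this is the idea behind rsync's weak checksum as described in the tech report, not rsync's actual code:

        package main

        import "fmt"

        // rolling is an Adler-style weak checksum over a fixed-size window:
        // a is the plain byte sum, b is a position-weighted sum. Both can be
        // updated in O(1) when the window slides by one byte, which is what
        // lets rsync check every offset of the source file against the
        // hashes of the destination's fixed-size blocks.
        type rolling struct {
            a, b uint32
            n    uint32 // window (block) size
        }

        func newRolling(window []byte) rolling {
            r := rolling{n: uint32(len(window))}
            for i, c := range window {
                r.a += uint32(c)
                r.b += uint32(len(window)-i) * uint32(c)
            }
            return r
        }

        // roll slides the window by one byte: out leaves, in enters.
        func (r *rolling) roll(out, in byte) {
            r.a += uint32(in) - uint32(out)
            r.b += r.a - r.n*uint32(out)
        }

        func (r rolling) sum() uint32 { return r.a&0xffff | r.b<<16 }

        func main() {
            data := []byte("the quick brown fox jumps over the lazy dog")
            const block = 16
            r := newRolling(data[:block])
            for i := block; i < len(data); i++ {
                r.roll(data[i-block], data[i])
            }
            // The rolled value matches a fresh computation over the last window.
            fmt.Println(r.sum() == newRolling(data[len(data)-block:]).sum())
        }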

      https://rsync.samba.org/tech_report/node2.html

    • ohitsdom 8 hours ago

      The readme very nicely contrasts the approach with rsync.

    • sneak 12 hours ago

      rsync seems frozen in time; it’s been around for ages and there are so many basic and small quality of life improvements that could have been made that haven’t been. I have always assumed it’s like vim now: only really maintained in theory, not in practice.

      • chasil 4 hours ago

        Please bear in mind that there are [now] two distinct rsync codebases.

        The original is the GPL variant [today displaying "Upgrade required"]:

        https://rsync.samba.org/

        The second is the BSD clone:

        https://www.openrsync.org/

        The BSD version would be used on platforms that are intolerant of later versions of the GPL (Apple, Android, etc.).

      • Zardoz84 11 hours ago

        So you haven't used vim or neovim in the last 10 years?

        • lftl 8 hours ago

          To be fair, there was a roughly 6-year period when vim saw only one very minor release. That slow development period was the impetus for the Neovim fork.

          • Zardoz84 8 hours ago

            I know. I use Neovim. But since then, and thanks to Neovim, Vim has sped up and gained some improvements.

            • dotancohen 7 hours ago

              Time for neorsync.

              That said, VIM 8 was terrific.

  • wheybags 14 hours ago

    If anyone else was left wondering about the details of how CDC actually generates chunks, I found these two blog posts explained the idea pretty clearly:

    https://joshleeb.com/posts/content-defined-chunking.html

    https://joshleeb.com/posts/gear-hashing.html

    • jcul 9 hours ago

      Thanks, I was puzzled by that. They kind of gloss over it in the original link.

      Looking forward to reading those.

  • rekttrader 17 hours ago

    Nice to see Stadia had some long-term benefit. It's a shame they don't make a self-hosted version, but if you did that, it would just be piracy in today's DRM world.

    • kanemcgrath 17 hours ago

      for self-hosted game streaming you can use moonlight + sunshine, they work really well in my experience.

      • BrokenCogs 6 hours ago

        Exactly my experience too. I easily get 60fps at 1080p over wireless LAN with moonlight + sunshine. Parsec is also another option

    • sheepscreek 15 hours ago

      Probably wouldn’t have been feasible - I heard developers had to compile their games with Stadia support. Maybe it was an entirely different platform with its own alternative to DirectX, or maybe it had some kind of lightweight emulation (such as Proton), but I vaguely remember the few games I played had custom Stadia key bindings (with Stadia symbols). They would display like that within the game. So some customization definitely did happen.

      This is unlike the model that PlayStation, Xbox and even Nvidia are following - I don’t know about Amazon Luna.

      • MindSpunk 15 hours ago

        Stadia games were just run on Linux with Vulkan + some extra Stadia APIs for their custom swapchain and other bits and pieces. Stadia games were basically just Linux builds.

      • jakebasile 15 hours ago

        As I understand it, GeForce Now actually does require changes to the game to run in the standard (and until recently only) option, "Ready To Play". This is the supposed reason that new updates to games sometimes take time to get released on the service, since either the developers themselves or Nvidia needs to modify them to work correctly on the service. I have no idea if this is true, but it makes sense to me.

        They recently added "Install to Play" where you can install games from Steam that aren't modified for the service. They charge for storage for this though.

        Sadly, there are still tons of games unavailable because publishers need to opt in and many don't.

        • TiredOfLife 4 hours ago

          GeForce Now doesn't require any changes.

      • numpad0 14 hours ago

        They did have a dev console based on a Lenovo workstation, as well as off-menu AMD V340L 2x8GB GPUs, both of which later leaked onto Internet auction sites. So some hardware and software customization had definitely happened.

    • nolok 14 hours ago

      For self-hosted remote game streaming, look at Moonlight / Sunshine (Apollo).

      Stadia required special versions of games, so it wouldn't be that useful.

      • asmor 14 hours ago

        It's a shame that virtual / headless displays are such a mess on both Linux and Windows. I use a 32:9 ultrawide and stream to 16:9/16:10 devices, and even with hours of messing around with an HDMI dummy and kscreen-doctor[1] it was still an unreliable mess. Sometimes it wouldn't work when the machine was locked, and sometimes Sunshine wouldn't restore the resolution on the physical monitor (and there's no session timeout either).

        Artemis is a bit better, but it still requires per-device setup of displays since it somehow doesn't disable the physical output next to the virtual one. Those drivers also add latency to the capture (the author of looking glass really dislikes them because they undo all the hard work of near-zero latency).

        [1]: https://github.com/acuteaura/universe/blob/main/systems/_mod...

        • heavyset_go 9 hours ago

          On Linux with an AMD i/dGPU, you can set the `virtual_display` module parameter for `amdgpu`[1] and do what you want without the need for an HDMI dummy or weird software. It's also hardware accelerated.

          > virtual_display (charp)

          > Set to enable virtual display feature. This feature provides a virtual display hardware on headless boards or in virtualized environments. It will be set like xxxx:xx:xx.x,x;xxxx:xx:xx.x,x. It’s the pci address of the device, plus the number of crtcs to expose. E.g., 0000:26:00.0,4 would enable 4 virtual crtcs on the pci device at 26:00.0. The default is NULL.

          [1]https://www.kernel.org/doc/html/latest/gpu/amdgpu/module-par...

        • nolok 13 hours ago

          Use Apollo (a fork of Sunshine): https://github.com/ClassicOldSong/Apollo

          > Built-in Virtual Display with HDR support that matches the resolution/framerate config of your client automatically

          It includes a virtual screen driver and handles all the crap (it can disable your physical screen when streaming and re-enable it after, it can generate the virtual screen per client to match the client's needs, or do it per game, or ...)

          I stream from my main PC to both my laptop and my Steam Deck, and each gets a screen that matches it without having to do anything more than connect to it with Moonlight.

          • asmor 13 hours ago

            Artemis/Apollo are mentioned in the post above - yeah they work better than the out of box experience, but you still have to configure your physical screen to be off for every virtual display. It unfortunately only runs on Windows and my machine usually doesn't. I also only have one dGPU and a Raphael iGPU (which are sensitive to memory overclocks) and I like the Linux gaming experience for the most part, so while I did have a working gaming VM, it wasn't for me (or I'd want another GPU).

    • mrguyorama 3 hours ago

      I don't understand; "self-hosted Stadia" is exactly what a myriad of services and tools already do.

      Steam has game streaming built in and it works very well. Both Nvidia and AMD built this into their GPU drivers at one point or another (I think the AMD one was shut down?)

      Those are just the solutions I accidentally have installed despite not using that functionality. You can even stream games from the steam deck!

      Sony even has a system to let you stream your PS4 to your computer anywhere and play it. I think Microsoft built something similar for Xbox.

    • oofbey 17 hours ago

      What do you mean, piracy in a DRM world? Like being able to share your own PC games through the cloud?

      • killingtime74 10 hours ago

        You can share the games you authored all you like. If you bought a license to play them that's another story.

    • laidoffamazon 15 hours ago

      Stadia was sadly engineered in such a way that this is impossible.

      Speaking of which, who thought up the idea of using custom hardware for this that would _already be obsolete_ a year later? Who decided to use native Linux instead of a compat layer? Why did the original Stadia website not even have a search bar??

    • jMyles 17 hours ago

      > it’s just piracy in today’s drm world

      ...which is more important / needed than ever. I encourage everyone who asks to get my music from BitTorrent instead of Spotify.

      • MyOutfitIsVague 17 hours ago

        Why not something like Bandcamp, or other DRM-free purchase options?

        I'm not above piracy if there's no DRM free option (or if the music is very old or the artist is long dead), but I still believe in supporting artists who actively support freedom.

      • MaxikCZ 15 hours ago

        So you create and seed your torrents with your music, and present them prominently on your site?

        • jMyles 5 hours ago

          I was doing that for a while, and running a seedbox. However, on occasions when the seedbox was the only seeder, clients were unable to begin the download, for reasons I've never figured out. If I also seeded from my desktop, then fan downloads were being fed by both the desktop and the seedbox. But without the desktop, the seedbox did nothing.

          I need to revisit this in the next few weeks as I release my second record (which, if I may boast, has an incredible ensemble of most of my favorite bluegrass musicians on it; it was a really fun few days at the studio).

          Currently I do pin all new content to IPFS and put the hashes in the content description, as with this video of Drowsy Maggie with David Grier: https://www.youtube.com/watch?v=yTI1HoFYbE0

          Another note: our study of Drowsy Maggie was largely made possible by finding old-and-nearly-forgotten versions in the Great78 project, which of course the industry attempted to sue out of existence on an IP basis. This is another example of how IP is a conceptual threat to traditional music - we need to be able to hear the tradition in order to honor it.

  • tgsovlerkhgsel 12 hours ago

    Key sentence: "The remote diffing algorithm is based on CDC [Content Defined Chunking]. In our tests, it is up to 30x faster than the one used in rsync (1500 MB/s vs 50 MB/s)."

  • AnonC 15 hours ago

    Does anyone know if there’s work being done to integrate this into the standard rsync tool (even as an optional feature)? It seems like a very useful improvement that ought to be available widely. From this website it seems a bit disappointing that it’s not even available for Linux to Linux transfers.

  • bilekas 11 hours ago

    This is actually kind of cool. I've implemented my own version of this for my job, and it seems to be something that's important when the numbers get tight. But if I remember correctly, for their case, wouldn't it have been easier to work from rsync?

    > scp always copies full files, there is no "delta mode" to copy only the things that changed, it is slow for many small files, and there is no fast compression.

    I haven't tried it myself, but doesn't this already suit that requirement? https://docs.rc.fas.harvard.edu/kb/rsync/

    > Compression If the SOURCE and DESTINATION are on different machines with fast CPUs, especially if they’re on different networks (e.g. your home computer and the FASRC cluster), it’s recommended to add the -z option to compress the data that’s transferred. This will cause more CPU to be used on both ends, but it is usually faster.

    Maybe it's not fast enough, but it seems a better place to start than scp, imo.

    • regularfry 11 hours ago

      > The remote diffing algorithm is based on CDC. In our tests, it is up to 30x faster than the one used in rsync (1500 MB/s vs 50 MB/s).

    • rincebrain 10 hours ago

      rsync in my experience is not optimized for a number of use cases.

      Game development, in particular, often involves truly enormous sizes and numbers of assets, particularly for dev build iteration, where you're sometimes working with placeholder or unoptimized assets, and debug symbol bloated things, and in my experience, rsync scales poorly for speed of copying large numbers of things. (In the past, I've used naive wrapper scripts with pregenerated lists of the files on one side and GNU parallel to partition the list into subsets and hand those to N different rsync jobs, and then run a sync pass at the end to cleanup any deletions.)

      Just last week, I was trying to figure out a more effective way to scale copying a directory tree that was ~250k files varying in size between 128b and 100M, spread out across a complicatedly nested directory structure of 500k directories, because rsync would serialize badly around the cost of creating files and directories. After a few rounds of trying to do many-way rsync partitions, I finally just gave the directory to syncthing and let its pregenerated index and watching handle it.

      • jmuhlich 5 hours ago

        Try this: https://alexsaveau.dev/blog/projects/performance/files/fuc/f...

        > The key insight is that file operations in separate directories don’t (for the most part) interfere with each other, enabling parallel execution.

        It really is magically fast.

        EDIT: Sorry, that tool is only for local copies. I just remembered you're doing remote copies. Still worth keeping in mind.

  • est 14 hours ago

    I wonder if this could be applied to git.

    A git blob is hashed with a header containing its decimal length, so if you change even a small bit of the content, you have to recalculate the hash from the start.

    Something like CDC would improve this a lot.
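
    For reference, a blob ID is just the SHA-1 of a "blob <decimal length>\0" header followed by the file content; a quick sketch in Go (the helper name is mine):

      package main

      import (
          "crypto/sha1"
          "fmt"
      )

      // gitBlobID computes a git blob object ID: SHA-1 over the header
      // "blob <decimal length>\x00" followed by the content. Change one
      // byte of the content and the whole thing has to be re-hashed.
      func gitBlobID(content []byte) string {
          h := sha1.New()
          fmt.Fprintf(h, "blob %d\x00", len(content))
          h.Write(content)
          return fmt.Sprintf("%x", h.Sum(nil))
      }

      func main() {
          // Matches what `git hash-object` prints for the same content:
          // 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
          fmt.Println(gitBlobID([]byte("hello world\n")))
      }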

  • velcrovan 6 hours ago

    > Download the precompiled binaries from the latest release to a Windows device and unzip them. The Linux binaries are automatically deployed to ~/.cache/cdc-file-transfer by the Windows tools. There is no need to manually deploy them.

    Interesting, so unlike rsync there is no need to set up a service on the destination Linux machine. That always annoyed me a bit about rsync.

    • justinsaccount 5 hours ago

      The most common use for rsync is to run it over ssh where it starts the receiving side automatically. cdc is doing the exact same thing.

      You were misinformed if you thought using rsync required setting up an rsync service.

  • charleshwang 6 hours ago

    Is this how IBM Aspera works too? I was working QA at a game publisher a while ago, and they used it to upload some screen recordings. I didn't understand how it worked, but it was exceeding the upload speeds of the regular office internet.

    https://www.ibm.com/products/aspera

  • shae 5 hours ago

    I've read lots about content defined chunking and recently heard about monoidal hashing. I haven't tried it yet, but monoidal hashing reads like it would be all around better, does anyone know why or why not?

  • Sammi 11 hours ago

    It's dead and archived atm, but it looks like a good candidate for revival as an actual active open source project. If you ever wanted to work on something that looks good on your resume, then this looks like your chance. Basically just get it running and released on all major platforms.

  • mikae1 16 hours ago

    > cdc_rsync is a tool to sync files from a Windows machine to a Linux device, similar to the standard Linux rsync.

    Does this work Linux to Linux too?

  • modeless 16 hours ago

    Does Steam do something like this for game updates?

  • ksherlock 5 hours ago

    They should have duck ducked the initialism. CDC is Control Data Corporation.

  • theamk 17 hours ago

    This CDC is "Content Defined Chunking" - fast incremental file transfer.

    The use case is copying a file over a slow network when the previous version is already there, so one can save time by only sending the changed parts of the file.

    Not to be confused with USB CDC ("communications device class"), a USB device protocol used to present serial ports and network cards. It can also be used to transfer files; the old PC-to-PC cables used it by implementing two network cards connected to each other.

    • oofbey 17 hours ago

      The clever trick is how it recognizes insertions. The standard trick of computing hashes on fixed-size blocks works efficiently for substitutions but is totally defeated by an insertion or deletion.

      Instead, with CDC the block boundaries are defined by the content, so an insertion doesn't change the block boundaries, and it can tell that the subsequent blocks are unchanged. I haven't read the CDC paper, but I'm guessing they just use some probabilistic hash function to define certain strings as block boundaries.

      • teraflop 17 hours ago

        Probably worth noting that ordinary rsync can also handle insertions/deletions because it uses a rolling hash. Rsync's method is bandwidth-efficient, but not especially CPU-efficient.

      • adzm 11 hours ago

        > I haven’t read the CDC paper but I’m guessing they just use some probabilistic hash function to define certain strings as block boundaries.

        You choose a number of bits (say, 12) and then evenly distribute these in a 48-bit mask; if the hash at any point has all these bits on, that defines a boundary.
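
        As a rough sketch of that idea in Go: the mask layout and the all-bits-set test below follow the description above (many FastCDC-style implementations test for the masked bits being zero instead, which has the same effect, and also vary the mask around the target chunk size); function names are mine and purely illustrative:

          package main

          import (
              "fmt"
              "math/rand/v2"
          )

          // spreadMask sets `bits` bits spread evenly across the top `width`
          // bits of a 64-bit word. With 12 bits set, a boundary fires on
          // average once every 2^12 bytes, i.e. ~4 KiB chunks (before any
          // min/max chunk size limits are applied).
          func spreadMask(bits, width int) uint64 {
              var mask uint64
              for i := 0; i < bits; i++ {
                  mask |= 1 << (63 - uint(i*width/bits))
              }
              return mask
          }

          // cut scans data with a GEAR rolling hash and returns the position
          // just past the first content-defined boundary, or len(data) if no
          // boundary is found.
          func cut(data []byte, table *[256]uint64, mask uint64) int {
              var h uint64
              for i, b := range data {
                  h = h<<1 + table[b] // GEAR update: shift and add a per-byte constant
                  if h&mask == mask { // all mask bits set -> boundary
                      return i + 1
                  }
              }
              return len(data)
          }

          func main() {
              // Random table and data purely for demonstration.
              rng := rand.New(rand.NewPCG(1, 2))
              var table [256]uint64
              for i := range table {
                  table[i] = rng.Uint64()
              }
              data := make([]byte, 64*1024)
              for i := range data {
                  data[i] = byte(rng.Uint64())
              }
              fmt.Println("first boundary at byte", cut(data, &table, spreadMask(12, 48)))
          }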

    • NooneAtAll3 16 hours ago

      not to be confused with the Centers for Disease Control

  • 0xfeba 7 hours ago

    The name reminds me of Microsoft's RDC, Remote Differential Compression.

    https://en.wikipedia.org/wiki/Remote_Differential_Compressio...

  • phyzome 9 hours ago

    You can see something similar in use in the borg backup tool -- content-defined chunking, before deduplication and encryption.

  • maxlin 16 hours ago

    Having dabbled in making a quick delta patch system like Steam's (which required me to understand delta patching methods, and which could make small patches to big files in a 10 GB+ installation in a few seconds), this sure is quite interesting!

    I wonder if Steam ever decides to supercharge their content handling with some user-space filesystem stuff. With fast connections, there isn't really a reason they couldn't launch games in seconds, streaming data on-demand with smart pre-caching steering based on automatically trained access pattern data. And especially with finely tuned delta patching like this, online game pauses for patching could be almost entirely eliminated. Stop & go instead of a pit stop.

    • fsfod 14 hours ago

      Someone already created that[1] using a custom kernel driver and their own CDN, but they seem to have abandoned it[2], maybe because they would have attracted Valve's wrath by trying to monetize it.

      [1] https://web.archive.org/web/20250517130138/https://venusoft....

      [2] https://venusoft.net/#home

      • maxlin 4 hours ago

        That's actually quite interesting. Not entirely what I had in mind but close! My version would have only the first boot be a bit slow, but the aspect of dynamically replacing local content there is cool.

        This would be extra cool for LAN parties with good network hardware

    • Zekio 8 hours ago

      Steam game installs are bottlenecked by CPU speed these days due to the heavy compression, so I doubt it would be much faster.

      • maxlin 4 hours ago

        Well, the amount of compression isn't set in stone; obviously a system like this would run with a less compressed dataset to balance game boot time and the CPU time that compression takes away from running the game, and to scale with available bandwidth.

        With low bandwidth, just downloading the whole thing while using enough compression to ~80% saturate the local system would be optimal instead, sure.

  • supportengineer 16 hours ago

    Cygwin? Does anyone still use that?

    • cheema33 6 hours ago

      Cygwin has its benefits over WSL: for example, it does not run in a VM and therefore does not suffer from the resulting performance penalty.

  • exikyut 12 hours ago

    I'm curious: what does MUC stand for? :)

  • claytongulick 16 hours ago

    I ran into some of those issues with the chunk size and hash misses when writing bitsync [1], but at the time I didn't want to get too clever with it because I was focused on rsync algorithm compatibility.

    This is a cool idea!

    [1] https://github.com/claytongulick/bit-sync

  • laidoffamazon 15 hours ago

    As I've gotten further in my career I've started to wonder - how many engineering quarters did it take to build this for their customers? How did they manage to get this on their own roadmap? This seems like a lot of code surface area for a fairly minimal optimization that would be redundant with a different development substrate (like running Windows on Stadia, the way Amazon Luna worked...)

    • grodes 13 hours ago

      You are thinking like a manager, but this (as with most of the good things in life) has been built by doers, artisans, and engineers (developers).

      This is an interesting enough problem, with huge potential benefits for humanity if it manages to improve anything, which it did.

    • jayd16 14 hours ago

      It's easy to get work on this problem. Any effort that shortens game deploy time will be highly visible. It's something every game needs, and every member of the team deals with.

      • laidoffamazon 5 hours ago

        I'm sympathetic to this idea, but it seems like this is a situation most game developers don't have, because they just develop locally. Sometimes they do need to push to a console, which this could help with if Microsoft or Sony built it into their dev kit tooling.

  • ur-whale 15 hours ago

    Great initiative, especially the new sync algorithm, but giant hurdles to adoption:

    - only works on a weird combo of (src platform / dst platform). Why???? How hard is it to write platform-independent code to read/write bytes and send them over the wire in 2025?

    - uses bazel, an enormous, Java-based abomination, to build.

    Fingers crossed that these can be fixed, or this project is dead in the water.

    • jve 12 hours ago

      Hey, the repo is archived, and as I read it, the tool was meant to solve one specific scenario. Not everything has to please the public.

      The great thing is that Googlers could make such a tool and publish it in the first place. So you can improve it to use in your scenario, or become the maintainer of such a tool.

    • maccard 11 hours ago

      > only works on a weird combo of (src platform / dst platform). Why????

      Stadia ran on linux, and 99.9999999% of game development is done on windows (and cross compiled for linux).

      > Fingers crossed that these can be fixed, or this project is dead in the water.

      The project was archived 9 months ago, and hasn't had a commit in 2 years. It's already dead.

    • hobs 14 hours ago

      The first thing might be considered a bug by Googlers, but everyone I have talked to LOVED their Bazel, or at least thought of it as superior to any other tool that does the same stuff.

      Literally tonight my buddy was talking about his months-long plan to introduce Bazel into his company's infra.

  • syngrog66 9 hours ago

    CDC is an unfortunately chosen name

  • janpmz 14 hours ago

    Tailscale plus python3 -m http.server 1337, then navigating the browser to ip:1337, is a nice way to transfer files too (without chunking). I've made an alias for it: alias serveit="python3 -m http.server 1337"