We shrunk our Javascript monorepo git size

(jonathancreamer.com)

326 points | by kwantaz 4 days ago

169 comments

  • tux3 4 days ago

    For those wondering where this new git-survey command is, it's actually not in git.git yet!

    The author is using microsoft's git fork, they've added this new command just this summer: https://github.com/microsoft/git/pull/667

    • masklinn 4 days ago

      I assume full-name-hash and path-walk are also only in the fork as well (or in git HEAD)? Can't see them in the man pages, or in the 2.47 changelog.

  • yunusabd 4 days ago

    > For many reasons, that's just too big, we have folks in Europe that can't even clone the repo due to it's size.

    What's up with folks in Europe that they can't clone a big repo, but others can? Also it sounds like they still won't be able to clone, until the change is implemented on the server side?

    > This meant we were in many occasions just pushing the entire file again and again, which could be 10s of MBs per file in some cases, and you can imagine in a repo

    The sentence seems to be cut off.

    Also, the gifs are incredibly distracting while trying to read the article, and they are there even in reader mode.

    • anon-3988 4 days ago

      > For many reasons, that's just too big, we have folks in Europe that can't even clone the repo due to it's size.

      I read that as an anecdote; a more complete sentence would be "We had a story where someone from Europe couldn't clone the whole repo onto his laptop for a journey across Europe because his disk was full at the time. He has since cleared up the disk and is able to clone the repo".

      I don't think it points to a larger issue with Europe not being able to handle 180GB files...I surely hope so.

      • peebeebee 4 days ago

        The European Union doesn't like when a file get too big and powerful. It needs to be broken apart in order to give smaller files a chance of success.

        • wizzwizz4 4 days ago

          Ever since they enshrined the Unix Philosophy into law, it's been touch-and-go for monorepotic corporations.

        • _joel 4 days ago

          People foolishly thought the G in GDPR stood for "general" when it's actually GIANT.

    • acdha 4 days ago

      My guess is that “Europe” is being used as a proxy for “high latency, low bandwidth” – especially if the person in question uses a VPN (especially one of those terrible “SSL VPN” kludges). It’s still surprisingly common to encounter software with poor latency handling or servers with broken window scaling because most of the people who work on them are relatively close and have high bandwidth connection.

      • jerf 4 days ago

        And given the way of internal corporate networks, probably also "high failure rate", not because of "the internet", but the pile of corporate infrastructure needed for auditability, logging, security access control, intrusion detection, maxed out internal links... it's amazing any of this ever functions.

        • acdha 4 days ago

          Or simply how those multiply latency - I’ve seen enterprise IT dudes try to say 300ms LAN latency is good because nobody wants to troubleshoot their twisted mess of network appliances and it’s not technically down if you’re not getting an error…

          (Bonus game: count the number of annual zero days they’re exposed to because each of those vendors still ships 90s-style C code)

      • sroussey 4 days ago

        Or high packet loss.

        Every once in a while, my router used to go crazy with what seemed to be packet loss (I think a memory issue).

        Normal websites would become super slow for any pc or phone in the house.

        But git… git would fail to clone anything not really small.

        My fix was to unplug the modem and router and plug back in. :)

        It took a long time to discover the router was reporting packet loss, that the slowness the browsers were experiencing had to do with retries, and that git just crapped out.

        Eventually when git started misbehaving I restarted the router to fix.

        And now I have a new router. :)

      • hinkley 3 days ago

        Sounds, based on other responders, like high latency high bandwidth, which is a problem many of us have trouble wrapping our heads around. Maybe complicated by packet loss.

        After COVID I had to set up a compressing proxy for Artifactory and file a bug with JFrog about it because some of my coworkers with packet loss were getting request timeouts that npm didn’t handle well at all. Npm of that era didn’t bother to check bytes received versus content-length and then would cache the wrong answer. One of my many, many complaints about what total garbage npm was prior to ~8 when the refactoring work first started paying dividends.

    • thrance 4 days ago

      The repo is probably hosted on the west coast, meaning it has to cross the Atlantic whenever you clone it from Europe?

    • benkaiser 3 days ago

      I can actually weigh in here. Working from Australia for another team inside Microsoft with a large monorepo on Azure DevOps. I pretty much cannot do a full (unshallow) clone of our repo because Azure DevOps cloning gets nowhere close to saturating my gigabit wired connection, and eventually, due to the sheer time it takes, cloning will hang up on either my end or the Azure DevOps end, to the point that I would just give up.

      Thankfully, we do our work almost entirely in shallow clones inside codespaces so it's not a big deal. I hope the problems presented in the 1JS repo from this blog post are causing similar size blowout in our repo and can be fixed.

    • tazjin 4 days ago

      > What's up with folks in Europe that they can't clone a big repo, but others can?

      They might be in a country with underdeveloped internet infrastructure, e.g. Germany))

      • avianlyric 3 days ago

        I don't think there's any country in Europe with internet infrastructure as underdeveloped as the US. Most of Europe has fibre-to-the-premises, and all of Europe has consumer internet packages that are faster and cheaper than you're gonna find anywhere in the U.S.

        • tazjin 3 days ago

          There's (almost) no FTTH in Germany. The US used to be as bad as Germany, but it has improved significantly and is actually pretty decent these days (though connection speed is unevenly distributed).

          Both countries are behind e.g. Sweden or Russia, but Germany by a much larger margin.

          There's some trickery done in official statistics (e.g. by factoring in private connections that are unavailable to consumers) to make this seem better than it is, but ask anyone who lives there and you'll be surprised.

          • rurban 2 days ago

            The east has fibre everywhere, but the west is still a developing country(side). Shipping code on a truck would be faster if you are not on some academic fibre net.

  • tazjin 4 days ago

    I just tried this on nixpkgs (~5GB when cloned straight from Github).

    The first option mentioned in the post (--window 250) reduced the size to 1.7GB. The new --path-walk option from the Microsoft git fork was less effective, resulting in 1.9GB total size.

    Both of these are less than half of the initial size. Would be great if there was a way to get Github to run these, and even greater if people started hosting stuff in a way that gives them control over this ...
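
    For anyone wanting to try, the invocations look roughly like this (repack flags as described in the post; count-objects is just one way to compare pack sizes afterwards):

      git clone https://github.com/NixOS/nixpkgs.git && cd nixpkgs
      git repack -adf --window=250   # stock git
      # or, with the microsoft/git fork:
      git repack -adf --path-walk
      git count-objects -v -H        # check the resulting pack size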

  • eviks 4 days ago

    upd: silly mistake - file name does not include its full path

    The explanation probably got lost among all the gifs, but the last 16 chars here are different:

    > was actually only checking the last 16 characters of a filename

    > For example, if you changed repo/packages/foo/CHANGELOG.md, when git was getting ready to do the push, it was generating a diff against repo/packages/bar/CHANGELOG.md!

    • tux3 4 days ago

      Derrick provides a better explanation in this cover letter: https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitg...

      (See also the path-walk API cover letter: https://lore.kernel.org/all/pull.1786.git.1725935335.gitgitg...)

      The example in the blog post isn't super clear, but Git was essentially taking all the versions of all the files in the repo, putting the last 16 bytes of the path (not filename) in a hash table, and using that to group what they expected to be different versions of the same file together for delta compression.

      Indeed the blog's example doesn't quite work as written, because the common suffix "/CHANGELOG.md" is only 13 chars, so the last 16 characters still differ; you have to imagine the paths have a longer common suffix. That part is fixed by the --full-name-hash option: now you compare the full path instead of just 16 bytes.

      Then they talk about increasing the window size. That's kind of a hack to work around bad file grouping, but it's not the real fix. You're still giving terrible inputs to the compressor and working around it by consuming huge amounts of memory. So it was a bit confusing to present that as the solution. The path walk API and/or --full-name-hash are the real interesting parts here =)
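
      To make the grouping concrete, here's a rough Go sketch of that legacy name hash (modeled on the snippet in the cover letter; git's real function also skips whitespace, and the package names below are made up):

        package main

        import "fmt"

        // Sketch of git's old grouping hash: each byte shifts the earlier
        // input right by two bits, so only roughly the last 16 characters
        // of the path influence the result.
        func nameHash(path string) uint32 {
            var h uint32
            for i := 0; i < len(path); i++ {
                h = (h >> 2) + uint32(path[i])<<24
            }
            return h
        }

        func main() {
            // Hypothetical package names whose common suffix is longer than 16 chars.
            a := nameHash("packages/foo-utils/CHANGELOG.md")
            b := nameHash("packages/bar-utils/CHANGELOG.md")
            fmt.Printf("%08x %08x equal=%v\n", a, b, a == b) // equal=true: grouped together
        }

      With --full-name-hash (or the path-walk approach) the grouping considers the whole path, so these two no longer collide.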

      • lastdong 4 days ago

        Thank you! I ended up having to look at the PR to make any sense of the blog post, but your explanation and links make things much clearer

        • jonathancreamer 2 days ago

          I'll update the post with this clarity too. Thanks!

    • derriz 4 days ago

      I wish they had provided an actual explanation of what exactly was happening and skipped all the “color” in the story. By filename do they mean path? Or is it that git will just pick any file with a matching name to generate a diff? Is there any pattern to the choice of other file to use?

    • js2 3 days ago

      > file name does not include its full path

      No, it is the full path that's considered. Look at the commit message on the first commit in the `--full-name-hash` PR:

      https://github.com/git-for-windows/git/pull/5157/commits/d5c...

      Excerpt: "/CHANGELOG.json" is 15 characters, and is created by the beachball [1] tool. Only the final character of the parent directory can differentiate different versions of this file, but also only the two most-significant digits. If that character is a letter, then this is always a collision. Similar issues occur with the similar "/CHANGELOG.md" path, though there is more opportunity for differences in the parent directory.

      The grouping algorithm puts less weight on each character the further it is from the right-side of the name:

        hash = (hash >> 2) + (c << 24)
      
      Hash is 32-bits. Each 8-bit char (from the full path) in turn is added to the 8-most significant bits of hash, after shifting any previous hash bits to the right by two bits (which is why only the final 16 chars affect the final hash). Look at what happens in practice:

      https://go.dev/play/p/JQpdUGXdQs7

      Here I've translated it to Go and compared the final value of "aaa/CHANGELOG.md" to "zzz/CHANGELOG.md". Plug in various values for "aaa" and "zzz" and see how little they influence the final value.

      • rurban 2 days ago

        Sounds like it needs to be fixed to FNV1a

        • js2 2 days ago

          No, the problem isn't the hash. It does what it was designed to do. It's just that it was optimal for a particular use case that fits the Linux kernel better than Microsoft's use case. Switching the hash wouldn't improve either situation. If you want to understand this deeper, see the linked PRs.

      • eviks 3 days ago

        Thanks for the deep dive!

    • daenney 4 days ago

      File name doesn’t necessarily include the whole path. The last 16 characters of CHANGELOG.md is the full file name.

      If we interpret it that way, that also explains why the path-walk solution solves the problem.

      But if it’s really based on the last 16 characters of just the file name, not the whole path, then it feels like this problem should be a lot more common. At least in monorepos.

      • floam 4 days ago

        It did shrink Chromium’s repo quite a bit!

      • eviks 4 days ago

        yes, this makes sense, thanks for pointing it out, silly confusion on my part

    • p4bl0 4 days ago

      I was also bugged by that. I imagine that the meta variables foo and bar are at fault here, and that probably the actual package names had a common suffix like firstPkg and secondPkg. A common suffix of length three is enough in this case to get 16 chars in common as "/CHANGELOG.md" is already 13 chars long.

    • jonathancreamer 2 days ago

      Sorry about the gifs. Haha. And yeah I guess my understanding wasn't quite right either reading the reply to this thread, I'll try to clean it up in the post.

  • jakub_g 4 days ago

    The article mentions Derrick Stolee, who did the digging and shipped the necessary changes. If you're interested in git internals, shrinking git clone sizes locally and in CI etc., Derrick wrote some amazing posts on the GitHub blog:

    https://github.blog/author/dstolee/

    See also his website:

    https://stolee.dev/

    Kudos to Derrick, I learnt so much from those!

  • fragmede 4 days ago

    > Large blobs happens when someone accidentally checks in some binary, so, not much you can do

    > Retroactively, once the file is there though, it's semi stuck in history.

    Arguably, the fix for that is to run filter-branch, remove the offending binary, teach and get everyone set up to use git-lfs for binaries, force push, and help everyone get their workstation to a good place.

    Far from ideal, but better than having a large not-even-used file in git.

    • abound 4 days ago

      There's also BFG (https://rtyley.github.io/bfg-repo-cleaner/) for people like me who are scared of filter-branch.

      As someone else noted, this is about small, frequently changing files, so you could remove old versions from the history to save space, and use LFS going forward.

    • larusso 4 days ago

      The main issue is not a binary file that never changes. It’s the small binary file that changes often.

    • cocok 4 days ago

      filter-repo is the recommended way these days:

      https://github.com/newren/git-filter-repo
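
      A minimal sketch of that cleanup (the path is made up; filter-repo rewrites history, so everyone has to re-clone or hard-reset afterwards):

        # drop every version of the offending binary from history
        git filter-repo --invert-paths --path assets/huge-model.bin
        # then keep future binaries out of the repo proper
        git lfs track "*.bin"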

    • lastdong 4 days ago

      It’s easier to blame Linus.

  • develatio 4 days ago

    Hacking Git sounds fun, but isn't there a way to just not have 2.500 packages in a monorepo?

    • hinkley 3 days ago

      Code line count tends to grow exponentially. The bigger the code base, the more unreasonable it is to expect people not to reinvent an existing wheel, due to ignorance of the code or fear of breaking what exists by altering it to handle your use case (ignorance of the uses of the code).

      IME it takes less time to go from 100 modules to 200 than it takes to go from 50 to 100.

    • Cthulhu_ 4 days ago

      Yeah, have 2500 separate Git repos with all the associated overhead.

      • develatio 4 days ago

        Can’t we split the packages into logical groups and maybe have 20 or 30 monorepos of 70-100 packages? I doubt that all the devs involved in that monorepo have to deal with all the 2500 packages. And I doubt that there is a circular dependency that requires all of these packages to be managed in a single monorepo.

        • smashedtoatoms 4 days ago

          People act like managing lots of git repos is hard, then run into monorepo problems requiring them to fix esoteric bugs in C that have been in git for a decade, all while still arguing monorepos are easy and great and managing multiple repos is complicated and hard.

          It's like hammering a nail through your hand, and then buying a different hammer with a softer handle to make it hurt less.

          • crazygringo 4 days ago

            > all while still arguing monorepos are easy and great

            I don't know anyone who says monorepos are easy.

            To the contrary, the tooling is precisely the hard part.

            But the point is that the difficulty of the tooling is a lot less than the difficulty of managing compatibility conflicts between tons of separate repos.

            Each esoteric bug in C only needs to be fixed once. Whereas your version compatibility conflict this week is going to be followed by another one next week.

            • wavemode 3 days ago

              At Amazon, there is no monorepo.

              And the tooling to handle this is not even particularly conceptually complicated - a "versionset" is a set of versions - a set of pointers to a particular commit of a repository. When you build and deploy an application, what you're building is a versionset containing the correct versions of all its dependencies. And pull requests can span across multiple repositories.

              Working at Amazon had its annoyances, but dependency management across repos was not one of them.

              • spankalee 2 days ago

                > And pull requests can span across multiple repositories

                This bit is doing a lot of work here.

                How do you make commits atomic? Is there a central commit queue? Do you run the tests of every dependent repo? How do you track cross-repo dependencies to do that? Is there a central database? How do you manage rollbacks?

            • HdS84 3 days ago

              That's exactly the problem. At least tooling can solve monorepo problems. But commits that should span multiple repos have no tooling at all. Except pain. Lots of pain.

          • Vilian 3 days ago

            Don't forget that git was made for Linux and Linux isn't a monorepo and works great with tens of thousands of devs per release

            • arp242 3 days ago

              > Linux isn't a monorepo

              I assume you meant to write "is" there?

      • hinkley 3 days ago

        Changing 100 CI pipelines is a giant pain in the ass. The third time I split the work with two other people. The 4th time someone wrote a tool and switched to a config file in the repo. 2500 is nuts. How do you even track red builds?

    • lopkeny12ko 4 days ago

      This was exactly my first thought as well. This seems like an entirely self-manufactured problem.

      • hinkley 3 days ago

        When you have hundreds of developers you're going to get millions of lines of code. That's partly Parkinson's Law, but also we have not fully perfected the three-way merge, encouraging devs to spread out more than intrinsically necessary in order to avoid tripping over each other.

        If you really dig down into why we code the way we do, the “best practices” in software development, about half of them are heavily influenced by merge conflict, if not the primary cause.

        If I group like functions together in a large file, then I (probably) won’t conflict with another person doing an unrelated ticket that touches the same file. But if we both add new functions at the bottom of the file, we’ll conflict. As long as one of us does the right thing everything is fine.

  • snthpy 4 days ago

    Thanks for this post. Really interesting and a great win for OSS!

    I've been watching all the recent GitMerge talks put up by GitButler and following the monorepo / scaling developments - lots of great things being put out there by Microsoft, Github, and Gitlab.

    I'd like to understand this last 16 char vs full path check issue better. How does this fit in with delta compression, pack indexes, multi-pack indexes etc ... ?

  • wodenokoto 4 days ago

    Nice to see that Microsoft is dog-fooding Azure DevOps. It seems that more and more Azure services only have native connectors to GitHub so I actually thought it was moving towards abandonware.

  • issung 4 days ago

    Having someone within arm's reach who knows the inner workings of Git so well must be a lovely perk of working on such projects at companies of this scale.

    • jonathanlydall 4 days ago

      Certainly being in an org which has close ties to entities like GitHub helps, but any team in any org with that number of developers can justify the cost of bringing in a highly specialized consultant to solve an almost niche problem like this.

  • nkmnz 4 days ago

    > we have folks in Europe that can't even clone the repo due to it's size

    Officer, I'd like to report a murder committed in a side note!

  • dizhn 4 days ago

    They call him Linux Torvalds over there?

  • bubblesnort 4 days ago

    > We work in a very large Javascript monorepo at Microsoft we colloquially call 1JS.

    I used to call it office.com. Teams is the worst offender there. Even a website with a cryptominer on it runs faster than that junk.

    • wodenokoto 4 days ago

      We were all impressed with google docs, but office.com is way more impressive.

      Collaborative editing between a web app, two mobile apps and a desktop app with 30 years of backwards compatibility, and it pretty much just works. No wonder that took a lot of JavaScript!

      • esperent 4 days ago

        We use MS Teams at my company. The Word and Excel in the Windows Teams app are so buggy that I can almost never successfully open a file. It just times out and eventually shows a "please try again later" message nearly every time. I've uninstalled and reinstalled the Teams app four or five times trying to fix this.

        We've totally given up any kind of collaborative document editing because it's too frustrating, or we use Notion instead, which for all its faults at least gets the basic stuff like loading a bloody file to work...

        • acdha 4 days ago

          This is specific to your company’s configuration - likely something related to EDR or firewall policies.

          • esperent 4 days ago

            I'm the one who set it up. It's a small team of 20 people. I've done basically no setup beyond the minimum of following docs to get things running. We've had nonstop problems like this since the very start. Files don't upload, anytime I try to fix it I'm confronted with confusing error messages and cryptic things like people telling me "something related to EDR". What the hell is EDR? I just want to view a Word doc.

            I've come to realize that Teams should only be used in large companies who can afford dedicated staff to manage it. But it was certainly sold to us as being easy to use and suitable for a small company.

            • acdha 4 days ago

              EDR: https://en.wikipedia.org/wiki/Endpoint_detection_and_respons...

              I mentioned that because security software blocking things locally or at the network level is such a common source of friction. I don’t think Teams is perfect by any means but the core functionality has been quite stable in personal use, both of my wife’s schools, and my professional use so I wouldn’t conclude that it’s hopeless and always like that.

              • esperent 3 days ago

                Thank you, I appreciate the support. But this doesn't explain the intermittent nature of the issues. For example, just now I tried to open a word file. I got the error message. But then I tried several times and restarted the app twice, and eventually the file did load. It just took five+ minutes of trying over and over.

                I also had to add a new user yesterday, so I went to admin.microsoft.com in Edge. 403 error. Tried Chrome and Firefox. Same. Went back to Edge and suddenly it loaded. Then like an idiot I refreshed, 403 error again. Another five or six refreshes and it finally loaded again and I was able to add the new user. There's never any real error message that would help me debug anything, it's just endless frustration and slowness.

          • bubblesnort 4 days ago

            Really it's anyone using teams on older or cheaper hardware.

            • acdha 4 days ago

              So you’ve tested this with clean installs on unfiltered networks? Just how old is your hardware? It works well on, say, the devices they issue students here so I’m guessing it’d have to be extremely old.

      • matrss 4 days ago

        > [...] and it pretty much just works.

        I beg to differ. Last time I had to use PowerPoint (granted, that was ~3 years ago), math on the slides broke when you touched it with a client that wasn't of the same type as the one that initially put it there. So you would need to use either the web app or the desktop app to edit it, but you couldn't switch between them. Since we were working on the slides with multiple people you also never knew what you had to use if someone else wrote that part initially.

        • hu3 4 days ago

          could it be a font issue?

          • matrss 4 days ago

            If I remember correctly I had created the math parts with the windows PowerPoint app and it was shown more or less correctly in the web app, until I double clicked on it and it completely broke; something like it being a singular element that wasn't editable at all when it should have been a longer expression, I don't remember the details. But I am pretty sure it wasn't just a font issue.

      • ezst 4 days ago

        That's the thing, though, the compat story is terrible. I can't say much about the backwards one, but Microsoft has started the process of removing features from the native versions just to lower the bar for the web one catching up. Even my most Microsoft-enamoured colleagues are getting annoyed by this (and the state of all-MS things going downhill, but that's another story)

        • lostlogin 4 days ago

          > That's the thing, though, the compat story is terrible.

          It really is. With shared documents you just have to give up. If someone edits them on the web, in Teams, in the actual app or some other way like on iOS, it all goes to hell.

          Pages get added or removed, images jump about, fonts change and various other horrors occur.

          If you care, you’ll get ground into the earth.

      • tinco 4 days ago

        To be fair, we were impressed with Google Docs 15 years ago. Not saying office.com isn't impressive, but Google Docs certainly isn't impressive today. My company still uses GSuite, as I don't like being in Microsoft's ecosystem and we don't need any advanced features of our office suite but Google Docs and the rest of the GSuite seem to be intentionally held back to technology of the early 2010's.

        • alexanderchr 4 days ago

          Google docs certainly haven't changed much the last 5-10 years. I wonder if that's an intentional choice, or if it is because those that built it and understand how it works are long gone to work on other things.

          • jakub_g 4 days ago

            Actually I did see a few long awaited improvements landing in gdocs lately (e.g. better markdown support, pageless mode).

            I think they didn't deliver many new features in the early 2020s because they were busy with a big refactoring from DOM to canvas rendering [0].

            [0] https://news.ycombinator.com/item?id=27129858

          • sexy_seedbox 4 days ago

            No more development? Time for Google to kill Google Docs!

      • fulafel 4 days ago

        What's impressive is that MS has such well trained customers that it can get away with extremely buggy and broken web apps. Fundamental brokenness like collaborative editing frequently losing data and thousand cuts of the more mundane bugs.

      • coliveira 4 days ago

        You must be kidding about "just works". There are so many bugs in word and excel that you could spend the rest of your life fixing. And the performance is disastrous.

      • Cthulhu_ 4 days ago

        > No wonder that took a lot of JavaScript!

        To the point where they quickly found the flaws in JS for large codebases and came up with TypeScript. I think. It makes sense that TS came out of the Office for the web project.

    • inglor 4 days ago

      Hey, I worked with Jonathan on 1JS a while ago (on a team, Excel).

      Just a note: OMR (the Office monorepo) is a different (and actually much larger) monorepo than 1JS (which is big on its own).

      To be fair I suspect a lot of the bloat in both originates from the amount of home grown tooling.

      • IshKebab 4 days ago

        I thought Microsoft had one monorepo. Isn't that kind of the point? How many do they have?

        • lbriner 4 days ago

          The point of a monorepo is that all the dependencies for a suite of related products are all in a single repo, not that everything your company produces is in a single repo.

          • cjpearson 3 days ago

            Most people use the "suite of related products" definition of monorepo, but some companies like Google and Meta have a single company-wide repository. It's unfortunate that the two distinct strategies have the same name.

    • coliveira 4 days ago

      Teams is the running version of that repository... It is hard for them even to store on git.

  • triyambakam 4 days ago

    > we have folks in Europe that can't even clone the repo due to it's size.

    What is it about Europe that makes it more difficult? That internet in Europe isn't as good? Actually, I have heard that some primary schools in Europe lack internet. My grandson's elementary school in rural California (population <10k) had internet as far back as 1998.

    • _kidlike 4 days ago

      Let's pretend you didn't write the last 2 sentences...

      first of all "internet in Europe" makes close to zero sense to argue about. The article just uses it as a shortcut to not start listing countries.

      I live in a country where I have 10Gbps full-duplex and I pay 50$ / month, in "Europe".

      The issue is that some countries have telecom lobbies which are still milking their copper networks. Then the "competition committees" in most of these countries are actually working AGAINST the benefit of the public, because they don't allow 1 single company to start offering fiber, because that would be a competition advantage. So the whole system is kinda in a deadlock. In order to unblock, at least 2 telecoms have to agree to release fiber deals together. It has happened in some countries.

      • 0points 4 days ago

        What european countries still dont have fiber?

        //Confused swede with 10G fiber all over the place. Writing from literally the countryside next to nowhere.

        • zelphirkalt 4 days ago

          If you really need it pointed out, take it from a German neighbor: Telekom is running some extortion scheme or so here. Oh we could have gotten fiber to our house already ... if we paid them 800+ Euro! So we rather stick with our 100MBits or so connection that is not fiber but copper. If the German state does not intervene here, or the practices of ISPs and whoever has the power to build fiber changes, we will for the foreseeable future still be on copper.

          Then there are villages which were promised fiber connections, but somehow switching to the fiber connection left them with unstable Internet and often no Internet at all. Saw some documentary about that; it could be fixed by now.

          Putting fiber into the ground also requires a whole lot of effort opening up roads and replacing what's there. Those costs they try to push to the consumers with their 800+ Euro extortion scheme.

          But to be honest, I am also OK with my current connection. All I worry about is it being stable, with no packet loss and no ping spikes. Consistently good connection stability is more important than throughput. Sadly, I cannot buy any of those guarantees from any ISP.

          • 0points 2 days ago

            FWIW, Sweden subsidized fiber digging but we still had to pay 2000 EUR to get it connected.

            Government will pay the extra fees, which can easily end up close to 10000 EUR due to large distances.

            If all you need to pay is 800 EUR, then I don't understand what is your issue? Just pay it.

          • singron 4 days ago

            Is 800 euros that bad? In the US, we were quoted $10k a few years back. Even if fiber is already at the road, $800 is probably a fair price just to trench the line from the road to your home and install an entry point. If they provide free installation, then they have to make up the cost by raising your rates.

            • zelphirkalt 3 days ago

              I think private households paying 800 Euro for what should be public infrastructure, being milked by ISPs is pretty bad.

        • holowoodman 4 days ago

          Germany.

          Deutsche Telekom is the former monopoly that was half-privatized around 1995 or something. The state still owns quite a large stake of it.

          They milk their ancient copper crap for everything they can while keeping prices high.

          They are refusing useful backbone interconnects to monopolize access to their customers (Actually they are not allowed to refuse. They just offer interconnections only in their data centers in the middle of nowhere, where you need to rent their (outrageously priced) rackspace and fibres because there is nothing else. They are refusing for decades to do anything useful at the big exchanges like DECIX).

          And if there should ever be a small competitor that on their own tries to lay fibre somewhere, they quickly lay their own fibre into the open ditches (they are allowed to do that) and offer just enough rebates for their former copper customers to switch to their fibre that the competitor cannot recoup the investment and goes bankrupt. Since that dance is now known to everyone, even the announcement of Telekom laying their own fibres kills the competitors' projects there. So after a competitor's announcement of fibre rollout, Telekom does the same, project dead, no fibre rollout at all.

          Oh, and since it is a partially-state-owned former monopoly/ministry, the state and competition authorities turn a blind eye to all that, when not actively promoting them...

          Then there is the problem of "5G reception" vs. "5G reception with usable bandwidth". A lot of overbooking goes on, many cells don't have sufficient capacity allocated, so there are reports of 4G actually being faster in many places.

          And also, yes, you can get 5G in a lot of actually populated areas. But you certainly will pay through the nose for that, usually you get a low-GB amount of traffic included, so maybe a tenth of the Microsoft monorepo in question. The rest is pay-10Eur-per-GB or something.

          • ahartmetz 4 days ago

            It is almost as bad as you say, except that I recently noticed several instances of competitors offering cheaper fiber than Telekom and surviving. Still, overall fiber buildout is low, like... I looked it up, reportedly 36% now.

          • immibis 4 days ago

            Wait, I live in that area. Does that mean I'm allowed to lay my own fiber into their open ditches too, or do they have special rights no one else has?

            • holowoodman 3 days ago

              Afaik the special right is granted to everyone providing fibre services to the public to be informed about any ditches on public ground being dug and getting the opportunity to throw their fibre in before the ditch is closed again.

        • SSLy 4 days ago

          Germany, GP's situation smells like their policies.

        • ahoka 4 days ago

          I pay 42USD for 250Mbit in a larger Swedish city. What is that magic ISP I should be using?

          • 0points 19 hours ago

            Change landlord. I used to pay about 100 SEK for bahnhof in svenska bostäder before I moved away. It came with public IP and everything.

          • BenjiWiebe 3 days ago

            Sounds like you are already using a magic ISP (rural USA here).

    • yashap 4 days ago

      They’re probably downloading from a server in the states, being much further away makes a big difference with a massive download.

    • nyanpasu64 4 days ago

      I've experienced interruptions mid-clone (with no apparent way to resume them) when trying to clone repos on unreliable connections, and perhaps a similar issue is happening with connections between continents.

      • joshvm 4 days ago

        The only reliable route I’ve found is to use SSH clone. HTTPS is lousy and as you mention, is not resumable. Works fine in Antarctica even over our slower satellite. Doesn’t help if you actually drop, but you can clone to a remote and then rsync everything over time.
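
        The clone-to-a-remote-and-rsync trick aside, a partial/shallow clone can also break an unreliable transfer into smaller, retryable pieces; a rough sketch, assuming the server supports partial clone (the remote URL here is made up):

          git clone --filter=blob:none --depth=1 git@example.com:org/repo.git
          cd repo
          git fetch --deepen=1000   # repeat as needed; a dropped fetch can simply be rerun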

    • p_l 4 days ago

      It's issues cloning super huge repo over crappy protocols across ocean especially when VPNs get included in the problem

    • 59nadir 4 days ago

      Most European countries have connections with more bandwidth and lower base latency for cheaper than the US, so it's not a connection issue. If there is an issue, it's that the repo itself is hosted on the other side of the world, but even so the side note itself is odd.

      • tom_ 4 days ago

        I wouldn't say it's odd at all - it's basically what's justifying actually trying to solve the problem rather than just going "huh... that's weird..." then putting it on the backlog due to it not being a showstopper.

        This sort of thing has been a problem on every project I've worked on that's involved people in America. (I'm in the UK.) Throughput is inconsistent, latency is inconsistent, and long-running downloads aren't reliable. Perhaps I'm over-simplifying, but I always figured the problem was fairly obvious: it's a lot of miles from America to Europe, west coast America especially, and a lot of them are underwater, and you're sharing the conduit with everybody else in Europe. Many ways for packets to get lost (or get held up long enough to count), and frankly it's quite surprising more of them don't.

        (Usual thing for Perforce is to leave it running overnight/weekend with a retry count of 1 million. I'm not sure what you'd do with Git, though? it seems to do the whole transfer as one big non-retryable lump. There must be something though.)

    • gnrlst 4 days ago

      In most EU countries we have multi-gigabit internet (for cheap too). Current offers are around ~5 GBIT speeds for 20 bucks a month.

      • jillesvangurp 4 days ago

        Sadly, I'm in Germany. Which is a third world country when it comes to decent connectivity. They are rolling out some fiber now in Berlin. Finally. But very slowly and not to my building any time soon. Most of the country is limited to DSL speeds. Mobile coverage is getting better but still non existent outside of cities. Germany has borders with nine countries. Each of those have better connectivity than Germany.

        I'm from the Netherlands where over 90% of households now have fiber connections, for example. Here in Berlin it's very hard to get that. They are starting to roll it out in some areas but it's taking very long and each building has to then get connected, which is up to the building owners.

        • aniviacat 4 days ago

          > Mobile coverage is getting better but still non existent outside of cities.

          According to the Bundesnetzagentur over 90% [1] of Germany has 5G coverage (and almost all of the rest has 4G [2]).

          [1] https://www.bundesnetzagentur.de/SharedDocs/Pressemitteilung...

          [2] https://gigabitgrundbuch.bund.de/GIGA/DE/MobilfunkMonitoring...

          • holowoodman 4 days ago

            Those statistics are a half-truth at best.

            The "coverage" they are reporting is not by area but by population. So all the villages and fields that the train or autobahn goes by won't have 5G, because they are in the other 10% because of their very low population density.

            And the reporting comes out of the mobile phone operators' reports and simulations (they don't have to do actual measurements). Since their license depends on meeting a coverage goal, massive over-reporting is rampant. The biggest provider (Deutsche Telekom) is also partially state-owned, so the regulators don't look as closely...

            Edit: accidentally posted this in the wrong comment: Then there is the problem of "5G reception" vs. "5G reception with usable bandwidth". A lot of overbooking goes on, many cells don't have sufficient capacity allocated, so there are reports of 4G actually being faster in many places.

            And also, yes, you can get 5G in a lot of actually populated areas. But you certainly will pay through the nose for that, usually you get a low-GB amount of traffic included, so maybe a tenth of the Microsoft monorepo in question. The rest is pay-10Eur-per-GB or something.

          • jillesvangurp 4 days ago

            I usually lose connectivity on train journeys across Germany. I'm offline most of the way. Even the in train wifi gets quite bad in remote areas. Because they depend on the same shitty mobile networks. There's a stark difference as soon as you cross the borders with other countries. Suddenly stuff works again. Things stop timing out.

            I also deal with commercial customers that have companies in areas with either no or poor mobile connectivity and since we sell mobile apps to them, we always need to double check they actually have a good connection. One of our customers is on the edge of a city with very spotty 4G at best. I recently recommended Star Link to another company that is operating in rural areas. They were asking about offline capabilities of our app. Because they deal with poor connectivity all the time. I made the point that you can get internet anywhere you want now for a fairly reasonable price.

        • barrkel 4 days ago

          When I travel in Germany I use a Deutsche Telekom pay as you go SIM in a 5G hotspot, and generally get about 200Mbit throughput, which is far higher than you can expect any place you're staying to provide. It's €7 a day (or €100 a month) but it's worth it to avoid the terrible internet.

          • zelphirkalt 4 days ago

            Oh, that is an incentive for them not to improve anything. Wouldn't want customers to stop purchasing mobile Internet for 100 Euro a month.

      • n_ary 4 days ago

        Well good for you. On my side of europe, I pay €50/- for a cheap 50Mbps(1 month cancellation notice period). I could get a slightly cheaper 100Mbps from a predator for €20/- for first 6 month but then it goes up to €50/- and they pull bs about not being able to cancel if you even move because your new location is also in their coverage area(over garbage copper) and suffers at least 20 outages per month while there are other providers with much cheaper rates and better service.

        Some EU is still suffering from Telekom copper barons.

      • badgersnake 4 days ago

        Not in the UK. Still on 80Mbit VDSL here.

        • _joel 4 days ago

          You must be unlucky, according to Openreach "fibre broadband is already available in more than 96.59 per cent of the UK."

          • mattlondon 4 days ago

            Is that "fibre" or "full fibre".

            They lied a lot for a good few years saying "OMG fibre broadband!" when in reality it was still copper for the last mile, so that "fibre" connection was actually some ADSL variant limited to 80/20 Mbps.

            Actual full fibre all the way from your home to the internet is I think still quite a way behind. Even in London (London! The capital city with high density) there are places where there are no full fibre options.

            • Deathmax 4 days ago

              According to ThinkBroadband's tracking [1], the headline figures are 85.20% of premises are gigabit capable (FTTP/FTTH/Cable [DOCSIS]) with 71.86% being full fibre.

              [1]: https://www.thinkbroadband.com/news/10343-85-gigabit-coverag...

            • _joel 4 days ago

              Maybe myself and my friends are lucky as we're all on ftth

              • mattlondon 4 days ago

                Only a few I know are on ftth. I guess I live in a fairly affluent area in Zone 3 which is lower density than average - zero flats etc, all just individual houses so perhaps not worth their effort rolling out

          • badgersnake 4 days ago

            Coming next year apparently. I won’t hold my breath.

        • sirsinsalot 4 days ago

          I and many I know have Gb fiber in the UK

    • RadiozRadioz 4 days ago

      At least here in Western Europe, in general the internet is great. Though coverage in rural areas varies by country.

    • johnisgood 4 days ago

      Some countries in Europe (even Poland) definitely offer faster Internet and for cheaper than the US, and without most of the privacy issues that US ISPs have.

    • mattlondon 4 days ago

      I was not sure what this meant either. I know personally I have downloaded and uploaded some very very large files transatlantic (e.g. syncing to cloud storage) with absolutely no issues, so not sure what they are talking about. I guess perhaps there are issues with git cloning such a large amount of data, but that is a problem with git and not the infrastructure.

      FWIW every school I've seen (and I recently toured a bunch looking at them for my kids to start at) all had the internet and the kids were using iPads etc for various things.

    Anecdotally my secondary school (11-18y in UK) in rural Hertfordshire was online in the 1995 region. It was via I think a 14.4 modem and there actually wasn't that much useful material for kids then to be honest. I remember looking at the "non-professional style" NASA website for instance (the current one is obviously quite fancy in comparison, but it used to be very rustic and at some obscure domain). CD-based encyclopedias were all the rage instead around that time IIRC - Encarta et al.

    • heisenbit 4 days ago

      Effective bandwidth can be influenced by round-trip time. Fewer IPv4 addresses means more NAT, with more delay and yet another point where occasionally something can go wrong. Last but not least, there are some areas in the EU, like the Canary Islands, where the internet feels like going over a sat link.

    • nemetroid 4 days ago

      The problem is probably that the repo is not hosted in Europe.

    • o11c 4 days ago

      My knowledge is a bit outdated, but we used to say:

      * in America, peering between ISPs is great, but the last-mile connection is terrible

      * In Europe, the last-mile connection is great, but peering between the ISPs is terrible (ISPs are at war with each other). Often you could massively improve performance by renting a VPS in the correct city and routing your traffic manually.

    • teo_zero 4 days ago

      > > we have folks in Europe that can't even clone the repo due to it's size.

      > I have heard that some primary schools in Europe lack internet.

      Maybe they lack internet but teach their pupils how to write "its".

  • rettichschnidi 4 days ago

    I'm surprised they are actually using Azure DevOps internally. Creating your own hell I guess.

    • jonathanlydall 4 days ago

      I find the "Boards" part of DevOps doesn't work well for us as a small org wanting a less structured backlog, but for components like Pipelines and the Git repositories it's neither here nor there for us.

      What aspects of Azure DevOps are hell to you?

      • rettichschnidi 4 days ago

        Some examples, in no particular order.

        Hampering the productivity:

        - Review messages get sent out before review is actually finished. It should be sent out only once the reviewer has finished the work.

        - Code reviews are implemented in a terrible way compared to GitHub or GitLab.

          - Re-requesting a review once you have implemented proposed changes? Takes a single click on GitHub, but cannot be done in Azure DevOps. I need to e.g. send a Slack message to the reviewer or remove and re-add them as reviewer.
        
          - Knowing to what line of code a reviewer was giving feedback to? Not possible after the PR got updated, because the feedback of the reviewer sticks to the original line number, which might now contain something entirely different.
        
        - Reviewing the commit messages in a PR takes way too many clicks. This causes people to not review the commit messages, letting bad commit messages pass and thus making it harder for future developers trying to figure out why something got implemented the way it did. Examples:

          - Too many clicks to review a commit message: PR -> Commits -> Commit -> Details
        
          - Comments on a specific commit are not shown in the commit's PR
        
        - Unreliable servers. E.g. "remote: TF401035: The object '<snip>' does not exist.\nfatal: the remote end hung up unexpectedly" happens too often on git fetch. Usually works on a 2nd try.

        - Interprets IPv6 addresses in commit messages as emoji. E.g. fc00::6:100:0:0 becomes fc00::60:0.

        - Can not cancel a stage before it actually has started (Wasting time, cycles)

        - Terrible diffs (can not give a public example)

        - Network issues. E.g. checkouts that should take a few seconds take 15+ minutes (can not give a public example)

        - Step "checkout": Changes working folder for following steps (shitty docs, shitty behaviour)

        - The documentation reads as if their creators get paid by the number of words, but not for actually being useful. Whereas GitHub for example has actually useful documentation.

        - PRs always open on "Show everything" instead of "Active comments" (what I want). Resets itself on every reload.

        - Tabs are hardcoded (?) to be displayed as 4 chars - but we want 8 (Zephyr)

        - Re-running a pipeline run (manually) does not retain the resources selected in the last run

        Security:

        - DevOps does not support modern SSH keys, one has to use RSA keys (https://developercommunity.visualstudio.com/t/support-non-rs...). It took them multiple years to allow RSA keys which are not deprecated by OpenSSH due to security concerns (https://devblogs.microsoft.com/devops/ssh-rsa-deprecation/), yet no support for modern algos. This also rules out the usage of hardware tokens, e.g. YubiKeys.

        Azure DevOps is dying. Thus, things will not get better:

        - New, useful features get implemented by Microsoft for GitHub, but not for DevOps. E.g. https://devblogs.microsoft.com/devops/static-web-app-pr-work...

        - "Nearly everyone who works on AzDevOps today became a GitHub employee last year or was hired directly by GitHub since then." (Reddit, https://www.reddit.com/r/azuredevops/comments/nvyuvp/comment...)

        - Looking at Azure DevOps Released Features (https://learn.microsoft.com/en-us/azure/devops/release-notes...) it is quite obvious how much things have slowed down since e.g. 2019.

        Lastly - their support is ridiculously bad.

    • sshine 4 days ago

      > I'm surprised they are actually using Azure DevOps internally. Creating your own hell I guess.

      Even the hounds of hell may benefit from dogfooding.

      • tazjin 4 days ago

        houndfooding?

        • sshine 3 days ago

          Ain't nothing but a hound dog.

  • nixosbestos 3 days ago

    Oh hey I know that name, Stolee. Fellow JSR grad here.

  • jbverschoor 4 days ago

    > those branches that only change CHANGELOG.md and CHANGELOG.json, we were fetching 125GB of extra git data?! HOW THO??

    Unrecognized 100x programmer somewhere lol

  • mattlondon 4 days ago

    I recently had a similar moment of WTF for git in a JavaScript repo.

    Much much smaller of course though. A raspberry pi had died and I was trying to recover some projects that had not been pushed to GitHub for a while.

    Holy crap. A few small JavaScript projects with perhaps 20 or 30 code files and a few thousand lines of code - a couple of tens of KBs of actual code at most - had tens of gigabytes of data in the .git/ folder. Insane.

    In the end I killed the recovery of the entire home dir and had to manually select folders to avoid accidentally trying to recover a .git/ dir as it was taking forever on a poorly SD card that was already in a bad way and I did not want to finally kill it for good by trying to salvage countless gigabytes of trash for git.

  • nsonha 3 days ago

    I think the title misses the "Honey, " part

  • Vilian 3 days ago

    People who use git in monorepos don't understand git

  • EDEdDNEdDYFaN 4 days ago

    better question - does the changelog need to be checked in the first place?

    • DeathMetal3000 4 days ago

      They fixed a bug on a tool that is widely used. In what world is questioning why an organization is checking in a file that you have no context on a “better question”?

  • jakub_g 4 days ago

    Paraphrasing meat of the article:

    - When you have multiple files in the repo whose paths have the same trailing 16 characters, git may calculate deltas against the wrong files, mixing those files up. Here they had multiple CHANGELOG.md files mixed up.

    - So if those files are big and change often, you end up with massive deltas and inflated repo size.

    - There's a new git option (in Microsoft git fork for now) and config to use full file path to calculate those deltas, which fixes the issue when pushing, and locally repacking the repo.

      git repack -adf --path-walk
      git config --global pack.usePathWalk true

    - According to a screenshot, Chromium repacked in this way shrinks from 100GB to 22GB.

    - However AFAIU until GitHub enables it by default, GitHub clones from such repos will still be inflated.

    • kreetx 4 days ago

      I don't think GitHub, or any other git host, will have objections to using it once it's part of mainline git?

      Also, thank you for the TLDR!

      • masklinn 4 days ago

        > I don't think GitHub, or any other git host, will have objections to using it once it's part of mainline git?

        Fixing an existing repository requires a full repack, and for a repository as big as Chromium that still takes more than half a day (56,000 seconds is about 15.5 hours); even if that's an improvement over the previous 3 days, it's a lot of compute.

        From my experience of previous attempts, trying to get Github to run a full repack with harsh settings is extremely difficult (possibly because their infrastructure relies on more loosely packed repositories), I tried to get that for $dayjob's primary repository whose initial checkout had gotten pretty large and got nowhere.

        As of right now, said repository is ~9.5GB on disk on initial clone (full, not partial, excluding working copy). Locally running `repack -adf --window 250` brings it down to ~1.5GB, at the cost of a few hours of CPU.

        The repository does have some of the attributes described in TFA, so I'm definitely looking forward to trying these changes out.

        • leksak 4 days ago

          Wouldn't a potential workaround be to create a new barebones repository and push the repacked one there? Sure, people will have to change their remote origin but if it solves the problem that might be worth the hassle?

          • masklinn 4 days ago

            It breaks the issues, PRs, all the tooling and integration, …

            For now we’re getting by with partial clones, and employee machines being imaged with a decently up to date repository.

    • deskr 4 days ago

      > in Microsoft git fork for now

      Wait, what? Has MS forked git?

      • jakub_g 3 days ago

        MS has had their fork of git for years, and they contributed many performance features for monorepos since then to the mainline.

      • keybored a day ago

        Companies fork Git in order to work on things internally until they are ready to be proposed for inclusion into Git itself. I'm pretty sure that GitHub and GitLab (and others?) do the same thing.

        These are not forks-going-their-own-way forks.

    • jamalaramala 4 days ago

      Thank you to the AI that summarised the article. ;-)

  • jimjimjim 4 days ago

    Did anybody else shudder at "Shrunked"?

    • tankenmate 4 days ago

      Shrunken, shrunked ain't no language I ever heard of.

    • amsterdorn 4 days ago

      Honey, I done shrunked them kids

    • 0points 4 days ago

      English is my third language, also yes.

  • killingtime74 4 days ago

    Shrank

    • tankenmate 4 days ago

      Would be correct if it is "We shrank", but from my poor memory of the terminology that is the transitive form, shrunken is the intransitive form. But once again from my poor memory.

    • darraghenright 4 days ago

      I've spoken English as my native language for almost five decades and I've never seen/heard the word "shranked" before.

      This surely cannot be correct. Even the title of the linked article doesn't use "shranked". What?

      • forgotpwd16 4 days ago

        Commonly (since ca. 19th century), shrank is used as the past tense of shrink, shrunk as the past particle, and shrunken as an adjective. The title of the linked article uses "shrunk" as past tense and the submitted title was changed to "shrunked" for some reason. "Shranked" was not mentioned anywhere. (But "shrinked" has had some use in the past.)

    • peutetre 4 days ago

      I was in the pool!

    • dougthesnails 4 days ago

      I think I prefer shrunked in this context.

    • Sparkyte 4 days ago

      Shrinky dinky

  • AbuAssar 4 days ago

    the gif memes were very distracting...