Docker Systems Status: Full Service Disruption

(dockerstatus.com)

193 points | by l2dy 5 hours ago

64 comments

  • phillebaba 3 hours ago

    Shameless plug but this might be a good time to install Spegel in your Kubernetes clusters if you have critical dependencies on Docker Hub.

    https://spegel.dev/

    • osivertsson 2 hours ago

      If it really is fully open-source please make that more visible on your landing page.

      It is a huge deal if I can start investigating and deploying such a solution as a techie right away, compared to having to go through all the internal hoops for a software purchase.

      • CaptainOfCoit 2 hours ago

        How hard is it to go to the GitHub repository and open the LICENSE file that is in almost every repository? It would have taken you less time than writing that comment, and would have shown you it's under MIT.

        • rplnt an hour ago

          It's not entirely uncommon to only have parts of the solution open. So a license on one repo might not be the whole story and looking further would take more time than giving a good suggestion to the author.

        • kelvinjps10 32 minutes ago

          Also it's good feedback for the developer of this solution

    • mike-cardwell 21 minutes ago

      This looks good, but we're using GKE and it looks like it only works there with some hacks. Is there a timeline to make it work with GKE properly?

    • storm1er an hour ago

      What's the difference with kuik? Spegel seems too complicated for my homelab, but could be a nice upgrade for my company

      Kuik: https://github.com/enix/kube-image-keeper?tab=readme-ov-file...

      • phillebaba 40 minutes ago

        It's been a while since I looked at kuik, but I would say the main difference is that Spegel doesn't do any of the pulling or storage of images. Instead it relies on Containerd to do it for you. This also means that Spegel does not have to manage garbage collection. The nice thing with this is that it doesn't change how images are initially pulled from upstream and is able to serve images that exist on the node before Spegel runs.

        Also, it looks like kuik uses CRDs to store information about where images are cached, while Spegel uses its own p2p solution to do the routing of traffic between nodes.

        If you are running k3s in your homelab you can enable Spegel with a flag as it is an embedded feature.

    • CaptainOfCoit 2 hours ago

      There are a couple of alternatives that mirror more than just Docker Hub too, most of them pretty bloated and enterprisey, but they do what they say on the tin and have saved me more than once. Artifactory, Nexus Repository, Cloudsmith and ProGet are some of them.

      • phillebaba 2 hours ago

        Spegel does not only mirror Docker Hub, and works a lot differently than the alternatives you suggested. Instead of being yet another failure point closer to your production environment, it runs a distributed stateless registry inside of your Kubernetes cluster. By piggybacking off of Containerd's image store it will distribute already pulled images inside of the cluster.

        • CaptainOfCoit 2 hours ago

          I'll be honest and say I hadn't heard of Spegel before, and just read the landing page which says "Speed up container pulls and minimize downtime with a stateless peer-to-peer OCI registry mirror for efficient image distribution", so it isn't exactly clear you can use it for more things than container images.

  • darkamaul 12 minutes ago

    For other people impacted, what helped me this morning was to use `ghcr.io`, although this is not a one-to-one replacement.

    Ex: `docker pull ghcr.io/linuxcontainers/debian-slim:latest`

  • atymic 5 hours ago
    • reader_1000 2 hours ago

      > We have identified the underlying issue with one of our cloud service providers.

      Isn't everyone using multiple cloud providers nowadays? Why are they affected by a single cloud provider's outage?

      • lvncelot 2 hours ago

        I think more often than not, companies are using a single cloud provider, and even when multiple are used, it's either different projects with different legacy decisions or a conscious migration.

        True multi-tenancy is not only very rare, it's an absolute pain to manage as soon as people start using any vendor-specific functionality.

      • rcxdude 2 hours ago

        Because it's hard enough to distribute a service across multiple machines in the same DC, let alone across multiple DCs and multiple providers.

      • postexitus 2 hours ago

        Not only are they not using multiple cloud providers, they are not even using multiple cloud locations.

      • madisp 40 minutes ago

        they are using multiple cloud providers, but judging by the cloudflare r2 outage affecting them earlier this year I guess all of them are on the critical path?

      • nobleach an hour ago

        Looking at the landscape around me, no. Everyone is in crisis cost-cutting, "gotta show that same growth the C-suite saw during Covid" mode. So being multi-provider, and even in some cases, being multi-regional, is now off the table. It's sad because the product really suffers. But hey, "growth".

  • ic4l 4 hours ago

    This broke our builds since we rely on several public Docker images, and by default, Docker uses docker.io.

    Thankfully, AWS provides a docker.io mirror for those who can't wait:

      FROM public.ecr.aws/docker/library/{image_name}
    
    In the error logs, the issue was mostly related to the authentication endpoint:

    https://auth.docker.io → "No server is available to handle this request"

    After switching to the AWS mirror, everything built successfully without any issues.
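For anyone who wants to script the switch described above, here is a minimal sketch in Python. It assumes the mirror prefix quoted in the comment and assumes that only bare image names (no `/`) correspond to Docker Hub official images. The function name `rewrite_from_lines` is made up for illustration. Whether a given image and tag actually exist on the mirror is not checked.

```python
import re

# Prefix quoted in the comment above; everything else here is an assumption.
MIRROR_PREFIX = "public.ecr.aws/docker/library/"

# FROM [--flag=value ...] image [AS name]
FROM_RE = re.compile(r"^(\s*FROM\s+(?:--\S+\s+)*)(\S+)(.*)$", re.IGNORECASE)
AS_RE = re.compile(r"\sAS\s+(\S+)", re.IGNORECASE)


def rewrite_from_lines(dockerfile_text, prefix=MIRROR_PREFIX):
    """Prefix bare (official) base images with a mirror; leave the rest alone.

    Note: the result is joined with "\n", so a trailing newline is not kept.
    """
    stages = set()  # names of earlier build stages, which are not images
    rewritten = []
    for line in dockerfile_text.splitlines():
        match = FROM_RE.match(line)
        if match:
            head, image, tail = match.groups()
            is_official = "/" not in image
            is_special = image == "scratch" or image.startswith("$")
            if is_official and not is_special and image.lower() not in stages:
                line = f"{head}{prefix}{image}{tail}"
            alias = AS_RE.search(tail)
            if alias:
                stages.add(alias.group(1).lower())
        rewritten.append(line)
    return "\n".join(rewritten)
```

Images that already name a registry or a namespace, `scratch`, build-arg references and earlier build stages are left untouched, since the `library/` path only makes sense for official images.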

    • CamouflagedKiwi 3 hours ago

      Mild irony that Docker is down because of the AWS outage, but the AWS mirror repos are still running...

    • geostyx 2 hours ago

      public.ecr.aws was failing for me earlier with 5XX errors due to the AWS outage: https://news.ycombinator.com/item?id=45640754

    • firloop 3 hours ago

      I wasn't able to get this working, but I was able to use Google's mirror[0] just fine.

      Just had to change

          FROM {image_name}
      
      to

          FROM mirror.gcr.io/{image_name} 
      
      Hope this helps!

      [0]: https://cloud.google.com/artifact-registry/docs/pull-cached-...

      • ic4l 3 hours ago

        We tried this initially

          FROM mirror.gcr.io/{image_name}
        
        We received

          failed to resolve source metadata for mirror.gcr.io/
        
        So it looks like these services may not be true mirrors, and just functioning as a library proxy with a cache.

        If your image is not cached on one of these then you may be SOL.

        • da768 2 hours ago

          During the last Docker Hub outage we found Google mirrors lost all image tags after a while. Image digest references would probably work
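A digest reference replaces the tag with `@sha256:...`. Below is a minimal sketch in Python of building one from a tagged reference and a digest you already know (for example, one recorded from an earlier pull). The function name `pin_to_digest` is made up for illustration, and whether a given mirror will serve the digest is, as the comment says, only probable.

```python
def pin_to_digest(reference, digest):
    """Return `reference` with its tag (if any) replaced by `@digest`."""
    hex_part = digest[len("sha256:"):]
    if not digest.startswith("sha256:") or len(hex_part) != 64:
        raise ValueError("expected a digest of the form sha256:<64 hex chars>")
    name = reference.split("@", 1)[0]  # drop an existing digest, if present
    last_slash = name.rfind("/")
    colon = name.rfind(":")
    # A colon after the last slash is a tag; an earlier colon is a registry port.
    if colon > last_slash:
        name = name[:colon]
    return f"{name}@{digest}"
```

With a placeholder digest, `pin_to_digest("mirror.gcr.io/library/debian:bookworm", digest)` yields `mirror.gcr.io/library/debian@sha256:...`, which can then be used in a `FROM` line or a `docker pull`.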

  • helpfulmandrill 3 hours ago

    I wonder if this is why I also can't log in to O'Reilly to do some "Docker is down, better find something to do" training...

    • p0w3n3d an hour ago

      Just install a pull-through proxy that will store all the packages recently used.

  • KronisLV 4 hours ago

    I guess people who are running their own registries like Nexus and build their own container images from a common base image are feeling at least a bit more secure in their choice right now.

    Wonder how many builds or redeployments this will break. Personally, nothing against Docker or Docker Hub of course, I find them to be useful.

    • tom1337 2 hours ago

      We are using base images but unfortunately some github actions are pulling docker images in their prepare phase - so while my application would build, I cannot deploy it because the CI/CD depends on dockerhub and you cannot change where these images are pulled from (so they cannot go through a pull-through cache)…

    • Sphax 4 hours ago

      We run Harbor and mirror every base image using its Proxy Cache feature, it's quite nice. We've had this setup for years now and while it works fine, Harbor has some rough edges.

      • thephyber 2 hours ago

        I came here to mention that any non-trivial company depending on Docker images should look into a local proxy cache. It’s too much infra for a solo developer / tiny organization, but is a good hedge against DockerHub, GitHub repo, etc downtime and can run faster (less ingress transfer) if located in the same region as the rest of your infra.

    • frenkel 4 hours ago

      Only if they get their base images from somewhere else...

      • bravetraveler 3 hours ago

        Pull-through caches are still useful even when the upstream is down... assuming the image(s) were pulled recently. The HEAD to upstream will obviously fail [when checking currency], but the software is happy to serve what it has already pulled.

        Depends on the implementation, of course: I'm speaking to 'distribution/distribution', the reference. Harbor or whatever else may behave differently, I have no idea.
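The fallback behaviour described above can be illustrated with a toy in-memory cache in Python. This is only a sketch of the idea and not how 'distribution/distribution' is implemented. The class name and the `fetch_upstream` callable are hypothetical stand-ins for a real registry client.

```python
class PullThroughCache:
    """Toy pull-through cache: prefer upstream, fall back to what was pulled before."""

    def __init__(self, fetch_upstream):
        # fetch_upstream: callable taking a reference and returning bytes;
        # it is assumed to raise OSError (e.g. ConnectionError) when upstream is down.
        self._fetch_upstream = fetch_upstream
        self._store = {}

    def get(self, reference):
        try:
            blob = self._fetch_upstream(reference)
        except OSError:
            # Upstream unreachable: serve the cached copy if there is one,
            # otherwise let the error propagate to the caller.
            if reference in self._store:
                return self._store[reference]
            raise
        self._store[reference] = blob  # refresh the cache on every successful pull
        return blob
```

The key property is the one from the comment: an image that was pulled at least once stays available during an upstream outage, while anything never pulled still fails.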

    • nusl 3 hours ago

      Currently unable to do much of anything new in dev/prod environments without manual workarounds. I'd imagine the impact is pretty massive.

      Aside: seems Signal is also having issues. Damn.

      • cebert 3 hours ago

        I’m not sure that the impact will be that big. Most organizations have their own mirrors for artifacts.

        • VenturingVole 3 hours ago

          From what I've seen: I highly doubt it.

          Edit to add: This might spur on a few more to start doing that, but people are quick to forget/prioritise other areas. If this keeps happening then it will change.

        • nusl 3 hours ago

          Yeah, perhaps. I don't know how many folks host mirrors. Most places I've worked for didn't, though this is anecdotal.

          • phillebaba 3 hours ago

            I would say most people would say it's best practice while a minority actually does it.

            • CaptainOfCoit 2 hours ago

              Seems related to size and/or maturity if anything. I haven't seen any startups less than five years old doing anything like that, but I also haven't seen any huge enterprise not doing that, YMMV.

    • jsmeaton 3 hours ago

      Guess where we host nexus..

  • l2dy 2 hours ago

    Recovering as of October 20, 2025 09:43 UTC

    > [Monitoring] We are seeing error rates recovering across our SaaS services. We continue to monitor as we process our backlog.

  • dd_xplore 3 hours ago

    Does it decrease AWS's nine 9s?

    • speedgoose 2 hours ago

      The marketing department did the maths and they said no.
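For scale, the downtime budget implied by a given number of nines can be worked out directly. A small sketch in Python, assuming a 365-day year (the function name is made up for illustration):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap years


def allowed_downtime_seconds(nines):
    """Yearly downtime budget for an availability of `nines` nines."""
    # n nines of availability leaves an unavailability fraction of 10^-n.
    return SECONDS_PER_YEAR / 10 ** nines
```

Under that assumption, nine nines allows roughly 0.03 seconds of downtime per year, while three nines allows 31,536 seconds (about 8.76 hours).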

      • nobleach an hour ago

        "MOST of the time" we're nine 9s.

  • jdthedisciple 4 hours ago

    So thus far today outages are reported from

    - AWS

    - Vercel

    - Atlassian

    - Cloudflare

    - Docker

    - Google (see downdetector)

    - Microsoft (see downdetector)

    What's going on?

    • d4rkp4ttern an hour ago

      Reddit appears to be only semi-operational. Frequent “rate limit” errors and empty pages while just browsing. Not sure if related.

    • ta1243 4 hours ago

      Or they all rely on AWS, because over the last 15 years we've built an extremely fragile interconnected global system in the pursuit of profit, austerity, and efficiency

      • benrutter 3 hours ago

        Wait, Google and Microsoft rely on AWS? That seems unlikely? (does it? I wouldn't really know to be honest)

        • ssl-3 3 hours ago

          In terms of user reports: Some users don't know what the hell is going on. This is a constant.

          For instance: When there's a widespread Verizon cellular outage, sites like downdetector will show a spike in Verizon reports.

          But such sites will also show a spike in AT&T and T-Mobile reports. Even though those latter networks are completely unaffected by Verizon's back-end issues, the graphs of user reports are consistently shaped the same for all 3 carriers.

          This is just because some of the users doing the reporting have no clue.

          So when the observation is "AWS is in outage and people are reporting issues at Google and Microsoft," then the last two are often just factors of people being people and reporting the wrong thing.

          (You're hanging out on HN, so there's very good certainty that you know precisely what cell carrier you're using and also can discern the difference betwixt an Amazon, a Google, and a Microsoft. But lots of other people are not particularly adept at making these distinctions. It's normal and expected for some of them to be this way at all times.)

        • thephyber 2 hours ago

          It’s very likely they’ve bought companies that were built on AWS and haven’t migrated to use their homegrown cloud platforms.

        • ta1243 3 hours ago

          More likely the outage reports for google and microsoft are based around systems which also include aws

        • CrayKhoi 3 hours ago

          They might be using third party services that rely on AWS.

    • throw-10-13 2 hours ago

      dns outage at aws exposing how overly centralized our infra is

  • sschueller 3 hours ago

    What are good proxy/mirror solutions to mitigate such issues? Best would be an all-in-one solution that, for example, also handles Node.js, Packagist, etc.

    • bravetraveler 3 hours ago

      Pulp is a popular project for 'one stop shop', I believe. Personally, I've always used project-specific solutions like 'distribution/distribution' for containers from the CNCF. This allows for pull-through caching with relatively little setup work.

  • conradfr 3 hours ago

    Is there a built-in way to bypass the request to the registry if your base layers are cached?

  • Zekio 2 hours ago

    what good options are there for container registry proxies / caches to protect against something like this?

  • danvesma 4 hours ago

    ...well this explains a lot about how my morning is going...

  • wolfgangbabad 3 hours ago

    even reddit throws a lot of 503s when adding/editing comments

    • throw-10-13 2 hours ago

      reddit is always going down, that's the least surprising thing about this