How We Found 7 TiB of Memory Just Sitting Around

(render.com)

102 points | by anurag a day ago

22 comments

  • shanemhansen a day ago

    The unreasonable effectiveness of profiling and digging deep strikes again.

    • hinkley 5 hours ago

      The biggest tool in the performance toolbox is stubbornness. Without it all the mechanical sympathy in the world will go unexploited.

      There’s about a factor of 3 improvement that can be made to most code after the profiler has given up. That probably means there are better profilers that could be written, but in 20 years of having them I’ve only seen 2 that tried. Sadly, I think flame graphs made profiling more accessible to the unmotivated but didn’t actually improve overall results.

      • Negitivefrags 4 hours ago

        I think the biggest tool is higher expectations. Most programmers really haven't come to grips with the idea that computers are fast.

        If you see a database query that takes 1 hour to run, and only touches a few GB of data, you should be thinking "Well, NVMe bandwidth is multiple gigabytes per second, why can't it run in 1 second or less?"
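
        (Back-of-envelope, with illustrative numbers rather than anything from a real system: a sketch in Go, assuming ~4 GB of data and ~3 GB/s of sequential NVMe read bandwidth.)

            package main

            import "fmt"

            func main() {
                const dataGB = 4.0   // assumed size of the data the query touches
                const nvmeGBps = 3.0 // assumed sequential NVMe read bandwidth
                // Lower bound: the time to stream the data off disk once.
                fmt.Printf("full-scan lower bound: %.2f s\n", dataGB/nvmeGBps) // ~1.33 s
            }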

        The idea that anyone would accept a request to a website taking longer than 30ms (the time it takes for a game to render its entire world, including both the CPU and GPU parts, at 60fps) is insane, and nobody should really accept it, but we commonly do.

        • azornathogron 3 hours ago

          Pedantic nit: At 60 fps the per-frame time is 16.66... ms, not 30 ms. Having said that, a lot of games run at 30 fps, or run different parts of their logic at different frequencies, or do other tricks that mean there isn't exactly one FPS rate the thing is running at.

          • Negitivefrags 2 hours ago

            The CPU part happens on one frame, the GPU part happens on the next frame. If you want to talk about the total time for a game to render a frame, it needs to count two frames.

            • wizzwizz4 an hour ago

              Computers are fast. Why do you accept a frame of lag? The average game for a PC from the 1980s ran with less lag than that. Super Mario Bros had less than a frame between controller input and character movement on the screen. (Technically, it could be more than a frame, but only if there were enough objects in play that the processor couldn't handle all the physics updates in time and missed the v-blank interval.)

              • Negitivefrags an hour ago

                If Vsync is on (which was my assumption in my previous comment), then if your computer is fast enough you might be able to run the CPU and GPU work entirely in a single frame, using Reflex to delay when simulation starts to lower latency. But regardless, you still have a total time budget of 1/30th of a second to do all your combined CPU and GPU work to get to 60fps.
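
                A quick sketch of the budget math being argued over here (numbers only; pipelined CPU/GPU stages at 60fps, no claims about any particular engine):

                    package main

                    import "fmt"

                    func main() {
                        const fps = 60.0
                        frameMs := 1000.0 / fps
                        // Each pipeline stage (CPU or GPU) gets one frame of budget...
                        fmt.Printf("per-stage budget: %.2f ms\n", frameMs) // 16.67 ms
                        // ...but input-to-photon latency spans both stages.
                        fmt.Printf("pipelined CPU+GPU latency: %.2f ms\n", 2*frameMs) // 33.33 ms
                    }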

        • hinkley 2 hours ago

          Lowered expectations come in part from people giving up on theirs. Accepting versus pushing back.

          • antonymoose 2 hours ago

            I have high hopes and expectations; unfortunately my chain of command does not, and is often an immovable object.

            • hinkley an hour ago

              This is a terrible time to tell someone to find a movable object in another part of the org or elsewhere. :/

              I always liked Shaw’s “The reasonable man adapts himself to the world: the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man.”

        • javier2 3 hours ago

          It's also about cost. My gaming computer has 8 cores + 1 expensive GPU + 32GB RAM for me alone. We don't have that per customer.

          • oivey 2 hours ago

            This is again a problem understanding that computers are fast. A toaster can run an old 3D game like Quake at hundreds of FPS. A website primarily displaying text should be way faster. The reasons websites often aren’t have nothing to do with the user’s computer.

            • paulryanrogers 2 hours ago

              That's a dedicated toaster serving only one client. Websites usually aren't backed by bare metal per visitor.

              • oivey an hour ago

                Right. I’m replying to someone talking about their personal computer.

          • avidiax 3 hours ago

            It's also about revenue.

            Uber could run the complete global rider/driver flow from a single server.

            It doesn't, in part because all of those individual trips earn $1 or more each, so it's perfectly acceptable to the business to be far more inefficient and use hundreds of servers for this task.

            Similarly, a small website taking 150ms to render the page only matters if the lost productivity costs more than the engineering time to fix it, and even then, fixing it only makes sense if that engineering time isn't more productively used to add features or reliability.

      • zahlman 4 hours ago

        > The biggest tool in the performance toolbox is stubbornness. Without it all the mechanical sympathy in the world will go unexploited.

        The sympathy is also needed. Problems aren't found when people don't care, or consider the current performance acceptable.

        > There’s about a factor of 3 improvement that can be made to most code after the profiler has given up. That probably means there are better profilers than could be written, but in 20 years of having them I’ve only seen 2 that tried.

        It's hard for profilers to identify slowdowns that are due to the architecture. Making the function do less work to get its result feels different from determining that the function's result is unnecessary.
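
        A toy illustration of that blind spot (hypothetical code, not from the article): a profiler will rank buildReport as the hottest function and help you shave cycles off it, but no flame graph will point out that the caller usually throws the result away.

            package main

            import "fmt"

            // buildReport is deliberately expensive; any profiler will rank it #1.
            func buildReport(n int) []string {
                out := make([]string, 0, n)
                for i := 0; i < n; i++ {
                    out = append(out, fmt.Sprintf("row %d", i))
                }
                return out
            }

            func handle(debug bool) {
                report := buildReport(1_000_000) // always computed...
                if debug {
                    fmt.Println(len(report)) // ...but almost never used
                }
            }

            func main() { handle(false) }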

        • hinkley 2 hours ago

          Architecture, cache eviction, memory bandwidth, thermal throttling.

          All of which have gotten perhaps an order of magnitude worse in the time since I started on this theory.

      • jesse__ 3 hours ago

        Broadly agree.

        I'm curious, what're the profilers you know of that tried to be better? I have a little homebrew game engine with an integrated profiler that I'm always looking for ideas to make more effective.

        • hinkley 2 hours ago

          Clinic.js tried and lost steam. I have a recollection of a profiler called JProfiler that represented space and time as a graph, but also a recollection that they went under. And there is a company selling a product of that name that has been around since that time, but it doesn’t quite look how I recalled, so I don’t know if I was mistaken about their demise or I’ve swapped product names in my brain. It was 20 years ago, which is a long time for mush to happen.

          The common element between attempts is new visualizations. And like drawing a projection of an object in a mechanical engineering drawing, there is no one projection that contains the entire description of the problem. You need to present several and let the brain synthesize the data missing in each individual projection into an accurate model.

  • nitinreddy88 a day ago

    The other way to look at this is to ask why adding a namespace label causes so much memory footprint in Kubernetes. Shouldn't fixing that (which could be a much bigger design change) benefit the whole Kube community?

    • bstack 10 hours ago

      Author here: yeah, that's a good point. tbh I was mostly unfamiliar with Vector, so I took the shortest path to the goal, but that could be an interesting followup. It does seem like there are a lot of bytes per namespace!

  • hinkley 4 hours ago

    Keys require O(log n) space per key, or O(n log n) for the entire data set, simply to avoid key collisions. But human-friendly key spaces grow much, much faster, and I don’t think many people have looked too hard at that.

    There were recent changes to the Node.js Prometheus client that eliminate tag names from the keys used for storing the tag cardinality for metrics. The memory savings weren’t reported, but the CPU savings for recording data points were over 1/3, and about twice that when applied to the aggregation logic.
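
    A sketch of the kind of change being described (my reconstruction of the idea, not the actual prom-client patch): since a metric’s label names are fixed when it is declared, per-series keys only need the values, in a canonical order.

        package main

        import (
            "fmt"
            "strings"
        )

        // Label names are declared once per metric, so they can live on the
        // metric itself instead of being repeated in every series key.
        var labelNames = []string{"method", "status"}

        // Old style: names and values in every key.
        func keyWithNames(values []string) string {
            parts := make([]string, len(values))
            for i, v := range values {
                parts[i] = labelNames[i] + "=" + v
            }
            return strings.Join(parts, ",")
        }

        // New style: values only, matched positionally to labelNames.
        func keyValuesOnly(values []string) string {
            return strings.Join(values, ",")
        }

        func main() {
            v := []string{"GET", "200"}
            fmt.Println(keyWithNames(v))  // method=GET,status=200
            fmt.Println(keyValuesOnly(v)) // GET,200
        }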

    Lookups are rarely O(1), even in hash tables.

    I wonder if there’s a general solution for keeping names concise without triggering transposition or reading comprehension errors. And what the space complexity is of such an algorithm.