A 40-line fix eliminated a 400x performance gap

(questdb.com)

268 points | by bluestreak 12 hours ago

56 comments

  • ot 10 hours ago

    You can go even faster, to about 8ns (almost another 10x improvement), by using software perf events: PERF_COUNT_SW_TASK_CLOCK is thread CPU time; it can be read through a shared page (so no syscall, see perf_event_mmap_page), and then you add the delta since the last context switch with a single rdtsc call inside a seqlock.

    This is not well documented unfortunately, and I'm not aware of open-source implementations of this.

    EDIT: Or maybe not; I'm not sure whether PERF_COUNT_SW_TASK_CLOCK lets you select only user time. The kernel can definitely do it, but I don't know if the wiring is there. It definitely works for overall thread CPU time, though.

    • jerrinot 10 hours ago

      That's a brilliant trick. The setup overhead and permission requirements for perf_event might be heavy for arbitrary threads, but for long-lived threads it looks pretty awesome! Thanks for sharing!

      • ot 10 hours ago

        Yes you need some lazy setup in thread-local state to use this. And short-lived threads should be avoided anyway :)

        • catlifeonmars 3 hours ago

          I guess if you need the concurrency/throughput, you should use a userspace green-thread implementation. I'd guess most green-thread implementations multiplex onto long-running OS threads anyway.

          • jerrinot 2 hours ago

            In a system with green threads, you typically want the CPU time of the fiber or tasklet rather than the carrier thread. In that case, you have to ask the scheduler, not the kernel.

    • nly 2 hours ago

      Why do you need a seqlock? To make sure you're not context switched out between the read of the page value and the rdtsc?

      Presumably you mean you just double check the page value after the rdtsc to make sure it hasn't changed and retry if it has?

      Tbh I thought clock_gettime was a vDSO-based virtual syscall anyway

  • shermantanktop 9 hours ago

    Flamegraphs are wonderful.

    Me: looks at my code. "sure, ok, looks alright."

    Me: looks at the resulting flamegraph. "what the hell is this?!?!?"

    I've found all kinds of crazy stuff in codebases this way. Static initializers that aren't static, one-line logger calls that trigger expensive serialization, heavy string-parsing calls that don't memoize patterns, etc. Unfortunately some of those are my fault.

    • wging 8 hours ago

      I also like icicle graphs for this. They're flamegraphs, but aggregated in the reverse order. (I.e. if you have calls A->B->C and D->E->C, then both calls to C are aggregated together, rather than being stacked on top of B and E respectively. It can make it easier to see what's wrong when you have a bunch of distinct codepaths that all invoke a common library where you're spending too much time.)

      Regular flamegraphs are good too, icicle graphs are just another tool in the toolbox.

      • pests 4 hours ago

        So someone else linked the original flamegraph site [0], and it describes icicle graphs as "inverting the y axis", but that's not all that's happening, right? You bucket the stacks top-down as opposed to bottom-up, correct?

        [0] https://www.brendangregg.com/flamegraphs.html

    • tempaccsoz5 7 hours ago

      Also cool that when you open it in a new tab, the svg [0] is interactive! You can zoom in by clicking on sections, and there's a button to reset the zoom level.

      [0]: https://questdb.com/images/blog/2026-01-13/before.svg

    • jabwd 8 hours ago

      I might be very wrong in every way, but string parsing/manipulation plus memoization sounds like a strange combo. For the first, you know you're already doing expensive allocations, and the second isn't a pattern I really see outside of JS codebases. Could you provide more context on how this actually bit you? Memoizing strings seems like complicated, error-prone, "welp, it feels better now" territory to me, so I'm genuinely curious.

      • shermantanktop 6 hours ago

        In Java it can be a bad toString() implementation hiding behind a + used for string assembly.

        Or another great one: new instances of ObjectMapper created inside a method for a single call and then thrown away.

        • shermantanktop 6 hours ago

          To be clear this is often sloppy code that shouldn’t have been written. But in a legacy codebase this stuff can easily happen.

      • tyingq 7 hours ago

        > but the 2nd is also not a pattern I really see apart from in JS codebases.

        If you're referring to "one-line logger calls that trigger expensive serialization", it's also common in java.

    • sroerick 7 hours ago

      I've never used flamegraphs but would like to know about them. Can you explain more? Or where should I start?

  • jerrinot 11 hours ago

    Author here. After my last post about kernel bugs, I spent some time looking at how the JVM reports its own thread activity. It turns out that "What is the CPU time of this thread?" is/was a much more expensive question than it should be.

    • jacquesm 11 hours ago

      I don't think it is possible to talk about fractions of nanoseconds without having an extremely good idea of the stability and accuracy of your clock. At best I think you could claim there is some kind of reduction but it is super hard to make such claims in the absolute without doing a massive amount of prep work to ensure that the measured times themselves are indeed accurate. You could be off by a large fraction and never know the difference. So unless there is a hidden atomic clock involved somewhere in these measurements I think they should be qualified somehow.

      • rcxdude 10 hours ago

        Stability and accuracy, when applied to clocks, are generally about dynamic range, i.e. how good is the scale with which you are measuring time. So if you're talking about nanoseconds across a long time period, seconds or longer, then yeah, you probably should care about your clock. But when you're measuring nanoseconds out of a millisecond or microsecond, it really doesn't matter that much and you're going to be OK with the average crystal oscillator in a PC. (and if you're measuring a 10% difference like in the article, you're going to be fine with a mechanical clock as your reference if you can do the operation a billion times in a row).

        • jacquesm 10 hours ago

          This setup is a user-space program, on a machine not exclusively dedicated to the test, with all kinds of interrupts (and other tasks) running left, right, and center through the software under test.

    • Neywiny 11 hours ago

      Did you look into the large spread on your distributions? Some of these span multiple orders of magnitude which is interesting

      • jerrinot 11 hours ago

        Fair point. These were run on a standard dev workstation under load, which may account for the noise. I haven't done a deep dive into the outliers yet, but the distribution definitely warrants a more isolated look.

    • 6r17 10 hours ago

      Very thankful for the one-liner TL;DR.

      edit: I had an afterthought about this because it ended up being a low-quality comment;

      a TL;DR like this adds a lot of value to the content, especially on HN: it gives you momentum and lets you focus on the reading.

      The short form felt like that cool friend giving you a heads up.

      • jerrinot 10 hours ago

        I was unsure whether to post it or not so I am glad you found it useful!

        • 6r17 10 hours ago

          I have that 10-30s window to fill while Claude is loading some stuff; the one-liner is exactly what fits in that window. It makes me wonder about the original idea of Twitter, now that I think of it, but since it's not the same kind of content I don't bother with it. It really feels like "here is the stuff, here's more about it if you want to". I really appreciate that form and will definitely use the same format myself.

    • abicklefitch 7 hours ago

      Quelle surprise.

  • furyofantares 9 hours ago

    > Flame graph image

    > Click to zoom, open in a new tab for interactivity

    I admit I did not expect "Open Image in New Tab" to do what it said on the tin. I guess I was aware that it was possible with SVG but I don't think I've ever seen it done and was really not expecting it.

    • jerrinot 9 hours ago

      Courtesy of Brendan Gregg and his flamegraph.pl scripts: https://github.com/brendangregg/FlameGraph

      Normally, I use the generator included in async-profiler. It produces interactive HTML. But for this post, I used Brendan’s tool specifically to have a single, interactive SVG.

  • pjmlp 2 hours ago

    Which goes to show that writing in C, C++, or whatever systems language isn't automatically blazing fast; it depends on what is being done.

    Very interesting read.

  • amelius 39 minutes ago

    It's kinda crazy the amount of plumbing required to get a few bits across the CPU.

  • jonasn 2 hours ago

    Author of the OpenJDK patch here.

    Thanks for the write-up, Jaromir :) For those interested, I explored the overhead of reading /proc, including eBPF profiling and the history behind the poorly documented user-space ABI.

    Full details in my write-up: https://norlinder.nu/posts/User-CPU-Time-JVM/

    • jerrinot an hour ago

      Hi Jonas, thanks for the work on OpenJDK and the post! I swear I hadn't seen your blog :) I finished my draft around Christmas and it’s been in the queue since. Great minds think alike, I guess.

      edit: I just read your blog in full and I have to say I like it more than mine. You put a lot more rigor into it. I’m just peeking into things.

      edit2: I linked your article from my post.

  • higherhalf 10 hours ago

    clock_gettime() goes through vDSO, avoiding a context switch. It shows up on the flamegraph as well.

  • Ono-Sendai 2 hours ago

    "look, I'm sorry, but the rule is simple: if you made something 2x faster, you might have done something smart. if you made something 100x faster, you definitely just stopped doing something stupid"

    https://x.com/rygorous/status/1271296834439282690

  • goodroot 9 hours ago

    The QuestDB team are among the best doing it.

    Love the people and their software.

    Great blog Jaromir!

  • ee99ee 11 hours ago

    This is such a great writeup

  • otterley 8 hours ago

    It took seven years to address this concern following the initial bug report (2018). That seems like a long time, considering that CPU-time instrumentation can sit in the hot path of profiled code.

    • loeg 7 hours ago

      400x slower than 70ns is still only 28us. How often is the JVM calling this function?

      • otterley 5 hours ago

        It depends. If you’re doing continuous profiling, it’d make a call to get the current time at every method entry and exit, each of which could then add a context switch. In an absolute sense it appears to be small, but it could really add up.

        This is what flame graphs are super helpful for, to see whether it’s really a problem or not.

        Also, remember that every extra moment running instructions is a lost opportunity to put the CPU to sleep, so this has energy efficiency impact as well.

        • loeg 4 hours ago

          If it's calling it twice per function, that's enormously expensive and this is a major win.

  • xthe 9 hours ago

    This is a great example of how a small change in the right place can outweigh years of incremental tuning.

    • nomel 9 hours ago

      I don't think I've ever seen less than 10x speedup after putting some effort into improving performance of "organic"/legacy code. It's always shocking how slow code can be before anyone complains.

  • squirrellous 5 hours ago

    Does anyone knowledgeable know whether it's possible to drastically reduce the overhead of reading from procfs? IIUC everything in it is in memory, so there's no real reason reading some data should take on the order of 10us.

  • tomiezhang 5 hours ago

    cool