How we made a Ruby method 200x faster

(campsite.com)

42 points | by nholden 5 days ago

37 comments

  • chikere232 4 hours ago

    The title kinda glossed over the fact that they started out with working, fast code, and then broke it. Sure, their fix was faster than their most broken version, but it's less impressive than starting with slow code and improving it.

    • bastawhiz an hour ago

      They said they made the change because the code was starting to become hard to maintain. That's not a terrible reason for refactoring.

      • notjoemama 28 minutes ago

        I think they were referring to the degree of speed up.

        • bastawhiz 8 minutes ago

          The degree of speedup is the refactored code being fixed to not be slow.

  • cluckindan 4 hours ago

    This should have been obvious before the fact to anyone who understands how CSS selectors work in browsers.

    As in, they are matched right-to-left, which implies that a selector like “p a” first selects all the <a> nodes, and for each of them, it then traverses up the DOM tree until it encounters a <p> node (selector matches) or the root node (selector doesn’t match).

    That said, the traversing shouldn’t happen for plain tag selectors like “h1”. There must be something wrong with the library they used.
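
    The right-to-left walk described above can be sketched in plain Ruby (a toy `Node` with a parent pointer; purely illustrative, not any engine's actual code):

```ruby
# Toy sketch of right-to-left descendant matching (not real engine code).
# A node knows only its tag name and its parent.
Node = Struct.new(:tag, :parent)

# For a selector like "p a": check the rightmost token against the node
# itself, then walk up the tree looking for the ancestor token.
def matches_descendant?(node, ancestor_tag, self_tag)
  return false unless node.tag == self_tag

  current = node.parent
  while current
    return true if current.tag == ancestor_tag
    current = current.parent
  end
  false # hit the root without finding the ancestor
end

root   = Node.new("html", nil)
para   = Node.new("p", root)
a_in_p = Node.new("a", para)   # matches "p a"
a_bare = Node.new("a", root)   # does not match "p a"
```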

    • rco8786 3 hours ago

      > This should have been obvious before the fact to anyone who understands how CSS selectors work in browsers.

      For sure, but that's somewhat reductive. This is not exactly common knowledge, and certainly not something you'd expect to immediately spring to mind for any given engineer.

    • the_other 3 hours ago

      Is that really how it works in browsers and other rendering engines?

      Intuition suggests to me that it wouldn’t start with CSS and then find all the matching DOM nodes. I would expect it started at each DOM node and then found the CSS rules which might apply.

      So “I’m adding an A to the tree; what are all the CSS rules with A or * as the rightmost token; which of that set applies to my current A; apply the rules in that subset”. Going depth-first into the DOM like this should result in skipping redundant CSS, and (as my imagination draws it) reduce DOM traversals.

      • esprehn 2 hours ago

        There are three different modes of running a selector in typical browsers:

          (a) Element#matches
          (b) Element#querySelector(All)
          (c) By the engine for updating style and layout
        
        
        The GP seems to be talking about (b), but even then browsers are checking each element one by one, not advancing through the selector state machine in parallel for every element. (There's one exception: the old Cobalt, which did advance the state machines in parallel, IIRC.)

        (a) and (c) are conceptually very similar except that when doing (c) you're checking many elements at the same time so browsers will do extra upfront costs like filling bloom filters for ancestors or index maps for nth-child.
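
        The ancestor bloom filter mentioned here can be sketched roughly in Ruby (a simplified toy, not any browser's implementation): each ancestor feature sets a couple of bits, so many descendant selectors can be rejected without walking up the tree.

```ruby
# Simplified toy of an ancestor bloom filter (not any browser's code).
# Each ancestor feature (just tag names here) sets two bits in a 64-bit
# mask; a "no" answer is definite, while a "yes" may be a false positive
# that falls back to the full ancestor walk.
class AncestorFilter
  def initialize
    @bits = 0
  end

  def add(feature)
    @bits |= mask_for(feature)
  end

  def may_contain?(feature)
    m = mask_for(feature)
    (@bits & m) == m
  end

  private

  # Derive two bit positions from the feature's hash.
  def mask_for(feature)
    h = feature.hash
    (1 << (h % 64)) | (1 << ((h >> 6) % 64))
  end
end

filter = AncestorFilter.new
%w[html body article p].each { |tag| filter.add(tag) }
```

A "p a" check on an <a> node can then skip the ancestor walk entirely whenever `may_contain?("p")` is false.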

        In TFA they're doing .matches(), which I would expect to be slower than a hash map lookup, but for a simple selector like they're doing (just tag name) it shouldn't do much more than:

          (1) Parse the selector, hopefully cache that in an LRU
          (2) Execute the selector state machine against the element 
            (2.1) Compare tagName in the selector
        
        
        Apparently Nokogiri implements CSS matching in a very inefficient way, though, by collecting ancestors and then converting the CSS into XPath and matching that:

        https://github.com/sparklemotion/nokogiri/blob/e8d30a71d70b2...

        https://github.com/sparklemotion/nokogiri/blob/e8d30a71d70b2...

        I'd expect that to be an order of magnitude slower than what a browser does.

      • cluckindan 2 hours ago

        In browsers, DOM parsing starts before (all) CSS is loaded and parsed. Also, the sizes of elements in the flow are (by default) dictated by the text content, so it really does not make sense to try to paint a page in a root-to-leaf order.

  • Alifatisk 5 hours ago

    That's a huge improvement but damn, the fixed code didn't look any better in my eyes.

    Going from

      HANDLERS = [
        Text,
        List,
        ListItem,
        Code,
        # ...
      ].freeze
    
    to

      HANDLERS_BY_NODE_NAMES = [
        Text,
        List,
        ListItem,
        Code,
        # ...
      ].each_with_object({}) do |handler, result|
        handler::NODE_NAMES.each { |node_name| result[node_name] = handler }
      end.freeze

    • viraptor 5 hours ago

      I'd go with this since it's not performance critical code. Not sure if it's that more readable, but I like it better:

          BY_NODE_NAMES = HANDLERS.map {|h|
            h::NODE_NAMES.map {|n| [n, h]}
          }.flatten(1).to_h

      • ngcazz 43 minutes ago

        I think we can go harder on the std lib :)

          HANDLERS.flat_map { _1.node_names.index_with(_1) }.inject(&:merge)
        
        
        (nb: assuming there exists a `.node_names` to expose the constant... just because I like always using method calls)

    • masklinn 5 hours ago

      A bigger question for me would be why the handlers don’t register themselves. It should be a very small amount of meta-programming, and would avoid having to repeat the handlers to register them.
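
      A minimal sketch of what that could look like, assuming each handler declares the node names it covers (class and method names here are illustrative, not the article's code):

```ruby
# Sketch of self-registering handlers (illustrative names, not the
# article's actual classes). Declaring the node names a handler covers
# registers it as a side effect, so no separate handler list is needed.
class Base
  REGISTRY = {}

  def self.handles(*node_names)
    node_names.each { |name| REGISTRY[name] = self }
  end
end

class CodeHandler < Base
  handles "code", "pre"
end

class TextHandler < Base
  handles "#text"
end
```

The replies below point out the trade-offs: `REGISTRY` is still global state, just populated implicitly.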

      • IshKebab 4 hours ago

        Self-registration is usually an anti-pattern in my experience because it introduces globals. Sometimes you can't avoid it, but if you only have a few things to register it's usually better to just list them explicitly.

        • barrkel 4 hours ago

          Self-registration has a lot of problems.

          Global state, like you say - you can't merge two apps that use the same registration mechanism but use incompatible registered sets.

          Lack of discoverability in maintenance - it's a kind of COME FROM but for data.

          A barrier to optimization: it's not clear what will break when you remove a dependency, and you have to eagerly run all initializers everywhere - if you try to be clever and lazy, you have to pick and choose, and you're back to explicit but indirect registration.

          Initialization order problems: if you have code you run during init which depends on the stuff you register at init time, you're going to have to manually manage your initialization in error-prone ways. Adding or removing dependencies may change initialization order.

        • masklinn 4 hours ago

          > Self-registration is usually an anti-pattern in my experience because it introduces globals.

          You mean unlike explicit registration introducing the HANDLERS_BY_NODE_NAMES global?

          • IshKebab 3 hours ago

            Well I guess it isn't quite as simple as I implied.

            The handlers don't know anything about HtmlHandler, so you are free to make OtherHtmlHandler or whatever. The dependency direction is correct, whereas with self-registration your handlers now depend on some single unique global registry. HANDLERS_BY_NODE_NAMES doesn't affect any other code that might interact with the handlers (tests is normally the big one).

            barrkel gave some very good reasons to avoid self-registration.

  • kevmo314 4 hours ago

    I'm curious how the case/when version performs. Unlike the author, I don't think that is any smellier than the list/map solution they've come up with.
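
    For reference, a case/when dispatch of that shape might look like this (hypothetical handler names, with stub classes so the sketch runs standalone; not the article's code):

```ruby
# Hypothetical case/when dispatch by node name. Stub handler classes
# make the sketch self-contained; these are not the article's classes.
TextHandler     = Class.new
ListHandler     = Class.new
ListItemHandler = Class.new
CodeHandler     = Class.new

def handler_for(node_name)
  case node_name
  when "text", "#text" then TextHandler
  when "ul", "ol"      then ListHandler
  when "li"            then ListItemHandler
  when "code", "pre"   then CodeHandler
  end
end
```

MRI can often compile literal-string `when` clauses into a jump table (`opt_case_dispatch`), so a dispatch like this need not be slower than a hash lookup.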

    • chikere232 4 hours ago

      "we improved performance with a simple trick! (rollback our changes)"

  • franciscop 2 hours ago

    I've flamegraph-debugged JS code from time to time, and it usually feels a lot more like a craft of "educated guesses" than the vast majority of programming things I do. I usually only get down to it when there's an actual perf problem so YMMV, but I'm curious: do I do JS flamegraph debugging wrong, or is it something like this for everyone?

    - 20% of the times you get lucky and find a very easy win that speeds up things 90%+. Similar to this post, usually when a single method/call takes a huge chunk of the work.

    - 50% of the times you grind at it and can get 30-50% speed up. I usually try many things, and only some of them do make a difference.

    - 30% of the time absolutely no luck! Many small calls where each is unavoidable, no repeated code, etc.

    • swatcoder 2 hours ago

      Keep in mind that there are many layers of complex systems between your JS code and what'll end up happening on your system when it's run.

      The code defines what the state should look like after it's done executing. It expresses your intent. But that code gets transformed several times on the way to being executed, and then the hardware can apply many different possible approaches to executing it when the time comes.

      More so every year, many of those software transformations, as well as the hardware's execution technique, are quite aggressive about revisiting your program's intent with optimizations (of some kind) that make sense within that context.

      The upshot is that the farther you are from your hardware, the more of these layers there are between your code and its execution, the less influence and insight you have over what actually "physically" happens during execution.

      When it comes to profiling and optimization of high-level programs like those written in JavaScript, this means that it can be somewhere between hard and impossible to predict how your code changes will actually impact performance.

      Radical algorithm redesign can often yield salient differences that feel largely predictable, but smaller "precision" changes are often going to be a crapshoot. All those layers between you and the hardware were making optimizations already anyway, and your "precision" change may just as easily confound those existing optimizations as trigger some new one. The results are tricky.

      This is even true in lower-level code, where we're encouraged to do things like inspect compiler output on godbolt or in our compilation output and always confirm our expectations with a profiler (which often proves our guesses wrong). But it's all that much more pronounced in high-level ones.

      So ultimately, yes, assuming your prevailing algorithms are generally optimal, profiling and optimization is almost always going to feel like a guess-and-test process. But that's okay, because you can test and those tests are usually (not always) telling you if you've made a meaningful difference or not.

    • jesse__ 2 hours ago

      I do low level systems programming, so pretty different from JS-land, but I feel the techniques you should apply when doing optimization generally apply at any level/language.

      0) Algorithmic improvement. Obvious shit like doing a quicksort instead of a bubble sort (assuming N > 64, or whatever), not doing unnecessary work in a hot loop, etc.

      1) reduce memory footprint. The slowest part of your program is almost always just waiting for memory, unless you're doing something that's heavily CPU bound. Web applications are probably always memory bound. Reducing the amount of memory the function you're optimizing operates on reduces DCache misses, which are expensive.

      2) Do batch operations. Once I've got something to a point where it's not completely braindead (which, honestly, is where I stop most of the time), I look to start batching things. Usually I look to do 8 or 16 at a time in the hopes the compiler/runtime can make some use of SIMD. Use STATIC LOOPS, i.e. (for 0..8), so the compiler can unroll the loop. That's extremely important.

      3) probably unavailable (unless you want to/can drop into WASM), but the next step is usually SIMD. This is a rabbit hole, but if you want/need another ~8x perf improvement, this is how to get it

      4) Once all that's done, it's probably close to optimal in terms of cycles per element (unless I did something boneheaded, which is common). The last step is to multithread it if it needs even more juice. This can range from trivial to completely impossible depending on the algorithm. In JS land, you need to make sure you operate on SharedArrayBuffers when doing multithreading for performance, because web workers copy the input/output values by default.

      Anywhoo.. maybe that helps.

      When I try to optimize something lightly, it's not uncommon for me to get a 10x improvement fairly easily. When I optimize something to within an inch of its life, I can sometimes get three or even four orders of magnitude faster.

      • jesse__ 2 hours ago

        EDIT: I forgot to mention that for tight performance, avoid branches. This means ifs, switches, loops, gotos, etc. Sometimes you need branches, but mispredicted branches can be extremely costly, causing pipeline stalls and flushes. This is why using a static loop is important; so the compiler can unroll it and not use a branch.

        I also should mention that I hate flamegraphs. They only give you a bare minimum amount of information for doing performance work. I'm not sure of a good JS profiler, but what you want to be able to do is mark up the sections of code you want profiled, instead of the profiler taking random samples and squashing them all together. Look at the Tracy profiler for an example.

  • mewpmewp2 an hour ago

    I do wonder if the refactor will actually be better. The by-node-names map seems like a scarier, more exceptional construct that forces you down that road, compared to a case/when, which is more straightforward. I think the OOP would work better if each node handler defined its own matcher.

  • gjtorikian 2 hours ago

    You may also be interested in https://github.com/gjtorikian/html-pipeline (or its main dependency, https://github.com/gjtorikian/selma), for high performance HTML manipulation.

  • yayoohooyahoo 4 hours ago

    The title implies they fixed a method in Ruby itself which would have been a lot more interesting than this article.

  • hartator 3 hours ago

    You can also switch to Nokolexbor, our drop-in replacement for Nokogiri: https://github.com/serpapi/nokolexbor

    It should be almost 1,000x faster for this kind of CSS lookup.

  • rand0mstring 34 minutes ago

    by deleting it

  • jonstewart 2 hours ago

    I don't understand "how we made X in Ruby/Python Y% faster" posts. It is of course possible to optimize functions in any language, and often worthwhile to do, but if you're going to spend a lot of engineering resources on it, then can I introduce you to my friends C++ and Rust?

  • andrewstuart 4 hours ago

    The waterfall of end statements in Ruby reminds me of Pascal. Seems verbose.

    • pansa2 4 hours ago

      Nobody ever likes my suggestion to write them all on one line. I thought it was neat - the length of the word `end` makes it line up perfectly with 4-space indentation:

          class HtmlTransform
              class Code < Base
                  def markdown
                    "`#{node.text}`"
          end end end
      • Etheryte 4 hours ago

        I think it's pretty easy to see why people would dislike it: with each `end` on its own line and indented, it's very easy to track what ends where. With this version, not so much, e.g. if you're five nests deep and then see three end statements on one line.

        • pansa2 3 hours ago

          I think this is fine - I don't see why having the `end`s on separate lines would make it easier to understand:

              if ...
                  if ...
                      if ...
                          if ...
                              if ...
                                  x = 1
                      end end end
                      y = 2
              end end
          • Borg3 2 hours ago

            When I see such code I chuckle... Really? I always try to make my code as flat as possible, either using next or break (or splitting into a function and using return). That's why I sometimes miss goto. But case can emulate it pretty well.

    • rco8786 3 hours ago

      Interesting. I consider myself a rubyist and never considered this. Perhaps it's because the rest of the language is so concise that this little verbosity never really bothered me.

  • cess11 5 hours ago

    "Use the index, Luke".

    • scotty79 4 hours ago

      Use the hashmap.

      Also don't replace string comparison with CSS selector search and expect it to be fast.
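
      That point is easy to see with a stdlib-only toy (illustrative handler names; no Nokogiri involved): dispatching on a node name is a single hash probe, versus scanning and comparing per node.

```ruby
require "benchmark"

# Toy comparison, standard library only: one hash probe per node name
# versus a linear scan that rebuilds and searches the key list each time.
# Handler names are illustrative.
HANDLERS = { "h1" => :heading, "p" => :paragraph, "code" => :code }.freeze
NAMES = (%w[h1 p code] * 10_000).freeze

hash_time = Benchmark.realtime do
  NAMES.each { |name| HANDLERS[name] }
end

scan_time = Benchmark.realtime do
  NAMES.each { |name| HANDLERS.keys.find { |key| key == name } }
end
```

On a typical run, `hash_time` comes out well below `scan_time`; a full CSS selector match per node only widens the gap.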