How we made a Ruby method 200x faster

(campsite.com)

42 points | by nholden 5 days ago

37 comments

  • chikere232 4 hours ago

    The title kinda glossed over the fact that they started out with working, fast code, and then broke it. Sure, their fix was faster than their most broken version, but it's less impressive than starting with slow code and improving it.

    • bastawhiz an hour ago

      They said they made the change because the code was starting to become hard to maintain. That's not a terrible reason for refactoring.

      • notjoemama 28 minutes ago

        I think they were referring to the degree of speed up.

        • bastawhiz 8 minutes ago

          The degree of speedup is the refactored code being fixed to not be slow.

  • cluckindan 4 hours ago

    This should have been obvious before the fact to anyone who understands how CSS selectors work in browsers.

    As in, they are matched right-to-left, which implies that a selector like “p a” first selects all the <a> nodes, and for each of them, it then traverses up the DOM tree until it encounters a <p> node (selector matches) or the root node (selector doesn’t match).

    That said, the traversing shouldn’t happen for plain tag selectors like “h1”. There must be something wrong with the library they used.
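
    The right-to-left walk described above can be sketched in plain Ruby (a toy `Node` with a parent pointer; purely illustrative, not any engine's actual code):

```ruby
# Toy sketch of right-to-left descendant matching (not real engine code).
# A node knows only its tag name and its parent.
Node = Struct.new(:tag, :parent)

# For a selector like "p a": check the rightmost token against the node
# itself, then walk up the tree looking for the ancestor token.
def matches_descendant?(node, ancestor_tag, self_tag)
  return false unless node.tag == self_tag

  current = node.parent
  while current
    return true if current.tag == ancestor_tag
    current = current.parent
  end
  false # hit the root without finding the ancestor
end

root   = Node.new("html", nil)
para   = Node.new("p", root)
a_in_p = Node.new("a", para)   # matches "p a"
a_bare = Node.new("a", root)   # does not match "p a"
```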

    • rco8786 3 hours ago

      > This should have been obvious before the fact to anyone who understands how CSS selectors work in browsers.

      For sure, but that's somewhat reductive. This is not exactly common knowledge, and certainly not something you'd expect to immediately spring to mind for any given engineer.

    • the_other 3 hours ago

      Is that really how it works in browsers and other rendering engines?

      Intuition suggests to me that it wouldn’t start with CSS and then find all the matching DOM nodes. I would expect it started at each DOM node and then found the CSS rules which might apply.

      So “I’m adding an A to the tree; what are all the CSS rules with A or * as the rightmost token; which of that set applies to my current A; apply the rules in that subset”. Going depth-first into the DOM like this should result in skipping redundant CSS, and (as my imagination draws it) reduce DOM traversals.

      • esprehn 2 hours ago

        There are three different modes of running a selector in typical browsers:

          (a) Element#matches
          (b) Element#querySelector(All)
          (c) By the engine for updating style and layout
        
        
        The GP seems to be talking about (b), but even then browsers are checking each element one by one, not advancing through the selector state machine in parallel for every element. (There's one exception: the old Cobalt, which did advance the state machines in parallel, IIRC.)

        (a) and (c) are conceptually very similar except that when doing (c) you're checking many elements at the same time so browsers will do extra upfront costs like filling bloom filters for ancestors or index maps for nth-child.
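
        The ancestor bloom filter mentioned here can be sketched roughly in Ruby (a simplified toy, not any browser's implementation): each ancestor feature sets a couple of bits, so many descendant selectors can be rejected without walking up the tree.

```ruby
# Simplified toy of an ancestor bloom filter (not any browser's code).
# Each ancestor feature (just tag names here) sets two bits in a 64-bit
# mask; a "no" answer is definite, while a "yes" may be a false positive
# that falls back to the full ancestor walk.
class AncestorFilter
  def initialize
    @bits = 0
  end

  def add(feature)
    @bits |= mask_for(feature)
  end

  def may_contain?(feature)
    m = mask_for(feature)
    (@bits & m) == m
  end

  private

  # Derive two bit positions from the feature's hash.
  def mask_for(feature)
    h = feature.hash
    (1 << (h % 64)) | (1 << ((h >> 6) % 64))
  end
end

filter = AncestorFilter.new
%w[html body article p].each { |tag| filter.add(tag) }
```

A "p a" check on an <a> node can then skip the ancestor walk entirely whenever `may_contain?("p")` is false.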

        In TFA they're doing .matches(), which I would expect to be slower than a hash map lookup, but for a simple selector like they're doing (just tag name) it shouldn't do much more than:

          (1) Parse the selector, hopefully cache that in an LRU
          (2) Execute the selector state machine against the element 
            (2.1) Compare tagName in the selector
        
        
        Apparently Nokogiri implements CSS matching in a very inefficient way, though, by collecting ancestors and then converting the CSS into XPath and matching that:

        https://github.com/sparklemotion/nokogiri/blob/e8d30a71d70b2...

        https://github.com/sparklemotion/nokogiri/blob/e8d30a71d70b2...

        I'd expect that to be an order of magnitude slower than what a browser does.

      • cluckindan 2 hours ago

        In browsers, DOM parsing starts before (all) CSS is loaded and parsed. Also, the sizes of elements in the flow are (by default) dictated by the text content, so it really does not make sense to try to paint a page in a root-to-leaf order.

  • Alifatisk 5 hours ago

    That's a huge improvement but damn, the fixed code didn't look any better in my eyes.

    Going from

      HANDLERS = [
        Text,
        List,
        ListItem,
        Code,
        # ...
      ].freeze
    
    to

      HANDLERS_BY_NODE_NAMES = [
        Text,
        List,
        ListItem,
        Code,
        # ...
      ].each_with_object({}) do |handler, result|
        handler::NODE_NAMES.each { |node_name| result[node_name] = handler }
      end.freeze

    • viraptor 5 hours ago

      I'd go with this since it's not performance critical code. Not sure if it's that more readable, but I like it better:

          BY_NODE_NAMES = HANDLERS.map {|h|
            h::NODE_NAMES.map {|n| [n, h]}
          }.flatten(1).to_h

      • ngcazz 43 minutes ago

        I think we can go harder on the std lib :)

          HANDLERS.flat_map { _1.node_names.index_with(_1) }.inject(&:merge)
        
        
        (nb: assuming there exists a `.node_names` to expose the constant... just because I like always using method calls)

    • masklinn 5 hours ago

      A bigger question for me would be why the handlers don’t register themselves. It should be a very small amount of meta-programming, and would avoid having to repeat the handlers to register them.
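
      A minimal sketch of what that could look like, assuming each handler declares the node names it covers (class and method names here are illustrative, not the article's code):

```ruby
# Sketch of self-registering handlers (illustrative names, not the
# article's actual classes). Declaring the node names a handler covers
# registers it as a side effect, so no separate handler list is needed.
class Base
  REGISTRY = {}

  def self.handles(*node_names)
    node_names.each { |name| REGISTRY[name] = self }
  end
end

class CodeHandler < Base
  handles "code", "pre"
end

class TextHandler < Base
  handles "#text"
end
```

The replies below point out the trade-offs: `REGISTRY` is still global state, just populated implicitly.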

      • IshKebab 4 hours ago

        Self-registration is usually an anti-pattern in my experience because it introduces globals. Sometimes you can't avoid it, but if you only have a few things to register it's usually better to just list them explicitly.

        • barrkel 4 hours ago

          Self-registration has a lot of problems.

          Global state, like you say - you can't merge two apps that use the same registration mechanism but use incompatible registered sets.

          Lack of discoverability in maintenance - it's a kind of COME FROM but for data.

          A barrier to optimization: it's not clear what will break when you remove a dependency, and you have to eagerly run all initializers everywhere - if you try to be clever and lazy, you have to pick and choose, and you're back to explicit but indirect registration.

          Initialization order problems: if you have code you run during init which depends on the stuff you register at init time, you're going to have to manually manage your initialization in error-prone ways. Adding or removing dependencies may change initialization order.

        • masklinn 4 hours ago

          > Self-registration is usually an anti-pattern in my experience because it introduces globals.

          You mean unlike explicit registration introducing the HANDLERS_BY_NODE_NAMES global?

          • IshKebab 3 hours ago

            Well I guess it isn't quite as simple as I implied.

            The handlers don't know anything about HtmlHandler, so you are free to make OtherHtmlHandler or whatever. The dependency direction is correct, whereas with self-registration your handlers now depend on some single unique global registry. HANDLERS_BY_NODE_NAMES doesn't affect any other code that might interact with the handlers (tests is normally the big one).

            barrkel gave some very good reasons to avoid self-registration.

  • kevmo314 4 hours ago

    I'm curious how the case/when version performs. Unlike the author, I don't think that is any smellier than the list/map solution they've come up with.
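
    For reference, a case/when dispatch of that shape might look like this (hypothetical handler names, with stub classes so the sketch runs standalone; not the article's code):

```ruby
# Hypothetical case/when dispatch by node name. Stub handler classes
# make the sketch self-contained; these are not the article's classes.
TextHandler     = Class.new
ListHandler     = Class.new
ListItemHandler = Class.new
CodeHandler     = Class.new

def handler_for(node_name)
  case node_name
  when "text", "#text" then TextHandler
  when "ul", "ol"      then ListHandler
  when "li"            then ListItemHandler
  when "code", "pre"   then CodeHandler
  end
end
```

MRI can often compile literal-string `when` clauses into a jump table (`opt_case_dispatch`), so a dispatch like this need not be slower than a hash lookup.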

    • chikere232 4 hours ago

      "we improved performance with a simple trick! (rollback our changes)"

  • franciscop 2 hours ago

    I've flamegraph-debugged JS code from time to time, and it usually feels a lot more like a craft of "educated guesses" than the vast majority of programming things I do. I usually only get down to it when there's an actual perf problem so YMMV, but I'm curious: do I do JS flamegraph debugging wrong, or is it something like this for everyone?

    - 20% of the times you get lucky and find a very easy win that speeds up things 90%+. Similar to this post, usually when a single method/call takes a huge chunk of the work.

    - 50% of the times you grind at it and can get 30-50% speed up. I usually try many things, and only some of them do make a difference.

    - 30% of the time absolutely no luck! Many small calls where each is unavoidable, no repeated code, etc.

    • swatcoder 2 hours ago

      Keep in mind that there are many layers of complex systems between your JS code and what'll end up happening on your system when it's run.

      The code defines what the state should look like after it's done executing. It expresses your intent. But that code gets transformed several times on the way to being executed, and then the hardware can apply many different possible approaches to executing it when the time comes.

      More so every year, many of those software transformations, as well as the hardware's execution technique, are quite aggressive about revisiting your program's intent with optimizations (of some kind) that make sense within that context.

      The upshot is that the farther you are from your hardware, the more of these layers there are between your code and its execution, the less influence and insight you have over what actually "physically" happens during execution.

      When it comes to profiling and optimization of high-level programs like those written in JavaScript, this means that it can be somewhere between hard and impossible to predict how your code changes will actually impact performance.

      Radical algorithm redesign can often yield salient differences that feel largely predictable, but smaller "precision" changes are often going to be a crapshoot. All those layers between you and the hardware were making optimizations already anyway, and your "precision" change may just as easily confound those existing optimizations as trigger some new one. The results are tricky.

      This is even true in lower-level code, where we're encouraged to do things like inspect compiler output on godbolt or in our compilation output and always confirm our expectations with a profiler (which often proves our guesses wrong). But it's all that much more pronounced in high-level ones.

      So ultimately, yes, assuming your prevailing algorithms are generally optimal, profiling and optimization is almost always going to feel like a guess-and-test process. But that's okay, because you can test and those tests are usually (not always) telling you if you've made a meaningful difference or not.

    • jesse__ 2 hours ago

      I do low level systems programming, so pretty different from JS-land, but I feel the techniques you should apply when doing optimization generally apply at any level/language.

      0) Algorithmic improvement. Obvious shit like doing a quicksort instead of a bubble sort (assuming N > 64, or whatever), not doing unnecessary work in a hot loop, etc.

      1) reduce memory footprint. The slowest part of your program is almost always just waiting for memory, unless you're doing something that's heavily CPU bound. Web applications are probably always memory bound. Reducing the amount of memory the function you're optimizing operates on reduces DCache misses, which are expensive.

      2) Do batch operations. Once I've got something to a point where it's not completely braindead (which, honestly, is where I stop most of the time), I look to start batching things. Usually I look to do 8 or 16 at a time in the hopes the compiler/runtime can make some use of SIMD. Use STATIC LOOPS, i.e. (for 0..8), so the compiler can unroll the loop. That's extremely important.

      3) probably unavailable (unless you want to/can drop into WASM), but the next step is usually SIMD. This is a rabbit hole, but if you want/need another ~8x perf improvement, this is how to get it

      4) Once all that's done, it's probably close to optimal in terms of cycles per element (unless I did something boneheaded, which is common). The last step is to multithread it if it needs even more juice. This can range from trivial to completely impossible depending on the algorithm. In JS land, you need to make sure you operate on SharedArrayBuffers when doing multithreading for performance, because web workers copy the input/output values by default.

      Anywhoo.. maybe that helps.

      When I try to optimize something lightly, it's not uncommon for me to get a 10x improvement fairly easily. When I optimize something to within an inch of its life, I can sometimes get three or even four orders of magnitude faster.

      • jesse__ 2 hours ago

        EDIT: I forgot to mention that for tight performance, avoid branches. This means ifs, switches, loops, gotos, etc. Sometimes you need branches, but mispredicted branches can be extremely costly, causing pipeline stalls and flushes. This is why using a static loop is important; so the compiler can unroll it and not use a branch.

        I also should mention that I hate flamegraphs. They only give you a bare minimum amount of information for doing performance work. I'm not sure of a good JS profiler, but what you want to be able to do is mark up the sections of code you want profiled, instead of the profiler taking random samples and squashing them all together. Look at the Tracy profiler for an example.

  • mewpmewp2 an hour ago

    I do wonder if the refactor will actually be better. The by-node-names map seems like a scarier, more exceptional construct that forces you down that road, compared to a case/when, which is more straightforward. I think the OOP would work better if each node handler defined its own matcher.

  • gjtorikian 2 hours ago

    You may also be interested in https://github.com/gjtorikian/html-pipeline (or its main dependency, https://github.com/gjtorikian/selma), for high performance HTML manipulation.

  • yayoohooyahoo 4 hours ago

    The title implies they fixed a method in Ruby itself which would have been a lot more interesting than this article.

  • hartator 3 hours ago

    You can also switch to Nokolexbor, our drop-in replacement for Nokogiri: https://github.com/serpapi/nokolexbor

    It should be almost 1,000x faster for this kind of CSS lookup.

  • rand0mstring 34 minutes ago

    by deleting it

  • jonstewart 2 hours ago

    I don't understand "how we made X in Ruby/Python Y% faster" posts. It is of course possible to optimize functions in any language, and often worthwhile to do, but if you're going to spend a lot of engineering resources on it, then can I introduce you to my friends C++ and Rust?

  • andrewstuart 4 hours ago

    The waterfall of end statements in Ruby reminds me of Pascal. Seems verbose.

    • pansa2 4 hours ago

      Nobody ever likes my suggestion to write them all on one line. I thought it was neat - the length of the word `end` makes it line up perfectly with 4-space indentation:

          class HtmlTransform
              class Code < Base
                  def markdown
                    "`#{node.text}`"
          end end end
      • Etheryte 4 hours ago

        I think it's pretty easy to see why people would dislike it: with each `end` on its own line and indented, it's very easy to track what ends where. With this version, not so much, e.g. if you're five nests deep and then see three end statements on one line.

        • pansa2 3 hours ago

          I think this is fine - I don't see why having the `end`s on separate lines would make it easier to understand:

              if ...
                  if ...
                      if ...
                          if ...
                              if ...
                                  x = 1
                      end end end
                      y = 2
              end end
          • Borg3 2 hours ago

            When I see such code I chuckle... Really? I always try to make my code as flat as possible, either using next or break (or splitting into a function and using return). That's why I sometimes miss goto. But case can emulate it pretty well.

    • rco8786 3 hours ago

      Interesting. I consider myself a rubyist and never considered this. Perhaps it's because the rest of the language is so concise that this little verbosity never really bothered me.

  • cess11 5 hours ago

    "Use the index, Luke".

    • scotty79 4 hours ago

      Use the hashmap.

      Also don't replace string comparison with CSS selector search and expect it to be fast.
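
      That point is easy to see with a stdlib-only toy (illustrative handler names; no Nokogiri involved): dispatching on a node name is a single hash probe, versus scanning and comparing per node.

```ruby
require "benchmark"

# Toy comparison, standard library only: one hash probe per node name
# versus a linear scan that rebuilds and searches the key list each time.
# Handler names are illustrative.
HANDLERS = { "h1" => :heading, "p" => :paragraph, "code" => :code }.freeze
NAMES = (%w[h1 p code] * 10_000).freeze

hash_time = Benchmark.realtime do
  NAMES.each { |name| HANDLERS[name] }
end

scan_time = Benchmark.realtime do
  NAMES.each { |name| HANDLERS.keys.find { |key| key == name } }
end
```

On a typical run, `hash_time` comes out well below `scan_time`; a full CSS selector match per node only widens the gap.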