I feel like I just read the introduction to a really interesting article.
It really seemed like there was more to be said there.
Agreed, although that is his personal style. Most of his blog posts are interesting, although short. I see them more as very well-formatted musings.
It's not necessarily a bad thing; in fact, the world might benefit from more of this style. It's "here's some data, decide what it means for yourself."
I think the expectation of more comes from mostly encountering articles written in a different form.
It’s a very interesting benchmark (https://github.com/lemire/TestingMLP) — probably worth adding to Phoronix or some wider suite.
Every couple of years I refresh my own parallel reduction benchmarks (https://github.com/ashvardanian/ParallelReductionsBenchmark), which are also memory-bound. Mine mostly focus on the boring, simple throughput-maximizing cases on CPUs and GPUs.
Lately, as GPUs are pulled into more general data-processing tasks, I keep running into non-coalesced, pointer-chasing patterns — but I still don’t have a good mental model for estimating the cost of different access strategies. A crossover between these two topics — running MLP-style loads on GPUs — might be exactly the benchmark that's missing, in case someone is looking for a good weekend project!
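Roughly the pattern I have in mind, as a quick CPU-side sketch (not code from either repo; the sizes are arbitrary, and a GPU port would launch one chain per thread, which immediately exposes the non-coalesced cost): walk K independent chains through one big random cyclic permutation and watch how many loads per nanosecond you get as K grows.

    #include <algorithm>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    // Build a random cyclic permutation: next[i] is the index of the next hop,
    // so a chain visits every element exactly once before wrapping around.
    std::vector<uint32_t> make_cycle(size_t n, std::mt19937 &rng) {
        std::vector<uint32_t> order(n);
        std::iota(order.begin(), order.end(), 0u);
        std::shuffle(order.begin(), order.end(), rng);
        std::vector<uint32_t> next(n);
        for (size_t i = 0; i + 1 < n; ++i) next[order[i]] = order[i + 1];
        next[order[n - 1]] = order[0];
        return next;
    }

    // Walk `lanes` independent chains in lockstep: each lane is a serial chain
    // of dependent loads, but the lanes are independent, so the core can overlap
    // their cache misses until memory-level parallelism runs out.
    double loads_per_ns(const std::vector<uint32_t> &next, int lanes, size_t hops) {
        std::vector<uint32_t> cur(lanes);
        for (int l = 0; l < lanes; ++l) cur[l] = static_cast<uint32_t>(l);
        auto t0 = std::chrono::steady_clock::now();
        for (size_t h = 0; h < hops; ++h)
            for (int l = 0; l < lanes; ++l) cur[l] = next[cur[l]];
        auto t1 = std::chrono::steady_clock::now();
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        // Use the result so the timed loop is not optimized away.
        std::fprintf(stderr, "# checksum %u\n", cur[0]);
        return double(hops) * lanes / ns;
    }

    int main() {
        std::mt19937 rng(42);
        auto next = make_cycle(size_t(1) << 26, rng);  // 64M entries, ~256 MiB: far bigger than cache
        for (int lanes = 1; lanes <= 32; ++lanes)
            std::printf("%2d lanes: %.3f loads/ns\n",
                        lanes, loads_per_ns(next, lanes, size_t(1) << 22));
    }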
I wish the chart extended past 28; otherwise, how do we know that it tops out there?
You don't; the author explains that testing beyond that produces noise that makes it hard to analyze.
It's pretty trivial to keep randomising the array and plot some min/max bands, or just the average.
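Something like this, reusing the make_cycle / loads_per_ns helpers from the sketch upthread (again just a sketch with arbitrary sizes): rebuild the permutation with a fresh seed each trial and report min/avg/max per lane count, which would show whether the curve really flattens past 28 or is just noise.

    // Swap this in for the main() above: repeated trials per lane count,
    // each with a freshly randomised array layout.
    int main() {
        const int trials = 8;
        for (int lanes = 1; lanes <= 64; ++lanes) {
            double lo = 1e300, hi = 0.0, sum = 0.0;
            for (int t = 0; t < trials; ++t) {
                std::mt19937 rng(1000u * lanes + t);          // new seed -> new memory layout
                auto next = make_cycle(size_t(1) << 26, rng);
                double r = loads_per_ns(next, lanes, size_t(1) << 22);
                lo = std::min(lo, r);
                hi = std::max(hi, r);
                sum += r;
            }
            std::printf("%2d lanes: min %.3f  avg %.3f  max %.3f loads/ns\n",
                        lanes, lo, sum / trials, hi);
        }
    }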