The "high-level CPU" challenge (2008)

(yosefk.com)

72 points | by signa11 2 days ago

51 comments

  • throwaway31131 a day ago

    "You don't think about memories when you design hardware"

    I found that comment interesting because memory is mostly what I think about when I design hardware. Memory access is by far the slowest thing, so managing memory bandwidth is absolutely critical, and it uses a significant portion of the die area.

    Also, a significant portion of the microarchitecture of a modern CPU is all about managing memory accesses.

    If there were a hypothetical high-level CPU language that somehow encoded all the information the microarchitecture needs to measure to manage memory access, then it would likely tie a conventional CPU in performance, assuming that CPU's team did a good job measuring memory. Still, it wouldn't need all the extra hardware that did the measuring. So I see that as a win.

    The main problem is, I have absolutely no idea how to do that, and unfortunately, I haven't met anyone else who knows either. Hence, tons of bookkeeping logic in CPUs persists.

    • Legend2440 a day ago

      Keep in mind this article is almost 20 years old.

      Memory access used to matter less because CPUs were a lot slower relative to memory. When I was a kid you could get a speedup by using lookup tables for trig functions - you'd never do that today; it's faster to recalculate.
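
      (To make the trade-off concrete, a minimal C++ sketch of the old trick, with a made-up table size and invented names; nearest-entry lookup, no interpolation:)

        #include <cmath>

        // Precompute sine into a table. On old machines the lookup beat the FPU;
        // on a modern CPU the cache miss usually costs more than calling std::sin.
        constexpr int kN = 1024;                      // power of two
        constexpr double kTwoPi = 6.283185307179586;
        float sin_table[kN];

        void init_sin_table() {
            for (int i = 0; i < kN; ++i)
                sin_table[i] = static_cast<float>(std::sin(kTwoPi * i / kN));
        }

        float table_sin(float x) {                    // x in [0, 2*pi)
            int idx = static_cast<int>(x * (kN / kTwoPi)) & (kN - 1);
            return sin_table[idx];                    // one load, possibly a cache miss
        }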

      • PeterWhittaker a day ago

        Wait, what?

        Oh, yeah. Jeebus.

        2008 was SEVENTEEN years ago.

        I turn 60 soon. Anything 2xxx just seems like almost yesterday, 199x isn't that far away in my mind, and even 198x is pretty close.

        I love math (B.Sc. Physics, so more math than anything else) but sometimes math hurts.

        Damn you, beloved math.

    • gchadwick a day ago

      This certainly rings true with my own experiences (worked on both GPUs and CPUs and now doing AI inference at UK startup Fractile).

      > If there were a hypothetical high-level CPU language that somehow encoded all the information the microarchitecture needs to measure to manage memory access,

      I think this is fundamentally impossible because of dynamic behaviours. It's very tempting to assume that if you're clever enough you can work it all out ahead of time, encode what you need as static information, and have the whole computation run like clockwork on the hardware, with little or no area spent on scheduling, stalling, buffering etc. But I think history has shown over and over that this just doesn't work (for more general computation at least; it's more feasible in restricted domains). There are always fiddly details in real workloads that surprise you, and if you've got an inflexible system you've got no give: you end up 'stopping the world' (or some significant part of it) to deal with them, which kills your performance.

      Notably, running transformer models feels like one of those restricted domains where you could do this well, but dig in and there's plenty of dynamic behaviour in there that you can't escape the problems it causes.

      • throwaway31131 17 hours ago

        > I think this is fundamentally impossible because of dynamic behaviours.

        I was thinking that too as it sounds awfully close to a variant of the halting problem.

        But since I am not sure that it's actually equivalent, and there might be some heuristic that constrains the problem enough to make it solvable "enough" in practice, I went with the softer "I don't know how to do it" rather than "it can't be done".

    • eigenform a day ago

      Also weird because pipelining is very literally "insert memories to cut some bigger process into parts that can occur in parallel with one another"

    • foota a day ago

      This is almost the opposite of what you're referring to I think, but interestingly Itanium had instruction level parallelism hard-coded by the compiler as opposed to determined by the processor at runtime.

    • uticus a day ago

      "memories" could also refer to registers, or even a stack-based approach

      • blakepelton a day ago

        Some folks use the term "state element" to describe the set of things which hold state between clock cycles. Memories (e.g., SRAM) and registers are both examples of state elements.

        I took the "You don't think about memories when you design hardware" paragraph from Yosef to mean that SRAM is so highly tuned that in most cases your best bet is to just use it "off the shelf" rather than invent some novel type of state element. And if you use SRAM, then you are stuck sucking through a small straw (e.g., an SRAM can hold 1K words, but you can only read 1 word/cycle).
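
        (A toy C++ model of that straw, with invented names; the only point is the constraint: 1K words of state behind a single read port, so one word per cycle no matter how much you would like to read in parallel:)

          #include <array>
          #include <cassert>
          #include <cstdint>

          // Toy single-port SRAM: plenty of state, but only one read per cycle.
          struct SinglePortSram {
              std::array<uint32_t, 1024> words{};
              bool port_used_this_cycle = false;

              uint32_t read(uint32_t addr) {
                  assert(!port_used_this_cycle && "second read this cycle: stall");
                  port_used_this_cycle = true;
                  return words[addr & 1023];
              }
              void next_cycle() { port_used_this_cycle = false; }  // clock edge
          };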

  • jonathaneunice 2 days ago

    "High-level CPUs" are a tarpit. Beautiful idea in theory, but history shows they're a Bad Idea.

    Xerox, LMI, Symbolics, Intel iAPX 432, UCSD p-System, Jazelle, PicoJava—dozens of fancy designs for Lisp, Smalltalk, Ada, Pascal, Java—yet none of them ever lasted more than a product iteration or three. They were routinely outstripped by the Motorola 68K, x86, and RISC CPUs. Whether your metric of choice is shipment volume, price/performance, or even raw performance, they underwhelmed. A trail of tears for 40+ years.

    • switchbak a day ago

      I'm just a software guy, so my thoughts are probably painfully naive here.

      My understanding is that we're mostly talking about economics - e.g. that there's no way a Java/Lisp CPU could ever compete with a general-purpose CPU. That's what I thought was the reason for the Symbolics CPU's decline versus general-purpose chips.

      It seems like some hardware co-processing could go a long way for some purposes though? GC in particular seems like it would be amenable to some dedicated hardware.

      These days we're seeing dedicated silicon for "AI" chips in newer processors. That's not a high level CPU as described in the article, but it does seem like we're moving away from purely general purpose CPUs into a more heterogeneous world of devoting more silicon for other important purposes.

      • bigfishrunning a day ago

        Those "AI" chips generally just optimize around a vector-multiply, allowing more parallelism with less precision. This is almost a lower level processor then a general purpose CPU, and doesn't really map to the "string compare instruction" type of functionality that Yossi Kreinin describes in the essay.

      • wmf a day ago

        If you want to build something that is used you can't ignore economics. But in this case I think high-level CPUs are worse even if you could put economics aside.

        > GC in particular seems like it would be amenable to some dedicated hardware.

        Azul went through that incredible journey.

    • irq-1 a day ago

      That's true, but isn't it an issue of price and volume? Specialized network cards are OK (in the market.)

      • jonathaneunice a day ago

        Probably! Everything runs on economics. Network and graphics accelerators did just fine—but then they were (or became over time) high-volume opportunities. Volume drives investment; investment drives progress.

      • do_not_redeem a day ago

        Those specialized network cards have ASICs that parse network packets and don't need additional memory to do their job. You can easily build a fixed-size hardware register to hold an IPv4 packet header, for example.

        But a "high level CPU" is all about operating on high-level objects in memory, so your question becomes, how do you store a linked list in memory more efficiently than a RISC CPU? How do you pointer chase to the next node faster than a RISC CPU? Nobody has figured it out yet, and I agree with the article, I don't see how it's possible. CPUs are already general-purpose and very efficient.

    • theredleft a day ago

      Throwing in "a trail of tears" at the end is pretty ridiculous. It's just hardware.

      • Dylan16807 a day ago

        Did you consider maybe they weren't making a reference?

  • delta_p_delta_x 2 days ago

    As James Mickens says in The Night Watch[1],

      “Why would someone write code in a grotesque language that exposes raw memory addresses? Why not use a modern language with garbage collection and functional programming and free massages after lunch?” Here’s the answer: Pointers are real. They’re what the hardware understands. Somebody has to deal with them. You can’t just place a LISP book on top of an x86 chip and hope that the hardware learns about lambda calculus by osmosis.
    
    But I feel there's a middle ground between LISP and raw assembly and mangling pointers. For instance, modern C++ and Rust (and all the C++ 'successors', like Carbon, Circle, Safe C++, etc) have a very novel and clear set of ownership, view, and span semantics that make more declarative, functional, side-effect-free programming easy to write and read, while still maintaining high performance—most notably without unnecessarily throwing copies around.

    A few days ago a friend asked me about an HPC problem: he'd written a bog-standard nested for loop with indices, was observing poor performance, and asked me if using C++23 std::ranges::views would help. It did, but I also fundamentally reworked his data flow. I saw something like a 5x performance improvement, even though I was writing at a higher level of abstraction than he was.
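
    (The actual code isn't mine to share; this is just the shape of the change, with invented names and a trivial filter/transform pipeline standing in for the real one:)

      #include <cstddef>
      #include <ranges>
      #include <vector>

      // Before (the shape of it): index loops plus an intermediate vector.
      double sum_of_positive_squares_v1(const std::vector<double>& xs) {
          std::vector<double> tmp;
          for (std::size_t i = 0; i < xs.size(); ++i)
              if (xs[i] > 0.0)
                  tmp.push_back(xs[i] * xs[i]);
          double sum = 0.0;
          for (std::size_t i = 0; i < tmp.size(); ++i)
              sum += tmp[i];
          return sum;
      }

      // After: one lazy pipeline, no temporary allocation, a single pass over memory.
      double sum_of_positive_squares_v2(const std::vector<double>& xs) {
          auto squares = xs
              | std::views::filter([](double x) { return x > 0.0; })
              | std::views::transform([](double x) { return x * x; });
          double sum = 0.0;
          for (double s : squares) sum += s;
          return sum;
      }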

    [1]: https://www.usenix.org/system/files/1311_05-08_mickens.pdf

  • mikewarot 2 days ago

    It's my strong opinion that Von Neumann's architecture is great for general purpose problem solving. However, for computing LLMs and similar workloads with fully known execution plans, it would be far better to decompose them into a fully parallel and pipelined graph to be executed on a reconfigurable computing mesh.

    My particular hobby horse is the BitGrid, a systolic array of 4x4 bit look up tables clocked in 2 phases to eliminate race conditions.

    My current estimate is that it could save 95% of the energy for a given computation.

    Getting rid of RAM and only moving data between adjacent cells offers the ability to really jack up the clock rate, because you're not driving signals across the die.
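
    (For the curious, a toy C++ model of my reading of the idea, not the real design: each cell maps 4 input bits to 4 output bits through four 16-entry LUTs, a cell only sees one bit from each of its four neighbours, and the two clock phases update the checkerboard colours alternately so a cell's inputs never change while it computes:)

      #include <array>
      #include <cstdint>
      #include <vector>

      // One cell: four 16-bit LUTs, each mapping the 4 input bits to one output bit.
      struct Cell {
          std::array<uint16_t, 4> lut{};   // configuration: "the program"
          uint8_t in = 0, out = 0;         // 4-bit input / output latches

          void eval() {
              uint8_t o = 0;
              for (int b = 0; b < 4; ++b)
                  o |= ((lut[b] >> (in & 0xF)) & 1u) << b;
              out = o;
          }
      };

      // Two-phase update: phase 0 evaluates the "even" checkerboard cells, phase 1
      // the "odd" ones, so a cell's neighbours are frozen while it computes.
      void tick(std::vector<std::vector<Cell>>& grid, int phase) {
          int h = static_cast<int>(grid.size()), w = static_cast<int>(grid[0].size());
          for (int y = 0; y < h; ++y)
              for (int x = 0; x < w; ++x) {
                  if (((x + y) & 1) != phase) continue;
                  uint8_t in = 0;
                  if (y > 0)     in |= (grid[y-1][x].out & 1u) << 0;  // north
                  if (x < w - 1) in |= (grid[y][x+1].out & 1u) << 1;  // east
                  if (y < h - 1) in |= (grid[y+1][x].out & 1u) << 2;  // south
                  if (x > 0)     in |= (grid[y][x-1].out & 1u) << 3;  // west
                  grid[y][x].in = in;
                  grid[y][x].eval();
              }
      }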

    • elchananHaas 2 days ago

      For sure. The issue is that many AI workloads require terabytes per second of memory bandwidth and are on the cutting edge of memory technologies. As long as you can get away with little memory usage you can have massive savings; see Bitcoin ASICs.

      The great thing about the von Neumann architecture is that it is flexible enough to have all sorts of operations added to it, including specialized matrix multiplication operations and async memory transfers. So I think it's here to stay.

      • mikewarot a day ago

        The reason they need terabytes per second is that they're constantly loading weights and data into, and then back out of, multiply-accumulate chips.

        If you program the hardware to just do the multiply/accumulate with the weights built in, you can reduce the bandwidth required to just putting data into a "layer" and getting the results out, a much, MUCH lower amount of data in most cases.
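
        (Back-of-envelope with made-up but typical sizes, just to show how lopsided the ratio is:)

          // Hypothetical layer: a 4096 x 4096 fp16 weight matrix, weights kept on-chip.
          constexpr long long rows = 4096, cols = 4096, bytes_per_value = 2;
          constexpr long long weight_bytes     = rows * cols * bytes_per_value;   // ~32 MiB streamed per token today
          constexpr long long activation_bytes = (rows + cols) * bytes_per_value; // ~16 KiB in + out if weights stay put
          static_assert(weight_bytes / activation_bytes == 2048, "about 2000x less traffic");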

        • thrtythreeforty a day ago

          Doesn't this presume that you can fit the model's weights into the SRAMs of a single chip, or of multiple chips that you plan to connect together? A big reason HBM is a thing is because it's much, much denser than SRAM.

    • Dylan16807 a day ago

      That sounds like an FPGA, but aren't those notoriously slow for most uses? You can make a kickass signal pipeline or maybe an emulator but most code gets shunted to an attached arm core or even a soft-core implemented on top that wastes an order of magnitude of performance.

      And no architecture is clock limited by driving signals across the die. Why do people keep assuming this? CPU designers are very smart and they break slow things into multiple steps.

      • bigfishrunning a day ago

        > That sounds like an FPGA, but aren't those notoriously slow for most uses?

        If you're trying to run software on them, yes. If you use them for their stated purpose (as an array of gates), then they can be orders of magnitude faster than the equivalent computer program. It's all about using the right tool for the right job.

    • librasteve 2 days ago

      occam (MIMD) is an improvement over CUDA (SIMD)

      with latest (eg TSMC) processes, someone could build a regular array of 32-bit FP transputers (T800 equivalent):

        -  8000 CPUs in same die area as an Apple M2 (16 TIPS) (ie. 36x faster than an M2)
        - 40000 CPUs in single reticle (80 TIPS)
        - 4.5M CPUs per 300mm wafer (10 PIPS)
      
      the transputer async link (and C001 switch) allows for decoupled clocking, CPU level redundancy and agricultural interconnect

      heat would be the biggest issue ... but >>50% of each CPU is low power (local) memory

      • xphos a day ago

        Nit pick here but ...

        I think CUDA shouldn't be labelled as SIMD but as SIMT. The difference in overhead between the two approaches is vast. A true vector machine is far more efficient, but with all of the massive headaches of actually programming it. CUDA and SIMT have a huge benefit in that if statements can actually execute different code for active/inactive lanes, i.e. different instructions execute on the same data in some cases, which really helps. Your view might also be that the same instructions operate on different data, but the fork-and-join nature behaves very differently.

        I enjoyed your other point, though, about the comparison of machines.

        • sifar a day ago

          Really curious why you think programming a vector machine is so painful? In terms of what? And what exactly do you mean by a "true vector machine"?

          My experience with RVV (I am aware of vector architecture history, just using it as an example) so far indicates that while it is not the greatest thing, it is not that bad either. You play with what you have!

          Yes, compared to regular SIMD it is a step up in complexity, but nothing a competent SIMD programmer cannot reorient to. Designing a performant hardware CPU is another matter, though - lots of (micro)architectural choices and tradeoffs that can impact performance significantly.

          • xphos 14 hours ago

            No - it's in terms of flexibility of programming: SIMD is less flexible if you need any decision making. In SIMD you are also typically programming in intrinsics, not at a higher level like in CUDA.

            For example I can write:

              for (int tid = 0; tid < n; tid += num_threads) { C[tid] = A[tid] * B[tid] + D[tid]; }

            In SIMD, yes, I can stride the array 32 at a time (or, in RVV, vl at a time), but generally speaking, as new architectures come along I need to rewrite that loop for the wider add and multiply instructions, increase the width of the lanes, etc. But in CUDA or other GPU SIMT strategies I just need to bump the compiler and maybe change one num_threads variable, and it will be vectorizing correctly.
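
            (For contrast, roughly what that same loop looks like in x86 AVX2+FMA intrinsics; illustrative only, and exactly the kind of code you rewrite when the vector width changes:)

              #include <immintrin.h>

              // C[i] = A[i] * B[i] + D[i], 8 floats per iteration.
              // Widen to AVX-512 and every line changes; the CUDA version doesn't.
              void fma_avx2(const float* A, const float* B, const float* D, float* C, int n) {
                  int i = 0;
                  for (; i + 8 <= n; i += 8) {
                      __m256 a = _mm256_loadu_ps(A + i);
                      __m256 b = _mm256_loadu_ps(B + i);
                      __m256 d = _mm256_loadu_ps(D + i);
                      _mm256_storeu_ps(C + i, _mm256_fmadd_ps(a, b, d));
                  }
                  for (; i < n; ++i)          // scalar tail
                      C[i] = A[i] * B[i] + D[i];
              }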

            Even with things like RVV, which I am actually pushing for my SIMD machine to move toward, these problems exist, because it's really hard to write length-agnostic code in SIMD intrinsics. That said, there is a major benefit in terms of performance per watt. All that SIMT flexibility costs power; that's why Nvidia GPUs can burn a hole through the floor while the majority of phones have a series of vector SIMD machines that are constantly computing matrix and FFT operations with your pocket becoming only slightly warmer.

  • PaulHoule 2 days ago

    The classic example is the Lisp Machine. Hypothetically a purpose-built computer would have an advantage running Lisp, but Common Lisp was carefully designed so that it could attain high performance on an ordinary "32-bit" architecture like the 68k or x86 as well as SPARC, ARM and other RISC architectures. Java turned out the same way.

    It's hard to beat a conventional CPU tuned up with superscalar, pipelines, caches, etc. -- particularly when these machines sell in such numbers that the designers can afford heroic engineering efforts that you can't.

    • gizmo686 2 days ago

      It's also hard to beat conventional CPUs when transistor density doubles every two years. By the time your small team has crafted its purpose-built CPU, the big players will have released a general-purpose CPU on the next generation of manufacturing abilities.

      I expect that once our manufacturing abilities flatline, we will start seeing more interest in specialized CPUs again, as it will be possible to spend a decade designing your Lisp machine.

      • PaulHoule 2 days ago

        I never thought I could make a custom CPU until I came across

        https://en.wikipedia.org/wiki/Transport_triggered_architectu...

        But when I saw that I thought, yeah, I could implement something like that on an FPGA. It's not so much a language-specific CPU as an application-specific CPU. If you were building something that might be FPGA + CPU or FPGA with a soft core it might be your soft core, particularly if you had the right kind of tooling. (Wouldn't it be great to have a superoptimizing 'compiler' that can codesign the CPU together with the program?)

        It has its disadvantages, particularly that the whole thing will lock up if a fetch is late. For workloads where memory access is predictable I'd imagine you could have a custom memory controller, but my picture of how that works is fuzzy. For unpredictable memory access, though, you can't beat the mainstream CPU -- my biz dev guy and I had a lot of talks with a silicon designer (he had some patents for a new kind of 'vector' processor) who schooled us on how many ASIC and FPGA ideas that sound great on paper can't really fly because of the memory wall.

      • yjftsjthsd-h 2 days ago

        There's also a matter of scale. By the time you've made your first million custom CPUs, the next vendor over has made a billion generic CPUs, sold them, and then turned the money back into R&D to make even better ones.

    • wbl 2 days ago

      John McCarthy had amazing prescience given that Lisp was created in 1958 and the RISC revolution wasn't until at least 1974.

      There's certainly some dynamic-language support in CPUs: the indirect branch predictors and branch target predictors wouldn't be as large if there weren't so much JavaScript, and implementations are written to make those circuits work well.

      • PaulHoule 2 days ago

        Half of the Dragon Book is about parsing and the other half is about reconciling the lambda calculus (any programming language with recursive functions) with the von Neumann/Turing approach to computation. Lisp manages to skip the first.

  • Legend2440 a day ago

    >in fact lots of standard big O complexity analysis assumes a von Neumann machine – O(1) random access.

    In practice, this is a lie - real computers do not offer O(1) random access. Physics constrains you to O(sqrt N).

    https://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html
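
    (A rough C++ sketch of the usual way to see this for yourself, with invented names and arbitrary sizes: chase one random cycle through working sets of increasing size, and ns/access climbs at each cache boundary instead of staying flat:)

      #include <chrono>
      #include <cstddef>
      #include <cstdio>
      #include <numeric>
      #include <random>
      #include <utility>
      #include <vector>

      // Chase a single random cycle through `n` indices and report ns per access.
      // Latency steps up as the working set falls out of L1, then L2, L3, then DRAM.
      double ns_per_access(std::size_t n) {
          std::vector<std::size_t> next(n);
          std::iota(next.begin(), next.end(), std::size_t{0});
          std::mt19937_64 rng{42};
          for (std::size_t i = n - 1; i > 0; --i)       // Sattolo's shuffle: one big cycle
              std::swap(next[i], next[rng() % i]);

          const std::size_t steps = 20'000'000;
          std::size_t i = 0;
          auto t0 = std::chrono::steady_clock::now();
          for (std::size_t s = 0; s < steps; ++s) i = next[i];   // dependent loads
          auto t1 = std::chrono::steady_clock::now();
          volatile std::size_t sink = i; (void)sink;             // keep the chase alive
          return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
      }

      int main() {
          for (std::size_t n = std::size_t{1} << 12; n <= (std::size_t{1} << 26); n <<= 2)
              std::printf("%10zu elements: %6.1f ns/access\n", n, ns_per_access(n));
      }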

  • dang a day ago

    Related. Others?

    The “high-level CPU” challenge (2008) - https://news.ycombinator.com/item?id=13741269 - Feb 2017 (127 comments)

    The “high-level CPU” challenge (2008) - https://news.ycombinator.com/item?id=10358153 - Oct 2015 (58 comments)

    The "high-level CPU" challenge - https://news.ycombinator.com/item?id=107221 - Jan 2008 (4 comments)

  • gwbas1c a day ago

    I get that today's CPUs are orders of magnitude faster than the CPUs of the past. I also think it's important to admit that today's computers aren't orders of magnitude faster than yesterday's computers.

    There are many reasons for that, but mostly it's because it's not worth it to optimize most use cases. It's clear that, for the author's use case (image recognition), it's "worth it" to optimize to the level they discuss.

    Otherwise, we shouldn't assume that doubling a CPU's speed will magically double an application's speed. The author alludes to bottlenecks inside the computer that appear "CPU bound" to many programmers. These bottlenecks are still there, even when tomorrow's CPU is "twice as fast" as today's CPUs.

  • xphos a day ago

    I think Lisp is one idea, but what about other languages like BQN? Horrible to type, but it can represent some really high-level algorithms and has a totally different idea of what instructions can be. The idea of having inner and outer products formalized is cool. There are other array-programming languages like it; most seem to get an efficient C implementation of a stdlib, but an instruction set for them would probably work, since it's not really about memory allocations but about operations on memories.

    It's not suited, at least in my naive understanding, to general-purpose computing, but it carves out its own space.

    https://mlochbaum.github.io/BQN/index.html

  • pyrolistical a day ago

    Clearly memory latency is the bottleneck. I see 2 paths we can take.

    1. Memory-oriented computers

    2. Fewer memory abstractions

    We already did the first, they are called GPUs.

    What I imagine for the second is full control over the memory hierarchy. L1, L2, L3, etc are fully controlled by the compiler. The compiler (which might be just in time) can choose the latency vs throughput. It knows exactly how all the caches are connected to all numa nodes and their latencies.

    The compiler can choose to skip a cache level for latency-critical operations. The compiler knows exactly the block size and cache capacity. This way it can shuffle data around with pipelining to hide latency, instead of it working by accident, as it does right now.
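
    (Today we only get a faint, hint-based version of this; e.g. on x86, a software prefetch with a cache-level hint plus non-temporal stores that write around the caches. A small illustrative sketch, assuming dst is 16-byte aligned and n is a multiple of 4:)

      #include <cstddef>
      #include <immintrin.h>

      // Hints, not control: the hardware is still free to ignore the prefetch.
      void copy_stream(const float* src, float* dst, std::size_t n) {
          for (std::size_t i = 0; i < n; i += 4) {
              // Ask for data ~64 floats ahead to be pulled toward the caches (T0 = all levels).
              _mm_prefetch(reinterpret_cast<const char*>(src + i + 64), _MM_HINT_T0);
              __m128 v = _mm_loadu_ps(src + i);
              _mm_stream_ps(dst + i, v);   // non-temporal store: bypass the caches
          }
          _mm_sfence();                    // make the streaming stores globally visible
      }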

  • nromiun 2 days ago

    But what is the fundamental difference between C compiling to assembly and LuaJIT doing the same as a JIT? Both are very fast. Both are high-level compared to assembly. Except one gets the low-level language tag and the other does not.

    I don't buy the argument that you absolutely need a low level language for performance.

    • IainIreland 2 days ago

      This isn't about languages; it's about hardware. Should hardware be "higher-level" to support higher level languages? The author says no (and I am inclined to agree with him).

  • pinewurst 2 days ago

    (2008)

  • uticus a day ago

    > My challenge is this. If you think that you know how hardware and/or compilers should be designed to support HLLs, why don't you actually tell us about it, instead of briefly mentioning it?

    surprised there's no mention of Itanium in the article

  • nurettin a day ago

    Don't RISC and SIMD address different dimensions of this problem?

    • xphos 14 hours ago

      RISC-V, or RISC in general: not really, they are simply the low-level CPU model done in a particular way. SIMD I think classifies the same way: it's still doing low-level operations like adds, multiplies, etc., but it reduces the overhead of issuing them by operating in larger batches, say 32 at a time.

      • nurettin an hour ago

        I mean some CPU instructions match high-level operations done on lists: loading of arrays of structs, efficient operations on arrays. It feels like the discussion just ignores all this.

  • moomin 2 days ago

    I mean, I don’t have an answer to this, but I’ll bet Simon Peyton Jones has some ideas…

    • wmf a day ago

      He doesn't. (At least not good ones.) That's one point of the article: high-level CPUs are something that sounds like a good idea for someone else to work on but that's an illusion.