Very interesting to see. Efficiency (E) cores use only 7% of the energy that Performance (P) cores do performing the same task and take about 4x as long to do it.
So about 13.5x the energy (23 J when run on P cores vs. under 1.7 J when run on E cores) for about 4x the performance
So BeOS has a place in this universe
This may come largely from clock speed needing disproportionately more energy the higher it goes (and disproportionately less the lower it goes).
This answer (based on an old source) even says power consumption increases with the cube of the clock speed: https://physics.stackexchange.com/posts/61937/revisions
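Where that cube law comes from, as a sketch: the standard dynamic-power model for CMOS, together with the common assumption that supply voltage has to scale roughly linearly with frequency to keep timing closed:

```latex
% Dynamic power: switching activity \alpha, capacitance C, voltage V, frequency f
P_{\text{dyn}} = \alpha C V^{2} f, \qquad V \propto f \;\Rightarrow\; P_{\text{dyn}} \propto f^{3}
% Energy for a fixed task: runtime t \propto 1/f, so
E = P_{\text{dyn}} \, t \propto f^{3} \cdot f^{-1} = f^{2}
```

One caveat worth keeping in mind: under this model *power* scales roughly as f^3, but *energy* for a fixed task only as f^2, since the faster clock also finishes sooner.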
Though this would mean a 4x increase in clock speed would consume 4^3 = 64 times as much energy, which is more extreme than what is observed here in the Apple chip. So either the clock-speed/power relation is different now, or the P cores do not actually run at 4x the clock speed. Cache size etc. may also play a role in performance.
Isn't that called Dennard scaling [1]?
[1]: https://en.wikipedia.org/wiki/Dennard_scaling
No, the "cube law" is related to varying clock speed rather than to varying transistor size.
Per the article, the clock speed difference is much smaller than the ~4x performance difference (4.5 GHz vs 2.6 GHz, i.e. 1.7x). So more than half of the performance advantage of the P cores has to come out of the uarch difference (wider structures etc.). Meanwhile there will be other factors besides clock frequency, e.g. the P cores might use a different cell library than the E cores.
Makes sense. This would suggest the difference in power draw may not mainly come from the clock frequency, since (1.7x)^3=5x, which is significantly less than the 13.5x in power draw.
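The arithmetic in this sub-thread, as a quick sketch (all figures taken from the comments above):

```python
# Compare the observed P-vs-E energy ratio with what a pure
# cube-law frequency model would predict, using the numbers
# quoted in this thread.
p_freq_ghz = 4.5   # P-core clock (per the article)
e_freq_ghz = 2.6   # E-core clock (per the article)
p_energy_j = 23.0  # energy for the task on P cores
e_energy_j = 1.7   # energy for the task on E cores

freq_ratio = p_freq_ghz / e_freq_ghz              # ~1.73x
cube_law_prediction = freq_ratio ** 3             # ~5.2x
observed_energy_ratio = p_energy_j / e_energy_j   # ~13.5x

print(f"frequency ratio:       {freq_ratio:.2f}x")
print(f"cube-law prediction:   {cube_law_prediction:.2f}x")
print(f"observed energy ratio: {observed_energy_ratio:.2f}x")
```

The observed 13.5x gap is well above the ~5x the cube law predicts from frequency alone, which supports the point that uarch width, cell library, and cache differences carry much of the remaining factor.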
I wonder what changes? In-order vs OOO? Fewer int/fp units? Are they fully instruction set compatible?
From the article about the instruction set:
> This is believed to be identical to ARMv9.2-A without Scalable Vector Extension (SVE) supported by M4 P cores, enabling the same threads to be run on either core type.
It also explicitly mentions half of the processing units per core and lower clock speeds.
No, they are not. "Efficiency" cores are generally tailored to do simple stuff well: less floating point, more integer work. Things like file parsing, serving web pages, responding to network events, and so on.
When you need heavy computation (encoding, scientific workloads, etc.), P cores are your only choice.
As a result, the server ecosystem will fragment a bit. For HPC and number-crunching, P-core-heavy processors will be sold; for cloud and CRUD systems, E cores will dominate.
From the article:
"[The E cores'] instruction set is the same as M4 P cores, ARMv9.2-A without its Scalable Vector Extension (SVE)"
I mean, lacking SVE doesn't let them run every workload the P cores can, which is what I said already. They are of course not different ISAs at the core, but they're not the same cores per se. When a core is missing extensions, you can't schedule work on it while those instructions are present; forcing it would kill your application with an illegal instruction error.
So heavy computational stuff is not the target of E cores; you need P cores for that.
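For reference, "killed with an illegal instruction error" means the kernel delivers SIGILL to the process, which terminates it by default. A minimal sketch of the mechanism (simulating the fault with os.kill rather than executing a real unsupported instruction, so it runs anywhere):

```python
import os
import signal

# SIGILL is the signal the kernel delivers when a process executes
# an instruction the current core does not implement. The default
# disposition kills the process; a handler can intercept it, which
# is what a trap-and-migrate scheme would have to hook into.
def on_sigill(signum, frame):
    print("caught SIGILL: unsupported instruction")

signal.signal(signal.SIGILL, on_sigill)
os.kill(os.getpid(), signal.SIGILL)  # simulate the fault
print("process survived because the signal was handled")
```

As the reply below this comment notes, trapping and migrating on every such fault would be expensive, which is one reason schedulers prefer cores that are fully ISA-compatible.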
The quoted sentence is poorly worded. The P and E cores are fully instruction set compatible. It isn't possible to meaningfully know ahead of time if the instructions will be used on any given core, and trapping along with a migration is expensive and needless. The M4 as a whole does not support the SVE/SVE2 extensions anywhere, which is what the article is saying in the given quote.
The M4, on all cores, does support the SME extension, which includes a subset of SVE instructions at a wider vector length (512 bits), optimized for throughput. SME instructions are handled by a separate accelerator coprocessor unit attached to each cluster, shared by a number of cores, and don't "exist" in the normal instruction pipeline (where e.g. 256-bit SVE2 instructions would be handled). This was all true of the proprietary Apple AMX extensions in previous cores as well, as far as I'm aware.
Curiously we had this argument before, roughly two decades ago.
Doesn't this happen in cycles? A piece of specialized hardware appears, then it gets integrated into the processor to make way for an even more specialized variant of the thing, rinse and repeat.
The external thing doesn't have to be "more powerful" per se. So E cores are low-power helpers that have been folded back into the CPU in a slightly altered form.
What prevents them from being dedicated to processing a network stream, or just handling the network thread of a service, making them "efficient accelerators" in a sense?
No, yes, yes
Isn't basically every modern cpu core OOO?