r/hardware • u/Geddagod • 2d ago
Discussion TSMC N3B/E "High Performance" Cores Compared
Area (all figures in mm²)
Core Name | Node + Logic Lib + Metal Layer Count | Core without "L2 block" | Core without L2 SRAM arrays | Core | Hypothetical 4C "CCX" |
---|---|---|---|---|---|
Apple M4 P-Core | N3E + 3-2 + 19 DT, 17 mobile | 3.09^1 | - | 3.09 | 18.62 (16MB) |
Apple M3 P-core | N3B + 2-2? + 19 DT | 2.62^1 | - | 2.62 | 17.16 (16MB) |
Intel Lion Cove ARL | N3B + ? + ? | 2.62^5 | 3.12?/3.26^2 | 4.54 | 27.79 (12MB) |
Intel Lion Cove LNL | N3B + ? + 20 | 2.62^5 | 3.13?/3.26^2 | 4.26 | 23.62 (12MB) |
AMD Zen 5 Dense | N3E + ? + ? | 2.20^5 | 2.60 | 2.99 | 16.65 (8MB) |
Qcom Oryon V2L | N3E + 2-2 + 17 | 2.11^1 | - | 2.11 | 20.02^3 (24MB) |
Mediatek X925 | N3E + 3-2 + 15 | 1.85 | 2.38 | 2.93 | 16.52^(3,4) (12MB) |
Xiaomi X925 | N3E + ? + 17 | 1.70 | 1.96 | 2.56 | 16.88^(3,4) (16MB) |
Intel Skymont | N3E + ? + 20 | 1.09^1 | - | 1.09 | 11.63^4 (3MB) |
1: Not sure if cores with shared L2s have any, or nearly as much, of the logic for handling the L2 cache in the core itself. For the cores with shared L2 blocks, this is just the core area.
2: The first number is without the L1 or L1.5 SRAM array, the second one includes it.
3: Likely an overestimate in comparison to the phone SoCs, as the phone SoCs don't have the interconnect "fabric" included in their area, just the cores and the cache blocks themselves.
4: The L3 capacity here is completely arbitrary; the capacities were chosen for ease of measurement, given how the L3 slices are distributed.
5: A small but sizable chunk of the area of these cores seems to come from the CPL/clock section of the core, which may not have to be so large, but is due to the geometry of the rest of the core.
Core | Revised Core without L2 Block Area |
---|---|
Intel Lion Cove | 2.33 |
AMD Zen 5 Dense | 2.08 |
I would not take any of the numbers as "hard numbers", but I do think the general area ranking of the cores is fairly accurate.
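To make the "Hypothetical 4C CCX" column concrete, here is a minimal sketch (not necessarily the exact method used above) assuming the column is simply four copies of the measured core plus one shared cache block, with no fabric. The cache-block areas below are back-solved from the table's own entries rather than measured independently, so they carry all of the uncertainty noted in the footnotes.

```python
# Hypothetical cluster area = N copies of the core + one shared cache block (no fabric).
# Cache-block areas are back-solved from the table above, not independent measurements.

def ccx_area(core_mm2: float, cache_block_mm2: float, cores: int = 4) -> float:
    """Hypothetical cluster area: N copies of the core plus one shared cache block."""
    return cores * core_mm2 + cache_block_mm2

# (core area in mm^2, 4C "CCX" area in mm^2 from the table, shared cache block)
rows = {
    "Apple M4 P-core": (3.09, 18.62, "16MB SL2"),
    "AMD Zen 5 Dense": (2.99, 16.65, "8MB L3"),
}

for name, (core, ccx, cache) in rows.items():
    implied_cache = ccx - 4 * core           # back-solved shared cache block area
    rebuilt = ccx_area(core, implied_cache)  # reproduces the table entry by construction
    print(f"{name}: {cache} block ≈ {implied_cache:.2f} mm², 4C CCX ≈ {rebuilt:.2f} mm²")
```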
The cores have the following configs:
Core | Cache Hierarchy (fastest to LLC) | Total Cache Capacity in a 4x CCX |
---|---|---|
Intel Lion Cove ARL | 48KB L0 + 192KB L1 + 3MB L2 + 3MB L3 | 24.96 MB |
Qcom Oryon V2L | 192KB L1 + 24MB SL2 | 24.768 MB |
Xiaomi X925 | 64KB L1 + 2MB L2 + 4 MB L3 | 24.256 MB |
Intel Lion Cove LNL | 48KB L0 + 192KB L1 + 2.5MB L2 + 3MB L3 | 22.96 MB |
Mediatek X925 | 64 KB L1 + 2MB L2 + 3MB L3 | 20.256 MB |
Apple M4 P-core | 128KB L1 + 16MB SL2 | 16.512 MB |
Apple M3 P-core | 128KB L1 + 16MB SL2 | 16.512 MB |
AMD Zen 5C | 48KB L1 + 1MB L2 + 2MB L3 | 12.192 MB |
Intel Skymont | 32KB L1 + 4MB SL2 + 4MB CL3 | 8.128 MB |
The total cache capacity is a bit of a meme since it doesn't include latency, but I do think some interesting things can be noticed regardless.
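For reference, the "Total Cache Capacity in a 4x CCX" column looks like straightforward arithmetic: private levels counted once per core, shared levels counted once. A small sketch of that sum (the totals only line up if 1 MB is treated as 1000 KB, which appears to be the convention used here):

```python
# Total cache in a hypothetical 4-core cluster: private levels scale with core count,
# shared levels are counted once. 1 MB is treated as 1000 KB to match the table.
KB = 1 / 1000  # MB per KB

def total_capacity_4c(private_mb, shared_mb, cores=4):
    return cores * sum(private_mb) + sum(shared_mb)

# Two rows from the table above
lion_cove_arl = total_capacity_4c([48 * KB, 192 * KB, 3], [12])  # 48KB L0 + 192KB L1 + 3MB L2 private; 4x3MB L3 shared
oryon_v2l = total_capacity_4c([192 * KB], [24])                  # 192KB L1 private; 24MB SL2 shared
print(round(lion_cove_arl, 3), round(oryon_v2l, 3))              # 24.96, 24.768
```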
Performance
SPECint2017 and Cinebench 2024 numbers are from Geekerwan; the Skymont, ARL, and LNL numbers are from the 265K and 258V, via David Huang.
GB6 is from the Geekbench browser website
Core | SPECint2017 | GB6 | Cinebench 2024 |
---|---|---|---|
Apple M4 P-Core | 132 | 148 | 124 |
Intel LNC ARL | 120 | 129 | 100 |
Apple M3 P-core | 113 | 118 | 99 |
Qcom Oryon V2L | 100 | 108 | - |
Xiaomi X925 | 100 | 104 | - |
Mediatek X925 | 100 | 100 | - |
Intel LNC LNL | 95 | 111 | 81 |
Intel Skymont ARL | 92 | - | - |
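These look like relative scores rather than raw results, apparently normalized so that the Mediatek X925 = 100. A sketch of that normalization; the raw values below are placeholders purely for illustration, only the ratios mirror the table:

```python
# Normalize raw per-core benchmark scores so the chosen baseline core = 100.
# The raw numbers here are made-up placeholders; only the ratios mirror the table above.
def normalize(raw: dict[str, float], baseline: str) -> dict[str, int]:
    base = raw[baseline]
    return {name: round(100 * score / base) for name, score in raw.items()}

raw_specint = {"Mediatek X925": 10.0, "Apple M4 P-Core": 13.2, "Intel LNC ARL": 12.0}
print(normalize(raw_specint, baseline="Mediatek X925"))
# {'Mediatek X925': 100, 'Apple M4 P-Core': 132, 'Intel LNC ARL': 120}
```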
The difference between form factors:
Mobile to Laptop (Geekbench 6 scores)
From the Geekbench browser website
Core | Laptop | iPad | Mobile |
---|---|---|---|
Apple M4 P-core | 113 | 107 | 100 |
Apple M3 P-core | 108 | - | 100 |
Laptop to Desktop (Geekbench 6 scores)
From Notebookcheck (averages used for mobile platforms)
Core | Desktop | Mobile |
---|---|---|
LNC ARL | 115 | 100 |
Zen 5 | 119 | 100 |
IPC differences in SPECint2017 between form factors
From David Huang
Core | Form Factor | Difference |
---|---|---|
Zen 4 | Desktop vs Mobile | 13% |
Zen 5 | Desktop vs Mobile | 12% |
While I do believe that the P-cores from AMD and Intel have to be designed to take advantage of the higher power budget that larger form factors afford, I also think that placing the mobile cores into those same form factors would lead to at least a marginal perf improvement.
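As a rough illustration of that point, the GB6 ratios in the tables above can be chained to project a phone-measured score into larger form factors. Applying Apple's phone-to-laptop factor and AMD's laptop-to-desktop factor to a different core is an assumption on my part, not a measurement:

```python
# Chain the form-factor ratios from the tables above: phone -> laptop uses the Apple M4
# GB6 ratio, laptop -> desktop uses the Zen 5 GB6 ratio. Purely illustrative; these
# factors are not guaranteed to transfer to other cores.
PHONE_TO_LAPTOP = 113 / 100    # Apple M4 P-core: laptop vs iPhone
LAPTOP_TO_DESKTOP = 119 / 100  # Zen 5: desktop vs mobile (laptop) platform

def project(phone_score: float) -> dict[str, float]:
    laptop = phone_score * PHONE_TO_LAPTOP
    return {"phone": phone_score, "laptop": round(laptop, 1),
            "desktop": round(laptop * LAPTOP_TO_DESKTOP, 1)}

print(project(100))  # {'phone': 100, 'laptop': 113.0, 'desktop': 134.5}
```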
Zen 5C performance question marks
In the previous dense server core product, the Fmax of the server sku was dramatically lower than what the core could achieve in other products.
Core | Server | Mobile | OC'd |
---|---|---|---|
Zen 4 Dense | 3.1GHz | 3.7GHz | ~4GHz |
Zen 5 Dense | 3.7GHz | - | - |
This may be the case for Zen 5C as well.
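A quick sanity check on those figures: Zen 4 Dense showed roughly 29% of clock headroom between its server SKU and the OC'd number. Assuming (and it is only an assumption) similar headroom for Zen 5 Dense puts its ceiling in the high-4 GHz range, which is the context for the ~4.7GHz figure discussed in the overview below.

```python
# Fmax headroom implied by the table: Zen 4 Dense clocked ~29% higher OC'd than in its
# server SKU. Applying the same headroom to Zen 5 Dense is an assumption, not data.
zen4d_server, zen4d_oc = 3.1, 4.0   # GHz, from the table ("~4GHz" OC)
headroom = zen4d_oc / zen4d_server  # ≈ 1.29
zen5d_server = 3.7
print(f"Zen 4 Dense headroom: +{headroom - 1:.0%}")
print(f"Zen 5 Dense with the same headroom: ~{zen5d_server * headroom:.1f} GHz")
```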
Power
Perf/watt curves (in Geekerwan's videos) are the only way to get a full picture of power, but as a generalization:
Both the Apple M4 and M3 P-cores (as well as their implementations in the iPhones) have better perf/watt than the X925 and Oryon-L.
The best case for Intel's P-core is that its perf/watt is as good as the X925 and Oryon-L. I think it is likely much, much worse.
According to Geekerwan, at a package power of ~5 watts, the M3 performs ~40% better than LNL. At around ~7-8 watts, the M4 performs closer to ~50% better.
Meanwhile we have David Huang showing an M4 Pro at ~3.7 watts per core scoring ~33% better than a 9950X, and the 9950X has an outright better curve than the 265K's LNC.
That gap is nowhere near as large as the one between Apple's cores and the other ARM cores in the mobile space.
Power is, IMO, by far the hardest thing to really quantify, because one has to figure out how to measure "core only" power while isolating it from the power of the rest of the SoC. Then there's also the problem of software vs. hardware measurements... I imagine only engineers at the respective companies really know the power draw of a specific core.
Core Overview
Apple seems to have the best N3 cores in both perf and power.
The M4's P-core is pretty large by any standard; however, it saves a bit of area thanks to its cache hierarchy, making a hypothetical 4x CCX not that large. The shared L2 really only seems to show up in client/mobile systems so far, though, and I think it presents its own challenges in server. The chips that use this hierarchy (Qualcomm and Apple) both have very high memory bandwidth per core, which could be an issue when scaling it up to server, and the cache-per-core capacity, with all the cores running and competing for the shared cache, would be lower than the competition's.
The M3's and ARL's P-cores have pretty much the same perf and similar area; however, the M3 P-core is almost certainly dramatically more power efficient than an LNC P-core. Additionally, in terms of CCX area, LNC ends up way, way larger thanks to the different cache hierarchy.
Zen 5 Dense is pretty interesting, as I really, really doubt its Fmax only goes up to 3.7GHz. Performance is an unknown, as is power, but from a purely area perspective it seems pretty comparable to the ARM P-cores. That means that to get comparable performance to those cores, Zen 5C would need an Fmax of ~4.7GHz, a ~30% boost over what it clocks at in server parts. Which isn't extremely unexpected, I guess, considering a similar percentage boost was seen between server Zen 4 Dense and OC'd Zen 4 Dense, but it still seems pretty hard to believe.
To be fair to the x86 cores, they have 256/512-bit vector width, unlike the ARM cores listed, which only have 128-bit vector width. This really does seem to cost a decent bit of area, especially for Zen 5. AFAIK Zen 5C in server has the full Zen 5 AVX-512 implementation, and we have already seen how much area can be shaved off Zen 5 just by choosing not to go for the full AVX-512 implementation:
Core | Core Area (mm²) |
---|---|
Zen 5 DT (N4P) | 4.46 (+12%) |
Zen 5 MBL (N4P) | 3.99 |
Qualcomm's custom cores honestly don't seem to afford any distinct advantage over ARM's stock cores on N3 (even accounting for them being implemented by different companies). Perhaps I am missing something, but there seems to be no meaningful area advantage (even considering the larger cache capacity), and no meaningful performance or power advantage either. I also think Qualcomm's cache hierarchy, transferred over to server unchanged, would be pretty unique in the server space, seeing how no other major (or even relatively smaller) company seems to offer that sort of setup in servers. Maybe Qualcomm's P-cores would scale up better at higher power than ARM's cores in a laptop/desktop form factor? It is interesting to see Qualcomm presumably choosing to sacrifice IPC for higher clocks versus the stock ARM cores; perhaps they think perf at even higher power would be greater, or maybe Vmin is lower?
It's wild to see Mediatek's X925 end up both larger and slower than Xiaomi's implementation. No idea why or how. Compared to the rest of the lineup, the X925s aren't nearly as powerful as the other P-cores, but they are also a decent bit smaller. The lower performance may well be down to the fact that they are all in a phone form factor, so it might be pretty interesting to see how the X925 in Nvidia's upcoming DG10 chip performs.
u/Geddagod 2d ago
Credit to this post for the idea. The die-shot measurements come from pixel-peeping Kurnal's die shots on his Twitter, the Geekerwan video with the perf/watt curves for all the ARM chips is here, and the numbers from David Huang are from here and his Twitter. Metal layer count and lib type are from TechInsights (non-paywalled).
u/xternocleidomastoide 2d ago
Just an FYI: IP and structure sizing within a die for latest-gen SKUs tends to be rather confidential data. So you should take the area data with a huge grain of salt, as the error will be tremendous.
Usually the only data we can sort of be confident about is total SoC power consumption (and that is only if the reviewer/study had the equipment to isolate socket power, for example; otherwise you revert to full-system power data, which may include DDR, UFS/SSD, display, etc.) and performance data reported from benchmarks.
u/hwgod 2d ago
> IP and structure sizing within die, for latest gen SKUs tend to be rather confidential data
It's very clear how big the core is just from measurement. That is not confidential data on shipping parts.
u/xternocleidomastoide 1d ago
It's very clear you have no direct experience with this issue
u/Just_Maintenance 2d ago
OP got the sizes from die shots.
It's not trivial to tell what is what, but the size should be accurate.
u/xternocleidomastoide 2d ago
That's not accurate. It is an estimate, at best.
E.g. the shape of the actual IP is usually not disclosed unless it is an old design, so you have to make a best guess of what's what and where. IPs/structures having complex geometries complicates the area calculation further. Die imaging has its own error. Etc., etc. So there are tremendous cumulative errors at play.
u/Just_Maintenance 2d ago
There is no law of physics that says estimates can't be accurate.
And yeah, you said the same thing I said. It's not trivial to tell what is what, although it's not impossibly hard either. You know a CPU has 12 cores? Find a structure that is duplicated exactly 12 times; there's an extremely good chance that's the core.
And once you have identified the block, have literal photos of the dies, and have the size of the die, it's trivial to get the size of a block with reasonable accuracy.
The hard variable here is mostly cache, fabrics, etc., stuff that's outside the core.
u/xternocleidomastoide 2d ago edited 2d ago
Ah, OK. Glad you were able to solve, with a couple of hand waves, a very serious technical challenge/problem. Send your resume to a competitive analysis group of your choice.
u/hwgod 2d ago
> hand waves a very serious technical challenge/problem
This is not an actual challenge for anyone in the industry. Especially not at a core/L2 granularity.
u/Professional-Tear996 2d ago
The OP is not in the industry. He used to be a university student goofing around because he has access to the IEEE Xplore library.
u/ResponsibleJudge3172 1d ago
We see the same technique used to great effect by analysts like Locuza, SemiAnalysis, SemiAccurate, etc. Knowing that client chips use a ring bus, you can clearly see where the P-cores and core clusters are and where the L3 cache is, for example.
u/xternocleidomastoide 1d ago
Sure, the sausage looks ridiculously simple to make when looking at it from the store's window.
u/Professional-Tear996 2d ago
Yeah, just because the person taking the die shot or the person trying to label the die shot thinks this structure is the BTB, for example, doesn't mean that it is the BTB with a high level of certainty.
u/xternocleidomastoide 2d ago
Yeah. A lot of people don't realize how proprietary a lot of this information really is.
It's perfectly fine to make speculative analyses, as long as the lack of validation/uncertainty in the approach is made clear, and the data is presented as purely guesses/estimates.
u/Professional-Tear996 2d ago
Also from the ARL die shots, it is clear that the last pair of P-cores on the right on either side of the ring have a very different core and L3 layout than the other six.
So what is the hypothetical "4-core CCX" the OP is talking about in the context of ARL? Does it include these "different" cores if we are talking about the area?
u/hwgod 2d ago
> it is clear that the last pair of P-cores on the right on either side of the ring have a very different core and L3 layout
They do not. The core is identical. Actually look at a picture.
u/Professional-Tear996 2d ago
The core is different around the L3/ring area. I've seen the pictures.
u/xternocleidomastoide 1d ago
Even when the cores are identical, the boundaries are not well defined from just the die shots (which have their own resolution issues to begin with), for example. There are ridiculously uneducated calls being made about what the "core" is.
Especially since IPs nowadays are not "square"; they have somewhat complex geometries.
The best you can do from a die shot, without any input from the designer, is making some very high level guess and maybe notice some trends. But that's just about it.
Even within the same organization, you are not going to get intra die sizing details unless you have the proper authorization/involvement level.
I have read a few of these "studies" on cores I have actually worked on, and it is hilarious how far off the authors are: grossly overestimating or underestimating the areas, making up arbitrary "cores," or completely ignoring the less "obvious" chunks/components.
The problem, as usual, is the Dunning-Kruger aspect of this: knowing so little about it that they don't know how little they know.
These discussions are hilarious exercises of blind people trying to describe an elephant, and getting very triggered when a person, who has actually seen an elephant, comes along. LOL.
u/Professional-Tear996 2d ago edited 2d ago
Lol, what is this? Why is the GB6 score only 3 digits? SPECint scores being 3 digits means it's the multicore score, i.e. n-copy.
And GB includes both integer and FP workloads. SPECint is well, only integer.
EDIT: I'm also not a fan of power-performance curves in the style of Geekerwan. Those curves are meaningless unless the interpolation error in fitting the actual measured data points is much less than the error in measuring the data points themselves.
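To make that objection concrete: reading "performance at X watts" off such a curve means interpolating between a handful of measured (power, perf) points. A minimal sketch with piecewise-linear interpolation, using placeholder sample points rather than anyone's actual data:

```python
# Estimate performance at a target package power by piecewise-linear interpolation over
# measured (power, perf) samples. Any measurement error in the samples, plus the choice
# of interpolation/fit, propagates directly into the number read off the curve.
def perf_at(power_w: float, samples: list[tuple[float, float]]) -> float:
    samples = sorted(samples)
    for (p0, s0), (p1, s1) in zip(samples, samples[1:]):
        if p0 <= power_w <= p1:
            return s0 + (s1 - s0) * (power_w - p0) / (p1 - p0)
    raise ValueError("target power is outside the measured range")

measured = [(3.0, 70.0), (5.0, 95.0), (8.0, 115.0)]  # placeholder (watts, score) pairs
print(perf_at(4.0, measured))  # 82.5
```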
u/team56th 2d ago edited 2d ago
What my uneducated eyes and brain see is that while Apple's engineering (building such big cores and still getting that perf/watt) is impressive, AMD is the one that really blows me away. It might be the most optimal balance between perf, power efficiency (wattage), and cost efficiency (chip size), while using this design across 3 separate product lineups (even more when you take the C cores into consideration).
u/Tman1677 2d ago edited 1d ago
It is crazy, if these numbers are as good as stated here (big if), how much worse the overall end-user experience is compared to Macs. It goes to show how much comes down to other things like software and hardware integration, even more so than the CPU design.
Edit: clarification
u/blueredscreen 1d ago
These are the kinds of comparisons only Cheetos gamers LARPing as silicon engineers could produce. Don't even know where to start with all that.
u/Exist50 1d ago
These are the kind of comparisons companies do themselves.
u/blueredscreen 1d ago
> These are the kind of comparisons companies do themselves.
They have the raw data directly. They quite positively don't engage in this nonsense.
u/Exist50 1d ago
Not for their competitors they don't. They take a microscope, current probes, etc, and they measure. Same as anyone else can, and just what the OP's sources did.
I'm not sure why you find the idea so absurd.
u/blueredscreen 1d ago
> Not for their competitors they don't. They take a microscope, current probes, etc, and they measure. Same as anyone else can, and just what the OP's sources did.
They don't extrapolate mumbo jumbo out of particular numbers. You can only go so far.
u/DerpSenpai 2d ago edited 2d ago
Your calculations for the "4C CCX" area are wrong for QC. Their logic is that the L2 is shared by up to 6 cores in a single CCX-like structure, so they have 12MB for 2 cores on the 8 Elite, but on the X Elite it's 6 cores sharing those 12MB. On the X Elite Gen 2, for example, it's 3x 12MB L2s, one per CCX, each CCX with 6 cores. One of those CCXs is Oryon M, not L.
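The sharing arithmetic behind that correction, sketched out. The per-cluster figures are the ones quoted above; treating a 4-core slice of a cluster as a "CCX" is only for comparison with OP's tables, not something Qualcomm ships:

```python
# Per-core share of the same 12MB shared L2 under the two cluster configurations quoted
# above, and what a 4-core slice of each would carry. Illustrative only.
configs = {
    "8 Elite (12MB L2, 2x Oryon-L)": (12, 2),
    "X Elite Gen 2 (12MB L2, 6 cores)": (12, 6),
}
for name, (l2_mb, cores) in configs.items():
    per_core = l2_mb / cores
    print(f"{name}: {per_core:.0f} MB/core, {4 * per_core:.0f} MB for a 4-core slice")
```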