Intel 11th Gen Core RocketLake (i9-11900K) Review & Benchmarks – AVX512 Performance

What is “RocketLake”?

It is the desktop/workstation version of the true “next generation” Core (gen 10+) architecture – finally replacing the ageing “Skylake (SKL)” arch and its many derivatives that are still with us (“CometLake (CML)”, etc.). It is a combination of the “IceLake (ICL)” CPU cores launched about a year ago and the “TigerLake (TGL)” Gen12 XE graphics cores launched recently.

With the new core we get a plethora of new features – some previously only available on the HEDT platform (AVX512 and its many friends), improved L1/L2 caches, an improved memory controller and PCIe 4.0 buses. Sadly Intel had to back-port the older ICL (not TGL) cores to 14nm – we shall have to wait for future (desktop) processors (“AlderLake (ADL)”) to see 10nm on the desktop…

  • 14nm+++ improved process (not 10nm)
  • Up to 8C/16T “Cypress Cove” cores aka 14nm+++ “Sunny Cove” from ICL – Claimed core IPC uplift of +19%
  • AVX512 and all of its friends (1x FMA Unit)
  • Increased L1D cache to 48kB (50% larger)
  • Increased L2 cache to 512kB (2x as large)
  • PCIe 4.0 (up to 32GB/s with x16 lanes) – 20 (16+4 or 8+8+4) lanes
  • Thunderbolt 3 (and thus USB 3.2 2×2 support @ 20Gbps) integrated
  • Hardware fixes/mitigations for vulnerabilities (“JCC”, “Meltdown”, “MDS”, various “Spectre” types)
  • TVB (Thermal Velocity Boost) for Core i9 only
  • ABT (Adaptive Boost Technology) for Core i9-K(F) only

The biggest change is support for AVX512-family instruction set, effectively doubling SIMD processing width (vs. AVX2/FMA) as well as adding a whole host of specialised instructions that even the HEDT platform (SKL/KBL-X) does not support yet:

  • AVX512-VNNI (Vector Neural Network Instructions, dlBoost INT8/INT16) e.g. convolution
  • AVX512-VBMI, VBMI2 (Vector Byte Manipulation Instructions) various use
  • AVX512-BITALG (Bit Algorithms) various use
  • AVX512-VAES (Vector AES) accelerating block-crypto
  • AVX512-GFNI (Galois Field) – e.g. used in AES-GCM
  • SHA HWA accelerating hashing (SHA1, SHA2-256 only)

While some software may not yet have been updated for AVX512 – it was previously reserved for HEDT/servers – this mainstream launch means you can pretty much guarantee that just about every vectorised algorithm already ported to AVX2/FMA will soon gain an AVX512 path. VNNI, IFMA, etc. can also accelerate the low-precision neural networks likely to be used on mobile platforms.
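To illustrate why the porting effort is usually modest, below is a minimal multiply-accumulate kernel (a hypothetical sketch, not Sandra’s actual code) in both AVX2/FMA and AVX512 form – the loop structure is identical and only the vector width doubles (remainder/tail handling omitted for brevity):

```cpp
#include <immintrin.h>
#include <cstddef>

// AVX2/FMA path: processes 8 x FP32 per iteration.
void axpy_avx2(float* y, const float* x, float a, std::size_t n) {
    const __m256 va = _mm256_set1_ps(a);
    for (std::size_t i = 0; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        _mm256_storeu_ps(y + i, _mm256_fmadd_ps(va, vx, vy)); // y = a*x + y
    }
}

// AVX512 path: identical structure, 16 x FP32 per iteration (twice the work per instruction).
void axpy_avx512(float* y, const float* x, float a, std::size_t n) {
    const __m512 va = _mm512_set1_ps(a);
    for (std::size_t i = 0; i + 16 <= n; i += 16) {
        __m512 vx = _mm512_loadu_ps(x + i);
        __m512 vy = _mm512_loadu_ps(y + i);
        _mm512_storeu_ps(y + i, _mm512_fmadd_ps(va, vx, vy)); // y = a*x + y
    }
}
```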

The caches finally get an update too, considering that the competition has deployed massive caches in its latest products: L1D is 50% larger and L2 doubles (2x), but L3 has not been increased. We will measure latencies in a future article.

From a security point of view, RKL mitigates all (currently reported) vulnerabilities in hardware/firmware (Spectre 2, 3/a, 4; L1TF, MDS, etc.) except BCB (Spectre V1, which has no hardware solution) and thus should not require the slower software mitigations that affect performance (especially I/O). RKL is also not affected by the JCC erratum that needs mitigating through software (compiler) changes on older processors.

The memory controller supports higher DDR4 speeds (up to 3200MT/s) while the cache/TLB systems have been improved, which should help both CPU and GPU performance (see the corresponding article) as well as reduce power versus older designs. Again, we will measure bandwidth and latencies in a future article.

PCIe 4.0 finally arrives on Intel desktop and should drive wide adoption for both discrete graphics (GP-GPUs, including Intel’s) and NVMe SSDs, with ~8GB/s of transfer bandwidth on x4 lanes and up to ~32GB/s on x16. Note that the DMI link between the CPU and the I/O Hub has also been upgraded (roughly doubling CPU/Hub transfer bandwidth).
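For reference, the quoted figures follow directly from the PCIe 4.0 link parameters (16 GT/s per lane, 128b/130b encoding) – a quick back-of-the-envelope calculation per direction:

```cpp
#include <cstdio>

int main() {
    // PCIe 4.0: 16 GT/s per lane, 128b/130b line encoding.
    const double gtransfers  = 16.0;
    const double efficiency  = 128.0 / 130.0;
    const double gb_per_lane = gtransfers * efficiency / 8.0;  // ~1.97 GB/s per lane, per direction

    std::printf("x4  (NVMe SSD): %.1f GB/s\n", gb_per_lane * 4.0);   // ~7.9 GB/s
    std::printf("x16 (GP-GPU)  : %.1f GB/s\n", gb_per_lane * 16.0);  // ~31.5 GB/s
    return 0;
}
```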

On the desktop – while Intel is launching new 500-series motherboards, RKL should be compatible with 400-series boards after a BIOS update. Just as with AMD, PCIe 4.0 may only be available on 500-series boards.

As the i9 has the same number of cores/threads as the i7, it has two technologies enabled to boost its clocks (illustrated in the sketch after the list):

  • TVB – when temperature is below Tvb (~70C), Turbo clock is boosted by 100MHz throughout
  • ABT – when power/thermal headroom is available, the all-core Turbo can match the 2C setting
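To make the interaction clearer, here is a purely illustrative model of how the two boosts stack on top of the regular Turbo bins – this is not Intel’s actual control logic, and the thresholds/clocks are assumptions for the example only:

```cpp
// Illustrative model only -- not Intel's algorithm; numbers are placeholders.
struct TurboLimits {
    double all_core_ghz;   // e.g. 4.8 GHz regular all-core Turbo
    double two_core_ghz;   // e.g. 5.2 GHz 1-2 core Turbo
};

double effective_turbo_ghz(const TurboLimits& t, int active_cores,
                           double temp_c, bool power_headroom) {
    double clock = (active_cores <= 2) ? t.two_core_ghz : t.all_core_ghz;

    // ABT (i9-K/KF only): with power/thermal headroom, the all-core Turbo
    // is allowed to rise towards the 1-2 core setting.
    if (power_headroom && active_cores > 2)
        clock = t.two_core_ghz;

    // TVB (i9 only): below the TVB temperature (~70C), add a further 100 MHz.
    if (temp_c < 70.0)
        clock += 0.1;

    return clock;
}
```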

CPU (Core) Performance Benchmarking

In this article we test CPU core performance.

Hardware Specifications

We are comparing the top-of-the-range Intel with competing architectures as well as competitors (AMD), with a view to upgrading to a mid-range but high-performance design.

Specifications Intel Core i9 11900K 8C/16T (RKL) AMD Ryzen 9 5900X 12C/24T (Zen3) Intel Core i9 10900K 10C/20T (CML) Intel Core i9 7900X 10C/20T (SKL-X) Comments
Arch(itecture) Cypress Cove / RocketLake Zen3 / Vermeer Comet Lake Skylake-X Not the very latest arch.
Cores (CU) / Threads (SP) 8C / 16T [-20%] 2M / 12C / 24T 10C / 20T 10C / 20T Two fewer cores.
Rated Speed (GHz) 3.5 [-6%] 3.7 3.7 3.3 Base clock is lower
All/Single Turbo Speed (GHz) 4.8 – 5.3 [+2%] 4.5 – 4.8 4.9 – 5.2 4.0 – 4.3 Turbo is 2% higher.
Power TDP/Turbo (W) 125 – 250 105 – 135 125 – 155 140 – 308 TDP is the same on paper.
L1D / L1I Caches 8x 48kB 12-way [+50%] / 8x 32kB 8-way 12x 32kB 8-way / 12x 32kB 8-way 10x 32kB 8-way / 10x 32kB 8-way 10x 32kB 8-way / 10x 32kB 8-way L1D is 50% larger.
L2 Caches 8x 512kB 16-way [2x] 12x 512kB 16-way 10x 256kB 16-way 10x 1MB 16-way L2 has doubled.
L3 Caches 16MB 16-way [=] 2x 32MB 16-way 20MB 16-way 13.75MB 11-way L3 is the same
Microcode (Firmware) 06A701-3C* (ver 60*) 8F7100-1009 06A505-C8 (ver 214) 065504-69 (ver 105) Revisions just keep on coming.
Special Instruction Sets AVX512, VNNI, SHA, VAES AVX2/FMA, SHA AVX2/FMA AVX512 More AVX512!
SIMD Width / Units 512-bit (1x FMA Unit) 256-bit 256-bit 512-bit (2x FMA Units) Widest SIMD units, but single FMA.
Price / RRP (USD) $539 [+8%] $549 $499 $999 A little bit more expensive.

Disclaimer

This is an independent review (critical appraisal) that has not been endorsed nor sponsored by any entity (e.g. Intel, etc.). All trademarks acknowledged and used for identification only under fair use.

Note: We (SiSoftware) claim copyright over the scores (benchmark results) posted to the Ranker. Please see:
Privacy: Who owns the data (scores) posted to the Ranker?

And please, don’t forget small ISVs like ourselves in these very challenging times. Please buy a copy of Sandra if you find our software useful. Your custom means everything to us!

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest-performing instruction sets. “RocketLake” (RKL) supports all modern instruction sets including AVX512, VNNI, SHA HWA, VAES and naturally the older AVX2/FMA and AES HWA.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.
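For those wondering what “large pages” mean in practice: on Windows an application can explicitly request 2MB pages, which cuts TLB pressure in bandwidth-heavy tests. A minimal sketch (not how Sandra allocates its buffers internally; it assumes the account holds the “Lock pages in memory” / SeLockMemoryPrivilege right):

```cpp
#include <windows.h>
#include <cstdio>

int main() {
    SIZE_T large = GetLargePageMinimum();            // typically 2MB on x64
    if (large == 0) { std::puts("Large pages not supported"); return 1; }

    // Size must be a multiple of the large-page minimum.
    void* buf = VirtualAlloc(nullptr, 16 * large,
                             MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                             PAGE_READWRITE);
    if (!buf) { std::printf("VirtualAlloc failed: %lu\n", GetLastError()); return 1; }

    // ... benchmark buffers would live here, backed by 2MB pages ...

    VirtualFree(buf, 0, MEM_RELEASE);
    return 0;
}
```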

Native Benchmarks Intel Core i9 11900K 8C/16T (RKL) AMD Ryzen 9 5900X 12C/24T (Zen3) Intel Core i9 10900K 10C/20T (CML) Intel i9 7900X 10C/20T (SKL-X) Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 545 [+6%] 589 512 397 RKL is 6% faster than CML.
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 551 [+16%] 594 477 392 A 64-bit integer workload RKL is 16% faster.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 285 [-14%] 388 330 264 With floating-point, RKL is 14% slower
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 239 [-12%] 324 273 221 With FP64 nothing much changes.
With legacy integer workloads RKL manages to beat CML (6-16% faster) even with the latter's 2 extra cores – which is impressive; with floating-point it is about 13% slower, which is acceptable.

Zen3 with 50% more cores naturally wins all tests – but the integer tests are not that much higher; floating-point tests do show the power of Zen3’s FPU cores – though “only” ~35% faster than RKL. If RKL had 10C then Zen3 might have had real competition.

Note that, being “legacy”, none of these benchmarks support AVX512; while we could update them, they are not vectorisable in the “spirit they were written in” – thus single-lane AVX512 cannot run faster than AVX2/SSEx.

BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 1,688* [+14%] 2,000 1,475 1,590* With AVX512 RKL is 14% faster.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 569* [-3%] 805 589 548* With 64-bit AVX512 integers, RKL is marginally (3%) slower.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 236* [+2.1x] 157 109 125* Using 64-bit int to emulate Int128 RKL is over 2x faster due to AVX512.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 1,774* [+31%] 2,000 1,358 1,930* In this floating-point vectorised test RKL is 31% faster.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 998* [+28%] 1,190 778 1,210* Switching to FP64 SIMD AVX512 code, RKL is 28% faster.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 43.68 [+20%] 49.47 36.53 50* Using FP64 to mantissa extend FP128 RKL is still 20% faster.
With heavily vectorised SIMD workloads RKL can leverage its AVX512 support to dominate CML despite the latter's 2 extra cores (10C), and it also matches the old SKL-X with its 2 extra cores (which also uses AVX512, but with 2x FMA units).

However, Zen3 with 50% more cores (12C) is tough to beat and not even AVX512 manages to lift RKL that far overall. With a second AVX512-FMA unit (like SKL-X), RKL might have managed it – but with a single unit it was a big ask.

Note:* using AVX512 instead of AVX2/FMA.

BenchCrypt Crypto AES-256 (GB/s) 22.95 [+30%] 18.74 17.69 34 Memory bandwidth rules here.
BenchCrypt Crypto AES-128 (GB/s) 22.94 [+30%] 18.7 17.68 34 No change with AES128.
BenchCrypt Crypto SHA2-256 (GB/s) 33.56*** [+2x] 37 16.24 26*** With AVX512, RKL is 2x faster.
BenchCrypt Crypto SHA1 (GB/s) 38.84*** [+36%] 39 28.61 39*** Less compute intensive SHA1.
BenchCrypt Crypto SHA2-512 (GB/s) 22.88*** 16.56 21*** SHA2-512 is not accelerated by SHA HWA.
The memory sub-system is crucial here, and RKL manages to beat its mainstream competition. Despite having SHA HWA, RKL is fastest using multi-buffer AVX512 – 2x faster than CML (multi-buffer AVX2) – but cannot quite beat Zen3's SHA HWA. With its 12C, Zen3 needs even faster memory to feed all the cores in streaming tests.

* using VAES (AVX512 VL) instead of AES HWA. [note we need much faster memory for VAES to beat AES HWA]

** using SHA HWA instead of multi-buffer AVX2. [note multi-buffer AVX2 is slower than SHA hardware-acceleration]

*** using AVX512 B/W [note multi-buffer AVX512 is faster than using SHA hardware-acceleration]
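As a rough illustration of the VAES note above: with VAES (on 512-bit registers) a single instruction applies an AES round to four independent 128-bit blocks, where legacy AES HWA (AES-NI) handles one block per instruction. A minimal sketch, assuming the 11 AES-128 round keys are already expanded and broadcast to all four lanes (real block-crypto code also wraps a counter/chaining mode around this):

```cpp
#include <immintrin.h>

// Requires VAES + AVX512F. rk[0..10] are the expanded AES-128 round keys,
// each broadcast to all four 128-bit lanes of a __m512i.
__m512i aes128_encrypt_x4(__m512i blocks, const __m512i rk[11]) {
    __m512i state = _mm512_xor_si512(blocks, rk[0]);        // initial AddRoundKey
    for (int r = 1; r <= 9; ++r)
        state = _mm512_aesenc_epi128(state, rk[r]);         // 9 full rounds, 4 blocks at a time
    return _mm512_aesenclast_epi128(state, rk[10]);         // final round
}
```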

BenchFinance Black-Scholes float/FP32 (MOPT/s) 379 [-7%] 542 409 340 Black-Scholes is unvectorised and compute-heavy.
BenchFinance Black-Scholes double/FP64 (MOPT/s) 332 [+1%] 449 329 289 Using FP64 RKL is 1% faster.
BenchFinance Binomial float/FP32 (kOPT/s) 81.83 [-1%] 228 82.27 71 Binomial uses thread shared data thus stresses the cache & memory system.
BenchFinance Binomial double/FP64 (kOPT/s) 85.19 [+4%] 120 82.07 68 With FP64 code RKL is 4% faster.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 264 [-14%] 427 307 249 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 117 [-6%] 182 124 104 Switching to FP64, RKL is 6% slower.
With non-SIMD financial workloads, RKL with just 8C is at a big disadvantage – but it remains competitive against CML (within a few percent either way), though naturally it cannot beat Zen3 with 12C. As before, Intel's CPUs rely heavily on SIMD instruction sets to win.

Still, it is more likely that the GP-GPU will be used for such workloads today.
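For context, this is the kind of scalar, transcendental-heavy code the financial tests represent – a minimal Black-Scholes call-option pricer (an illustrative sketch, not Sandra's implementation):

```cpp
#include <cmath>

// Standard normal CDF via the complementary error function.
double norm_cdf(double x) { return 0.5 * std::erfc(-x / std::sqrt(2.0)); }

// Black-Scholes price of a European call: spot S, strike K, rate r, volatility sigma, maturity T.
double bs_call(double S, double K, double r, double sigma, double T) {
    const double d1 = (std::log(S / K) + (r + 0.5 * sigma * sigma) * T) / (sigma * std::sqrt(T));
    const double d2 = d1 - sigma * std::sqrt(T);
    return S * norm_cdf(d1) - K * std::exp(-r * T) * norm_cdf(d2);
}
```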

BenchScience SGEMM (GFLOPS) float/FP32 460* [-20%] 815 575 685* In this tough vectorised algorithm, RKL is 20% slower than CML.
BenchScience DGEMM (GFLOPS) double/FP64 210* [-2%] 377 215 231* With FP64 vectorised code, RKL is 2% slower.
BenchScience SFFT (GFLOPS) float/FP32 22* [-13%] 26.87 25.43 39* FFT is also heavily vectorised but memory dependent .
BenchScience DFFT (GFLOPS) double/FP64 14.43* [+16%] 13 12.43 19.6* With FP64 code, RKL is 16% faster.
BenchScience SNBODY (GFLOPS) float/FP32 616* [-6%] 856 657 592* N-Body simulation is vectorised but with more memory accesses.
BenchScience DNBODY (GFLOPS) double/FP64 190* [-14%] 338 222 179* With FP64 RKL is 14% slower.
With highly vectorised SIMD code (scientific workloads), RKL can show the power of its AVX512 units – though here we see some regressions: it cannot consistently beat CML (despite the latter's extra cores) and cannot beat Zen3 either.

* using AVX512 instead of AVX2/FMA3
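As an illustration of what such “highly vectorised” kernels look like, here is a textbook AVX512 inner loop for SGEMM (a sketch, not Sandra's kernel): each FMA performs 16 FP32 multiply-accumulates, and RKL can issue only one 512-bit FMA per cycle where SKL-X can issue two.

```cpp
#include <immintrin.h>
#include <cstddef>

// Row-major SGEMM inner kernel: C[i][j..j+15] += sum_k A[i][k] * B[k][j..j+15].
void sgemm_tile(float* C, const float* A, const float* B,
                std::size_t N, std::size_t K, std::size_t i, std::size_t j) {
    __m512 acc = _mm512_loadu_ps(&C[i * N + j]);
    for (std::size_t k = 0; k < K; ++k) {
        __m512 a = _mm512_set1_ps(A[i * K + k]);       // broadcast one A element
        __m512 b = _mm512_loadu_ps(&B[k * N + j]);     // 16 consecutive B elements
        acc = _mm512_fmadd_ps(a, b, acc);              // 16 multiply-accumulates per FMA
    }
    _mm512_storeu_ps(&C[i * N + j], acc);
}
```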

Neural Networks NeuralNet CNN Inference (Samples/s) 60.08* [-13%] 133 68.95 54.19* Despite AVX512, RKL is 13% slower.
Neural Networks NeuralNet CNN Training (Samples/s) 8.21* [+2x] 31.23 4.26 9.16* RKL is 2x faster but nowhere near Zen3.
NeuralNet RNN Inference (Samples/s) 102* [-8%] 205 111 71.81* Despite AVX512, RKL is 8% slower.
NeuralNet RNN Training (Samples/s) 4.59* [+2x] 4.62 2.52 6.08* RKL is again 2x faster matching Zen3.
Despite using AVX512, RKL overall just manages to match CML (with its 2 extra cores) but cannot match Zen3. Further optimisation of the AVX512 code-path is needed to make it worthwhile on RKL.

* using AVX512 instead of AVX2/FMA (not using VNNI yet)
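Since the note above says the AVX512 path does not use VNNI yet, here is a sketch (hypothetical, not Sandra's code) of what a VNNI INT8 inner loop looks like – a single VPDPBUSD multiplies 64 unsigned×signed byte pairs and accumulates them into 16 INT32 lanes, replacing the widen/multiply/add sequence otherwise required:

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Requires AVX512-VNNI (plus AVX512F). Tail handling omitted for brevity.
std::int32_t dot_u8s8(const std::uint8_t* a, const std::int8_t* b, std::size_t n) {
    __m512i acc = _mm512_setzero_si512();
    for (std::size_t i = 0; i + 64 <= n; i += 64) {
        __m512i va = _mm512_loadu_si512(a + i);
        __m512i vb = _mm512_loadu_si512(b + i);
        acc = _mm512_dpbusd_epi32(acc, va, vb);   // 64 u8*s8 products accumulated into 16 INT32 lanes
    }
    return _mm512_reduce_add_epi32(acc);          // horizontal sum of the partial sums
}
```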

CPU Image Processing Blur (3×3) Filter (MPix/s) 5,210* [+56%] 2,000 3,337 4,440* In this vectorised integer workload RKL is 56% faster.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 2,439* [+85%] 1,270 1,318 2,000* Same algorithm but with more shared data; RKL is now 85% faster.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 1,246* [+84%] 861 676 1,000* Again the same algorithm but with even more shared data; 84% faster.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 1,984* [+74%] 1,390 1,137 1,560* Different algorithm but still a vectorised workload; RKL is 74% faster.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 305* [+3x] 160 102 217* Still vectorised code RKL is 3x faster.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 84.37* [+50%] 52.88 56.43 68* Similar improvement here of about 50%.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 4,728* [=] 1,480 4,724 3,440* With integer workload, RKL matches CML.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 1,022* [+28%] 622 800 777* In this final test again with integer workload RKL is 28% faster.
Thanks to AVX512 (and friends) RKL really flies through this benchmark, with massive gains – up to 3x over CML despite the latter's 2 extra cores. It also blows past both SKL-X and even Zen3 with its 50% more cores.

Intel’s salvation is SIMD and AVX512. Long term, most software will be updated to use AVX512 which will favour RKL.

* using AVX512 instead of AVX2/FMA

Aggregate Score (Points) 13,000* [+15%] 14,780 11,300 13,090* Across all benchmarks, RKL is 15% faster.
The RKL i9 (8C/16T 11900K) is 15% faster overall than the CML i9 (10C/20T 10900K), which is a decent result though nowhere near what we saw when testing the RKL i7 (about 40% faster than the CFL i9, with the same number of cores).

Naturally, with 12C, Zen3 (5900X) is the fastest – but not by as much as you might expect: it is only ~14% faster overall.

Note*: using AVX512 not AVX2/FMA3.

Price/RRP (USD) $539 [+8%] $549 $499 $999 Price has gone up a bit by 8%.
Price Efficiency (Perf. vs. Cost) (Points/USD) 24.11 [+7%] 26.92 22.64 12.69 RKL is 7% more efficient.
Despite the performance increase, the slight price increase means overall value is only 7% better than CML and its 10C – a little disappointing. The difference is so slight that it all depends on actual market prices: if CML (i9-10900K) is discounted, it remains worth buying.

Zen3 is only a bit more expensive and packs 50% more cores which makes it a clear winner. But lack of availability and price gouging by sellers means the difference can be eroded especially should Intel decide to discount RKL in the run-up to ADL.

Power/TDP (W) 125 – 250W [=] 105 – 155W 125 – 155W 140 – 308W TDP is the same – at least on paper
Power Efficiency (Perf. vs. Power) (Points/W) 104 [+15%] 140 90.4 90.57 Based on TDP, RKL is 15% more efficient.
If we go by the listed TDP – which matches CML – then naturally RKL is 15% more power-efficient, the same as its performance increase. But reports show the actual power drawn is much higher; if efficiency is what you want, old CML may be the better choice.
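For transparency, the efficiency figures in the tables appear to be simple ratios of the aggregate score to RRP and to rated TDP respectively – the numbers above are reproduced by:

```cpp
#include <cstdio>

int main() {
    const double score = 13000.0;   // RKL aggregate score (points)
    const double price = 539.0;     // RRP (USD)
    const double tdp   = 125.0;     // rated TDP (W)

    std::printf("Price efficiency: %.2f points/USD\n", score / price); // ~24.1
    std::printf("Power efficiency: %.1f points/W\n",   score / tdp);   // ~104
    return 0;
}
```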

Zen3, though, is ~35% more efficient, which pretty much blows everything out of the water even at listed TDP. It is clear that Intel has a long way to go to match AMD in power efficiency.

Across the SIMD tests we see the RKL i9 (11900K) with AVX512 about 15% faster than the CML i9 (10900K), which has more cores (10C vs. 8C), and it also manages to beat SKL-X with the same number of cores and AVX512! You can see just how much RKL has improved over SKL on the same node (14nm), over no fewer than 5 generations.

However, at the top end AMD's Zen3 now fields up to 2x more cores (e.g. 5950X with 16C) in a desktop socket at lower TDP (105W), which makes it pretty much unbeatable.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Summary: A good upgrade (~15% SIMD/AVX512 improvement): 7/10

Note 2: The microcode (aka CPU firmware) used in the tests above is revision 3C (ver 60); older or newer versions may exhibit different performance. We will update the review when we obtain our own sample and run our own tests.

Note 3: All benchmarks were run using the latest supported instruction sets – i.e. AVX512; Sandra does allow you to disable it and run AVX2/FMA3 or even AVX, SSE4 or SSE2. Once we get our sample we will show results using AVX2/FMA3.

We’ve really been waiting way too long for this – endless Skylake (SKL) derivatives (Gen 6, 7, 8, 9 and 10) then finally IceLake (ICL) and TigerLake (TGL) but for mobile (ULV) only. RocketLake (RKL) is not quite what we expected, as it’s not on 10nm and also using the older ICL cores (not TGL), but at least compatible with the 400-series (LGA 1200) platform. With AMD making steady improvements with Ryzen (series 2000, 3000 and now 5000), the top end i9 10900K CometLake (CML) could not really keep up: we really needed a new contender from Intel.

RocketLake (RKL) i9 (11900K) reduces the number of cores (8C vs. the 10C of the 10900K) and, price-wise, also goes against competition with far more cores (the 12C Ryzen 9 5900X). While AVX512 + arch improvements (IPC, larger caches, TLBs, etc.) make it about 40% faster core-for-core than CFL/CML, the core-count discrepancy is an issue – unlike, say, the RKL i7-11700K that goes against competition with the same core count.

Thus, overall it can only beat the 10C CML by 15% and is just ~12% slower than Zen3, which is competitive. With the latest microcode it also manages to be about 15% faster than the RKL i7-11700K, but naturally at higher cost. It also requires a much better cooling solution, especially for AVX512 compute workloads like the ones Sandra runs.

We have been waiting for PCIe 4.0, already supported by AMD 2 generations ago, and by now by modern GP-GPUs and NVMe SSDs; perhaps not quite needed, but with mass-market gaming consoles (Sony PS5, Microsoft XBox SS/XS) using NVMe/PCIe4 storage – it became a bit of an embarrassment for top-end PCs to lack it.

We also get native USB 3.2 2×2 at 20Gbps and Thunderbolt 3 (depending on the board), which can greatly help external storage (especially NVMe or RAID arrays) and even provide network connections (through Thunderbolt) faster than the meagre 1Gbps (Gigabit) Ethernet… [note some boards include 2.5Gbps Ethernet, but sadly not all]

We are not covering the integrated graphics (XE-LP) here; unless you are using it as a (pure) compute workstation/server, you really need to pair it with a top-end GP-GPU.

While the i7-11700K is keenly priced ($399), the i9 is somewhat more expensive for not a lot more performance, which makes it poorer value. Unless you really want top performance at any cost, the i7 is the one to go for – especially as it has the same number of cores and features (except TVB and ABT, which are clock-speed related). This generation, the i9 is not quite what it used to be.

Summary: Good: 7/10

The Future: AlderLake (ADL) Hybrid (big.LITTLE)

As we have seen in our TigerLake (TGL) benchmarks, TGL improves significantly over ICL (and thus over RKL, which uses the same cores at 14nm) – so the 10nm ADL “big Core” is likely to be much faster than RKL and at much lower power. If it all goes to plan, ADL will be the Core we have all been looking for…

However, ADL is a hybrid processor, consisting of “big Core” cores + “LITTLE Atom” cores – and these kinds of hybrid architectures are a massive pain in the proverbial for (software) developers. Massive work is being done under the hood in Sandra to support such a hybrid architecture (detection/scheduler/benchmarks/UI), and it will live or die by how well the Windows scheduler manages the cores. The future will certainly be interesting…
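As a taste of the detection work involved: on hybrid parts each logical processor reports its core type through CPUID leaf 0x1A, so the query has to be made per-thread (pinned to the core in question). A minimal sketch using the MSVC __cpuidex intrinsic – the leaf/encoding is as documented by Intel, but treat the snippet as illustrative rather than Sandra's actual detection code:

```cpp
#include <intrin.h>
#include <cstdio>

int main() {
    int regs[4] = {};
    __cpuidex(regs, 0x1A, 0);                 // Hybrid Information leaf
    const unsigned core_type = (static_cast<unsigned>(regs[0]) >> 24) & 0xFF;

    if (core_type == 0x40)      std::puts("big Core");      // Intel Core
    else if (core_type == 0x20) std::puts("LITTLE Atom");   // Intel Atom
    else                        std::puts("non-hybrid or unknown");
    return 0;
}
```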

Further Articles

Please see our other articles on:

