Windows Arm64 on Qualcomm Snapdragon 7c Performance

What is “Qualcomm’s Snapdragon”?

It is a family of ARM processors/SoCs – analogous to Intel’s “Core” brand in the x86 world. It now spans 8 series going back years and is used in everything from laptops and tablets to phones and even smart-watches!

Unlike other ARM SoC manufacturers (Samsung, Huawei, MediaTek, etc.), Qualcomm has an ARM design licence (like Apple), modifies the standard ARM Cortex designs and re-brands them “Kryo”. In big/LITTLE (DynamIQ) designs, the big cores are branded “Gold” and the LITTLE cores “Silver”. In the latest designs with three core types, the top-end cores are branded “Prime”.

Qualcomm has designed custom “Mobile PC/Compute Platform” SoCs for Arm64 Windows 10/11 – that include the 7c series reviewed here and the updated 8cx (also known as Microsoft SQ). Older versions powered previous versions of Windows RT as well as Windows Phone, Windows CE, etc.

Using this ARM SoC family has allowed OEMs to launch lower-cost laptops and tablets – e.g. Samsung Galaxy Book, Lenovo, ECS (Elitegroup Computer Systems), etc. – that are similar in cost to Intel’s Atom devices but include features like 4G/LTE that are not normally found in this price range. While somewhat more expensive than Chromebooks, they run the full version of Windows 10/11 (even Pro/Enterprise) and can run 32-bit x86 applications under emulation (WoW – Windows on Windows).

Microsoft has used the high-end version 8cx (rebranded SQ) to launch an alternative Surface Pro X line, at slightly lower cost than the x86 Surface Pro line but also including 4G/LTE as standard.

What is “Windows Arm64”?

It is the 64-bit version of client Windows 10/11 for Arm64 (AArch64) devices – analogous to the current x64 Windows 10/11 for Intel & AMD CPUs. While “desktop” Windows 8.x has been available for ARM (AArch32) as Windows RT – it did not allow running of non-Microsoft native Win32 applications (like Sandra) and also did not support emulation for current x86/x64 applications.

We should also recall that Windows Phone (though now dead) has always run on ARM devices (and other architectures) and was first to unify the Windows desktop and Windows CE kernels. Windows was first ported to 64-bit with the long-dead Alpha64 (Windows NT 4), then Itanium IA64 (Windows XP 64), x64 (AMD64/EM64T) and now Arm64 – which shows the versatility of the NT micro-kernel.

By contrast, Windows 10/11 Arm64 is able to run native AArch64 applications (compiled for Arm64 like Sandra) as well as emulating 32-bit x86 and ARM (AArch32) applications through “WoW” (Windows on Windows emulation). But *not* x64 applications! Ouch!

While Arm64 Windows 10/11 includes in-box drivers for many peripherals/devices, it may not have a driver for a very new peripheral/device, and some manufacturers may not provide any Arm64 support at all. For some devices, the standard in-box “class” drivers (e.g. NVMe controller/SSD, AHCI controller/SSD, USB controller, keyboard, mouse, etc.) do work; otherwise a custom driver is required. [Note there has never been any emulation for kernel drivers, in any version of Windows]

Qualcomm 7c SoC Details

  • Qualcomm Snapdragon SC7180
  • ARMv8.2-A (Arm64) 64-bit cores
  • 2x Kryo 468 Gold “big-cores” (based on Cortex-A76) @ 2.4GHz
    • Arm64 v8.2 4-wide OoO design
    • 64kB + 64kB L1, 128-512kB L2 per core
    • ~25-33% performance increase over Cortex A75
      • Cortex A75 provided ~20% performance increase over Cortex A73
      • Cortex A73 provided ~30% performance increase over Cortex A72 (e.g. Raspberry Pi 4B cores)
  • 6x Kryo 468 Silver “little-cores” (based on Cortex-A55) @ 1.8GHz
    • Arm64 v8.2 2-wide in-order
    • 32kB + 32kB L1, 64kB-256kB L2 per core
    • ~15% power efficiency, ~18% performance increase over Cortex A53 (e.g. Raspberry Pi 3B(+) cores)
  • 8nm process (Samsung)
  • TDP ~7W
  • Unified 1MB L3 / LLC cache
  • System Cache 1MB (also used by GP-GPU)
  • Memory LP-DDR4X, 2x 16-bit channels @ 2133MHz (4266MT/s), ~17GB/s
  • Qualcomm Adreno 618 GP-GPU
    • 1CU 256SP 700MHz core clock
    • DirectX 12, 11 support
    • No OpenCL support in the driver. Boo!
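
The ~17GB/s memory figure in the list above follows from the memory configuration itself. As a minimal sketch (the function name and its parameters are ours, purely for illustration), the theoretical peak bandwidth is the transfer rate multiplied by the total bus width in bytes:

```python
def peak_bandwidth_gbs(transfers_mt_s: float, channels: int, width_bits: int) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = channels * width_bits / 8  # total bus width in bytes
    return transfers_mt_s * 1e6 * bytes_per_transfer / 1e9


# Snapdragon 7c as listed above: LP-DDR4X at 4266 MT/s, 2 channels of 16 bits each.
bw_7c = peak_bandwidth_gbs(4266, channels=2, width_bits=16)
print(f"Snapdragon 7c peak bandwidth: {bw_7c:.1f} GB/s")
```

This is a theoretical ceiling only; the sustained bandwidth a benchmark measures will be lower.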

Compared to Intel SoCs, we have more “real” cores (8) but the same number of threads (8T) as ARM has not implemented SMT (Hyper-Threading) as Intel and AMD have. Otherwise, we have a pretty modern process (8nm), decent clocks (in excess of 2.4GHz for the big cores), but a small L3/LLC cache (1MB) and low-bandwidth LP-DDR4X memory (just 2 channels of 16-bit).

ARM Advanced Instructions Support

SIMD: As in the x86 world, ARM supports SIMD instructions, called “NEON”, operating on 128-bit wide registers – equivalent to SSE2. However, there are 32 of them, while x64 SSE/AVX provide only 16; only AVX512 also provides 32. In most algorithms we can use them in batches of 4 – effectively making them 512-bit!

Unlike newer cores, the older A7X cores do not support SVE (“Scalable Vector Extension” – the successor to NEON); current designs are still just 128 bits wide, although they do provide more flexibility, especially when implementing complex algorithms.
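
To put the register counts above into perspective, here is a small sketch that tallies the architectural SIMD register file of each instruction set mentioned (register count multiplied by register width). The dictionary and helper names are ours, for illustration only:

```python
# (number of registers, register width in bits) for each instruction set discussed above.
SIMD_FILES = {
    "NEON (Arm64)": (32, 128),
    "SSE2 (x64)": (16, 128),
    "AVX2 (x64)": (16, 256),
    "AVX512 (x64)": (32, 512),
}


def register_file_bits(registers: int, width_bits: int) -> int:
    """Total architectural SIMD register capacity in bits."""
    return registers * width_bits


capacity = {name: register_file_bits(*spec) for name, spec in SIMD_FILES.items()}
for name, bits in capacity.items():
    print(f"{name:14s} {bits:6d} bits ({bits // 8} bytes)")

# Using NEON registers in batches of 4 gives 512 bits of data per batch.
neon_batch_bits = 4 * SIMD_FILES["NEON (Arm64)"][1]
```

By this count NEON offers the same total register capacity as AVX2, which is why batching four 128-bit registers is a workable substitute for wider registers; it does not, of course, change the width of the execution units themselves.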

Crypto: Similar to x86, ARM does provide hardware-accelerated (HWA) encryption/decryption (AES, SM4) as well as hashing (SHA1, SHA2, SHA3, SM3). Unlike, say, the Raspberry Pi SoCs, Qualcomm naturally includes the crypto extensions. Yay!

Virtualisation: The ARM cores do have hardware virtualisation – and here Qualcomm has the necessary firmware functionality to enable Hyper-V on Windows 11 to run corresponding ARM/Arm64 virtual machines. Or perhaps emulate x86 through QEMU? Both VMware and Proxmox include support should you wish to use a different virtualisation technology. Yay!

Security Extensions: ARM cores since ARMv6 (!) have included TrustZone secure virtualisation. AMD’s own recent CPUs all contain an ARM core (Cortex A5?) supporting TrustZone handling the security functionality (e.g. PSP / firmware-emulated TPM). See our article “Crypto-processor (TPM) Benchmarking: Discrete vs. internal AMD, Intel, Microsoft HV”.

TPM: in order to support Windows 11, Qualcomm does include TPM support. Thus you can use BitLocker disk encryption as well as other security-based virtualisation features (like Core Isolation / Memory Integrity). Yay!

Changes in Sandra to support ARM

As a relatively old piece of software (launched all the way back in 1997, yep we’re really old), Sandra contains a good amount of legacy but optimised code, with the benchmarks generally written in assembler (MASM, NASM and previously YASM) for x86/x64 using various SIMD instruction sets: SSE2, SSSE3, SSE4.1/4.2, AVX/FMA3, AVX2 and finally AVX512. All this had to be translated into more generic C/C++ code using a templated intrinsics implementation for both x86/x64 and ARM/Arm64.

As a result, some of the historic benchmarks in Sandra have substantially changed – with the scores changing to some extent. This cannot be helped as it ensures benchmarking fairness between x86/x64 and ARM/Arm64 going forward.

For this reason we recommend using the very latest version of Sandra and keeping up with updated versions, which are likely to fix bugs and improve performance and stability.

CPU core performance Benchmarking

In this article we test CPU core performance.

Hardware Specifications

We are comparing the Arm64 processors with x86/x64 processors of similar vintage – all running current Windows 10/11, latest drivers.

| Specifications | Microsoft SQ2 | Raspberry Pi 4B | Intel Core i5 6300 | Snapdragon 7c | Comments |
|---|---|---|---|---|---|
| Arch(itecture) | Kryo 495 Gold (Cortex A76) + Kryo 495 Silver (Cortex A55) 7nm | Cortex A72 16nm | Skylake-ULV (Gen6) | Kryo 468 Gold (Cortex A76) + Kryo 468 Silver (Cortex A55) 8nm | big+LITTLE cores vs. SMT |
| Launch Date | Q3 2020 | 2019 | Q3 2015 | 2020 | Older design but still recent |
| Cores (CU) / Threads (SP) | 4C + 4c (8T) | 4C / 4T | 2C / 4T | 2C + 6c / 8T | Same number of threads, but fewer big cores |
| Rated Speed (GHz) | 1.8 | 1.5 | 2.5 | 1.8 | Similar base clock |
| All/Single Turbo Speed (GHz) | 3.14 | 2.0 | 3.0 | 2.4 | Turbo could be higher |
| Rated/Turbo Power (W) | ~5-7 | ~4-5 | 15-25 | ~7 | Much lower TDP, 1/3x Intel |
| L1D / L1I Caches | 4x 64kB / 4x 32kB | 4x 32kB 2-way / 4x 48kB 3-way | 2x 32kB / 2x 32kB | 2x 64kB / 4x 32kB | Similar L1 caches |
| L2 Caches | 4x 512kB / 4x 128kB | 1MB 16-way | 2x 256kB | 2x 256kB / 4x 128kB | L2 could be higher |
| L3 Cache(s) | 4MB | n/a | 3MB | 1MB | Small L3 |
| Special Instruction Sets | v8.2-A, VFP4, AES, SHA, TZ, Neon | v8-A, VFP4, TZ, Neon | AVX2/FMA, AES, VT-d/x | v8.2-A, VFP4, AES, SHA, TZ, Neon | All 8.2 instructions |
| SIMD Width / Units | 128-bit | 128-bit | 256-bit | 128-bit | Neon is not wide enough |
| Price / RRP (USD) | $146 | $55 (whole SBC) | $281 | $99 | Pi price is for the whole SBC (including memory) |

Disclaimer

This is an independent review (critical appraisal) that has not been endorsed nor sponsored by any entity (e.g. Qualcomm, etc.). All trademarks acknowledged and used for identification only under fair use.

And please, don’t forget small ISVs like ourselves in these very challenging times. Please buy a copy of Sandra if you find our software useful. Your custom means everything to us!

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets, on both x64 and Arm64 platforms.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64/Arm64, latest drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations where supported.

| Native Benchmarks | Microsoft SQ2 (4C 3.15GHz + 4c 1.8GHz) Arm64 Native | Raspberry Pi 4B (4C 2GHz) Arm64 Native | Intel Core i5 6300 (2C/4T 2.4-3GHz) x64 Native | Snapdragon 7c (2C 2.4GHz + 6c 1.8GHz) Arm64 Native | Comments |
|---|---|---|---|---|---|
| CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) | 100 | 32.6 | 45.57** | | SQ is 12% faster than an i5 ULV! |
| CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) | 102 | 33.42 | 54.52** | | A 64-bit integer workload is 19% faster! |
| CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) | 87.86 | 25.1 | 39.05* | | With floating-point, SQ is 35% faster! |
| CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) | 79 | 23.76 | 32.17* | | With FP64 SQ is almost 50% faster |

In these standard legacy tests, 7c does well but cannot keep up with the newer 8cx which dominates everything. But it still beats the old SKL 2C/4T Core which is no mean feat – considering that SKL core did not evolve much for many years.

It is a good showing for Microsoft/Qualcomm and Arm64 – this means native “standard” applications (non-vectorised workload) perform well on ARM despite the huge TDP difference (5W vs 15-25W). For mobile (phone, laptop, tablet) ARM seems unbeatable.

Note*: using SSE2-3 SIMD processing.

Note**: using AVX2 processing.

| Native Benchmarks | Microsoft SQ2 | Raspberry Pi 4B | Intel Core i5 6300 | Snapdragon 7c | Comments |
|---|---|---|---|---|---|
| BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) | 95.12** | 36.6** | 192* | 44.75** [1/5x] | 7c is 1/5x SKL. |
| BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) | 60.99** | 24** | 68.83* | 33.16** [1/2x] | With 64-bit integers, 7c is 1/2x SKL. |
| BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) | 10.85** | 3.28** | 11.93* | 6.89** [-43%] | Using 64-bit int to emulate Int128, 7c is 43% slower. |
| BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) | 187** | 52.5** | 156* | 107** [-32%] | In this vectorised FP32 test, 7c is just 32% slower. |
| BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) | 106** | 29.73** | 89.89* | 58** [-36%] | Switching to FP64, 7c is 36% slower. |
| BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) | 4.84** | 1.31** | 4.57* | 2.26** [1/2x] | Using FP64 to mantissa extend FP128, 7c is 1/2x SKL. |

With heavily vectorised SIMD workloads – 7c cannot really keep up with SKL nor SQ but at least it is about 2x faster than a Pi 4B which can still run Windows 10/11 decently. Floating-point performance is especially encouraging at just 30% lower than SKL for far less money and power.

Let’s also note that Sandra itself has only recently been ported to Arm64 and is thus not as well optimised there as it is after over 20 years (!) in x86 land.

Note*: using AVX2/FMA3 256-bit SIMD processing.

Note**: using NEON 128-bit SIMD processing.
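
The bracketed deltas in the Multi-Media results can be re-derived from the raw scores. The sketch below (the helper name and dictionary layout are ours) computes the ratio and percentage difference of the Snapdragon 7c against the Core i5 6300 for each test; expect small differences from the published annotations due to rounding:

```python
# Multi-Media scores (Mpix/s) copied from the results above: (Core i5 6300, Snapdragon 7c).
SCORES = {
    "Int32": (192.0, 44.75),
    "Int64": (68.83, 33.16),
    "Int128": (11.93, 6.89),
    "FP32": (156.0, 107.0),
    "FP64": (89.89, 58.0),
    "FP128": (4.57, 2.26),
}


def relative(baseline: float, score: float) -> tuple:
    """Return (ratio, percent difference) of score versus baseline."""
    ratio = score / baseline
    return ratio, (ratio - 1.0) * 100.0


results = {test: relative(skl, s7c) for test, (skl, s7c) in SCORES.items()}
for test, (ratio, pct) in results.items():
    print(f"{test:6s} 7c = {ratio:.2f}x SKL ({pct:+.0f}%)")
```

A ratio is the clearer way to express large gaps (e.g. around 1/2x), while a percentage reads better for smaller ones.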

| Native Benchmarks | Microsoft SQ2 | Raspberry Pi 4B | Intel Core i5 6300 | Snapdragon 7c | Comments |
|---|---|---|---|---|---|
| BenchCrypt Crypto AES-256 (GB/s) | 0.79 | 0.323 | 8* | 0.41 | We are working on adding AES acceleration. |
| BenchCrypt Crypto AES-128 (GB/s) | 0.96 | 0.442 | 8.36* | | No change with AES128. |
| BenchCrypt Crypto SHA2-256 (GB/s) | 1.69** | 0.595** | 2*** | 0.65 | We need SHA acceleration here |
| BenchCrypt Crypto SHA1 (GB/s) | 3.1** | 0.996** | 4.02*** | | Less compute intensive SHA1. |
| BenchCrypt Crypto SHA2-512 (GB/s) | | | | | SHA2-512 is not accelerated by SHA HWA. |

7c naturally provides AES HWA (hardware acceleration) which we have not yet enabled in Sandra. The relatively large performance delta vs. the AES-enabled SKL Intel is thus expected and will be corrected.

7c also provides SHA HWA (hardware acceleration) – unlike SKL Intel – but this is again not yet enabled in Sandra. Multi-buffer NEON was only just released thus could not help the 7c here.

Note*: using AES HWA (hardware acceleration).

Note**: using NEON multi-buffer hashing.

Note***: using AVX2 multi-buffer hashing.

| Native Benchmarks | Microsoft SQ2 | Raspberry Pi 4B | Intel Core i5 6300 | Snapdragon 7c | Comments |
|---|---|---|---|---|---|
| CPU Multi-Core Benchmark Inter-Module (CCX) Latency (Same Package) (ns) | 45 | 85 | 42.3 | | Similar inter-core latency to Intel. |
| CPU Multi-Core Benchmark Total Inter-Thread Bandwidth – Best Pairing (GB/s) | 15** | 2.5** | 13.26* | | Similar overall bandwidth to Intel. |

With big and LITTLE core clusters, the inter-core latencies vary greatly between big-2-big, LITTLE-2-LITTLE and big-2-LITTLE cores. Here we present the latencies between the big cores, which are comparable with Intel’s designs. Judicious thread scheduling is needed so that data transfers are efficient.

As with latencies, inter-core bandwidths vary greatly between big-2-big, LITTLE-2-LITTLE and big-2-LITTLE cores; here we present the bandwidth between the big cores.

Note*: using AVX2 256-bit wide transfers.

Note**: using NEON 128-bit wide transfers.

| Native Benchmarks | Microsoft SQ2 | Raspberry Pi 4B | Intel Core i5 6300 | Snapdragon 7c | Comments |
|---|---|---|---|---|---|
| BenchFinance Black-Scholes float/FP32 (MOPT/s) | 101.61 | 24.68 | | | Black-Scholes is un-vectorised and compute heavy. |
| BenchFinance Black-Scholes double/FP64 (MOPT/s) | 82.42 | 20.53 | 41.06 | 31.26 [-23%] | Using FP64, SQ is 26% faster. |
| BenchFinance Binomial float/FP32 (kOPT/s) | 40.22 | 12.8 | | | Binomial uses thread shared data thus stresses the cache & memory system. |
| BenchFinance Binomial double/FP64 (kOPT/s) | 19.73 | 5.47 | 10.52 | 6.68 [-33%] | With FP64, SQ is still 26% faster. |
| BenchFinance Monte-Carlo float/FP32 (kOPT/s) | 36.84 | 8.53 | | | Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches. |
| BenchFinance Monte-Carlo double/FP64 (kOPT/s) | 18.95 | 3.48 | 14.9 | 5.55 [1/3x] | Switching to FP64, SQ is only 18% slower. |

With non-SIMD financial workloads, similar to what we’ve seen in legacy code (Dhrystone, Whetstone), 7c is about 30% slower than SKL – but we can see that Microsoft’s SQ with far more bandwidth is almost 3x (three times) faster! We can see how a powerful Arm64 design can make a difference.
| Native Benchmarks | Microsoft SQ2 | Raspberry Pi 4B | Intel Core i5 6300 | Snapdragon 7c | Comments |
|---|---|---|---|---|---|
| BenchScience SGEMM (GFLOPS) float/FP32 | 57.38* | 18.77* | | | In this tough vectorised algorithm 7c does OK |
| BenchScience DGEMM (GFLOPS) double/FP64 | 19.5* | 5.87* | 34.46** | 9.96* [1/3x] | With FP64 vectorised code, 7c is 1/3x SKL. |
| BenchScience SFFT (GFLOPS) float/FP32 | 2.01* | 0.657* | | | FFT is also heavily vectorised but memory dependent. |
| BenchScience DFFT (GFLOPS) double/FP64 | 1.41* | 0.505* | 4.94** | 1.62* [1/3x] | With FP64 code, 7c is 1/3x SKL. |
| BenchScience SN-BODY (GFLOPS) float/FP32 | 79.61* | 16.47 | | | N-Body simulation is vectorised but with more memory accesses. |
| BenchScience DN-BODY (GFLOPS) double/FP64 | 27.36* | 5.83 | 27.68** | 7.66* [1/3x] | With FP64, 7c is also 1/3x SKL. |

With highly vectorised SIMD code (scientific workloads), it is clear that a lot of work is needed to optimise code for ARM to get it to match x86/x64 in performance. In some algorithms (GEMM, N-BODY) it is doing well – but overall more optimisations need to be made.

Note*: using NEON 128-bit SIMD.

Note**: using AVX2/FMA3 256-bit SIMD.

| Native Benchmarks | Microsoft SQ2 | Raspberry Pi 4B | Intel Core i5 6300 | Snapdragon 7c | Comments |
|---|---|---|---|---|---|
| CPU Image Processing Blur (3×3) Filter (MPix/s) | 451** [-39%] | 165** | 427* | | In this vectorised integer workload SQ is 39% slower. |
| CPU Image Processing Sharpen (5×5) Filter (MPix/s) | 195** [-23%] | 56.81** | 169* | | Same algorithm but more shared data, SQ is 23% slower. |
| CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) | 109** [-18%] | 28.08** | 87.95* | | Again same algorithm but even more data, SQ is 18% slower. |
| CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) | 156** [-29%] | 39.71** | 146* | | Different algorithm but still a vectorised workload, SQ is 29% slower. |
| CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) | 9.43** [-54%] | 3** | 13.57* | | Still vectorised code, SQ is 1/2x slower. |
| CPU Image Processing Oil Painting Quantise Filter (MPix/s) | 9.31** [-22%] | 3.57** | 6.92* | | In this tough filter, SQ is 22% slower. |
| CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) | 583** [1/2x] | 174** | 713* | | With 64-bit integer workload, SQ is 1/2x slower. |
| CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) | 108** [-40%] | 35.26** | 110* | | In this final test (scatter/gather) SQ is 40% slower. |

We know these benchmarks *love* SIMD, with AVX2/AVX512 always performing strongly – thus Intel with 256-bit wide AVX2 has the advantage against SQ’s 128-bit NEON. In general the difference is not as high as 50% but rather 20-40% which is a good result.

Again, as with other compute-heavy algorithms – these days such algorithms would be offloaded to the Cloud or locally to the GP-GPU – thus the CPU does not need to be a SIMD speed demon.

Note*: using AVX2 256-bit SIMD.

Note**: using NEON 128-bit SIMD.

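| Native Benchmarks | Microsoft SQ2 | Raspberry Pi 4B | Intel Core i5 6300 | Snapdragon 7c | Comments |
|---|---|---|---|---|---|
| Aggregate Score (Points) | 1,846 | 360 | 1,760 | | Across all benchmarks, SQ is 27% slower. |
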
Overall, SQ is just 27% slower than Intel’s WHL – with the SIMD performance bringing the overall score down. However, let’s recall that non-SIMD performance is usually 2x higher than Intel’s. With optimisations we are confident the score will only improve – while Intel is pretty much fully optimised and unlikely to extract more performance.
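
| Native Benchmarks | Microsoft SQ2 | Raspberry Pi 4B | Intel Core i5 6300 | Snapdragon 7c | Comments |
|---|---|---|---|---|---|
| Price/RRP (USD) | $146 | $55 (whole SBC) | $281 | | SQ is 1/2 price of Intel’s CPUs. |
| Price Efficiency (Perf. vs. Cost) (Points/USD) | 12.64 | 6.55 | 6.26 | | SQ is 50% better value. |
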
Going by RRP SoC prices, SQ ends up 50% better value as, despite lower performance, it is half the price of Intel’s SoCs! This should translate into much cheaper tablets or better specs, which is something we all want!
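
The price-efficiency figures are simply the aggregate score divided by the RRP. A minimal sketch of that arithmetic, using the scores and prices quoted in this article (the names are ours, for illustration):

```python
# Aggregate score (points) and RRP (USD) as quoted in this article.
DEVICES = {
    "Microsoft SQ2": (1846, 146),
    "Raspberry Pi 4B": (360, 55),  # price is for the whole SBC
    "Intel Core i5 6300": (1760, 281),
}


def points_per_usd(points: float, price_usd: float) -> float:
    """Price efficiency: benchmark points obtained per dollar of RRP."""
    return points / price_usd


price_eff = {name: points_per_usd(pts, usd) for name, (pts, usd) in DEVICES.items()}
for name, value in price_eff.items():
    print(f"{name:20s} {value:.2f} points/USD")
```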
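
| Native Benchmarks | Microsoft SQ2 | Raspberry Pi 4B | Intel Core i5 6300 | Snapdragon 7c | Comments |
|---|---|---|---|---|---|
| Power/TDP (W) | 5-7W | 4W | 15-25W | 5-7W | Both 7c/SQ are low TDP |
| Power Efficiency (Perf. vs. Power) (Points/W) | 263 | 90 | 70.40 | | SQ is over 2.5x better power efficiency |
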
SQ’s power usage is so low (~5W) that it is at least 2.5x more power-efficient than Intel’s designs – and we’re even ignoring that in Turbo Intel consumes even more, thus the power efficiency is stellar.
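
Power efficiency is the aggregate score divided by TDP. The sketch below reproduces the quoted figures under one assumption of ours: that the upper bound of each TDP range (7W, 4W, 25W) was used, which is what the published numbers appear to match. The names are illustrative only:

```python
# Aggregate score (points) and TDP range in watts (low, high) as quoted in this article.
TDP = {
    "Microsoft SQ2": (1846, 5, 7),
    "Raspberry Pi 4B": (360, 4, 4),
    "Intel Core i5 6300": (1760, 15, 25),
}


def points_per_watt(points: float, watts: float) -> float:
    """Power efficiency: benchmark points per watt of TDP."""
    return points / watts


# Assumption: use the upper TDP bound for every device.
power_eff = {name: points_per_watt(pts, high) for name, (pts, low, high) in TDP.items()}
sq_vs_intel = power_eff["Microsoft SQ2"] / power_eff["Intel Core i5 6300"]

for name, value in power_eff.items():
    print(f"{name:20s} {value:.1f} points/W")
print(f"SQ2 vs. Intel: {sq_vs_intel:.1f}x")
```

Using the lower bound instead changes the absolute values but not the ordering of the three devices.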

Is x86 Dead?

For tablet/mobile – it seems so… Apple knew something when they abandoned x86 for Arm…

ARM is moving very quickly, unlike x86 which, despite introducing new technologies (e.g. hybrid aka big/LITTLE), seems to go one step forward, one step back (AVX512 killed). The 7c with just 2 big cores is a bit lacking – but Microsoft’s SQ in Surface Pro X shows what a relatively modern Arm64 SoC can do at far less power – and using the relatively old Cortex A76 cores, not the very latest X1/X2 ARMv9 designs.

The only weakness – SIMD vectorised performance – is not that far behind (50% aka 1/2x) and is already being addressed by SVE/SVE2 in the newest cores. SVE offers even wider SIMD registers (up to 2048-bit) – while Intel has recently canned AVX512 (512-bit) and is left with old AVX2/FMA3 (256-bit). Still, software for tablet/mobile won’t be using compute-heavy SIMD workloads anyway – with such workloads either offloaded to the Cloud or perhaps the integrated GP-GPU (Adreno here). But for performance desktops?

The power (TDP) difference is so stark (5W vs. 15-25W on Intel) that it should be possible to double the number of Arm64 cores (e.g. 8 big + 8 LITTLE) and still consume less power (e.g. 10-15W) than the competition. We really need a SoC like Apple’s M1 (or better) that can rival higher compute power devices.

Final Thoughts / Conclusions

Qualcomm Snapdragon 7c: OK for a cheap Arm64 tablet: 7/10

The Snapdragon 7c – despite being just 2 years old (2020) is already obsolete – with many more powerful versions coming out all the time. With just 2 big cores, it was always going to be a low-end SoC – but it is still a big step-up if you are used to a Raspberry Pi 4B 😉

There are quite a few low-cost Arm64 Windows laptops/tablets from OEMs (Samsung Galaxy Book, Lenovo, ECS, etc.) that are similar in cost to Intel Atom devices and a bit more expensive than Chromebooks. There is also Microsoft with the Surface Pro X, which has gone for the high(er)-end 8cx version.

Thus the 7c can only keep up with older x86 Core 2C/4T systems or x86 Atom 4C systems – but can still be a choice considering that Windows 11 does not support many older but perfectly capable x86 Intel systems (e.g. Core 6th gen and older – aka Skylake). In addition Intel itself has canned driver support for these systems – thus users will be stuck with Windows 10 and older drivers.

The WoW (Windows-on-Windows) emulation of non-native applications (x86/x64) is somewhat more limited than, say, Apple’s. It is about 50% slower (1/2x) than native Arm64 and only supports 32-bit x86 applications, while most software today is provided as 64-bit x64. Microsoft really needs to improve this and quickly…

The power usage is so low that the power efficiency (performance vs. power) is much higher than the competition and thus allows for far longer battery life – which on a tablet is likely much more important than raw power. For low-level compute tasks (browsing, word processing, some spreadsheets, media consumption, etc.) – the 7c may just be sufficient.
