Windows Arm64 on Qualcomm Snapdragon 7c Performance

What is “Qualcomm’s Snapdragon”?

It is a family of ARM processors/SoCs – analogous to Intel’s “Core” in the x86 world. It now spans 8 series going back years and is used in everything from laptops and tablets to phones and even smart-watches!

Unlike other ARM SoC manufacturers (Samsung, Huawei, MediaTek, etc.), Qualcomm has an ARM design licence (like Apple) and modifies the standard Cortex ARM designs and re-brands them “Kryo”. In big/LITTLE (DynamIQ) designs – the big cores are branded “Gold” and LITTLE cores “Silver”. In the latest designs with three core types – the top-end cores are branded “Prime”.

Qualcomm has designed custom “Mobile PC/Compute Platform” SoCs for Arm64 Windows 10/11 – that include the 7c series reviewed here and the updated 8cx (also known as Microsoft SQ). Older versions powered previous versions of Windows RT as well as Windows Phone, Windows CE, etc.

Using this ARM SoC family has allowed OEMs to launch lower cost laptops and tablets – e.g. Samsung Galaxy Book, Lenovo, ECS (Elitegroup Computer Systems), etc. – that are similar in cost to Intel’s Atom devices but include features like 4G/LTE that are not normally found in this price range. While somewhat more expensive than Chromebooks – they run the full version of Windows 10/11 (even Pro/Enterprise) and can run 32-bit x86 applications under emulation (WoW – Windows on Windows).

Microsoft has used the high-end version 8cx (rebranded SQ) to launch an alternative Surface Pro X line, at slightly lower cost than the x86 Surface Pro line but also including 4G/LTE as standard.

What is “Windows Arm64”?

It is the 64-bit version of client Windows 10/11 for Arm64 (AArch64) devices – analogous to the current x64 Windows 10/11 for Intel & AMD CPUs. While “desktop” Windows 8.x has been available for ARM (AArch32) as Windows RT – it did not allow running of non-Microsoft native Win32 applications (like Sandra) and also did not support emulation for current x86/x64 applications.

We should also recall that Windows Phone (though now dead) has always run on ARM devices (and other architectures) and was first to unify the Windows desktop and Windows CE kernels. Windows was first ported to 64-bit with the long-dead Alpha64 (Windows NT 4), then Itanium IA64 (Windows XP 64), x64 (AMD64/EM64T) and now Arm64 – which shows the versatility of the NT micro-kernel.

By contrast, Windows 10/11 Arm64 is able to run native AArch64 applications (compiled for Arm64 like Sandra) as well as emulating 32-bit x86 and ARM (AArch32) applications through “WoW” (Windows on Windows emulation). But *not* x64 applications! Ouch!
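To tell whether a given process is running natively or under WoW emulation, one rough approach is to compare the architecture the process sees with the architecture of the host. The sketch below assumes the standard Windows environment variables `PROCESSOR_ARCHITECTURE` and `PROCESSOR_ARCHITEW6432` (the latter normally only set for a 32-bit process under WoW); the helper name `describe_process` is hypothetical, and this is not how Sandra itself detects emulation.

```python
import os
import platform


def describe_process(env=None):
    """Return (process_arch, native_arch, emulated) from Windows-style env vars.

    Assumption: PROCESSOR_ARCHITEW6432 is present only when a 32-bit process
    runs under WoW and then names the host architecture (e.g. ARM64).
    """
    env = os.environ if env is None else env
    process_arch = env.get("PROCESSOR_ARCHITECTURE", platform.machine())
    native_arch = env.get("PROCESSOR_ARCHITEW6432", process_arch)
    emulated = process_arch.upper() != native_arch.upper()
    return process_arch, native_arch, emulated


if __name__ == "__main__":
    proc, native, emulated = describe_process()
    print(f"process={proc} native={native} emulated={emulated}")
```

Passing a plain dictionary instead of the real environment makes it easy to try out the cases discussed above, e.g. a 32-bit x86 application on an Arm64 host.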

While Arm64 Windows 10/11 includes in-box drivers for many peripherals/devices – it may not have a driver for a very new peripheral/device; some manufacturers may not (even) provide any support for Arm64. For some devices, standard in-box “class” drivers (e.g. NVMe controller/SSD, AHCI controller/SSD, USB controller, keyboard, mouse, etc.) do work but otherwise a custom driver is required. [Note there has never been any emulation for kernel drivers, in any version of Windows]

Qualcomm 7c SoC Details

Qualcomm Snapdragon SC7180
ARMv8.2-A (Arm64) 64-bit cores
2x Kryo 468 Gold “big-cores” (based on Cortex-A76) @ 2.4GHz
- Arm64 v8.2 4-wide OoO design
- 64kB + 64kB L1, 128-512kB L2 per core
- ~25-33% performance increase over Cortex A75
  - Cortex A75 provided ~20% performance increase over Cortex A73
  - Cortex A73 provided ~30% performance increase over Cortex A72 (e.g. Raspberry Pi 4B cores)
6x Kryo 468 Silver “little-cores” (based on Cortex-A55) @ 1.8GHz
- Arm64 v8.2 2-wide in-order
- 32kB + 32kB L1, 64kB-256kB L2 per core
- ~15% power efficiency, ~18% performance increase over Cortex A53 (e.g. Raspberry Pi 3B(+) cores)
8nm process (Samsung)
TDP ~7W
Unified 1MB L3 / LLC cache
System Cache 1MB (also used by GP-GPU)
Memory LP-DDR4X 2x 2133 (4266) 2x 16-bit ~17GB/s
Qualcomm Adreno 618 GP-GPU
- 1CU 256SP 700MHz core clock
- DirectX 12, 11 support
- No OpenCL support in the driver. Boo!
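
The generational uplifts quoted for the big cores compound, so it can help to multiply them through to see roughly where a Cortex-A76 sits relative to the Cortex-A72 in the Pi 4B. The sketch below is only arithmetic on the percentages listed above; it assumes they can simply be chained, which ignores differences in clocks, caches and memory.

```python
def compound(uplifts):
    """Multiply a chain of fractional uplifts, e.g. 0.30 for +30%."""
    total = 1.0
    for uplift in uplifts:
        total *= 1.0 + uplift
    return total


# A72 -> A73 (+30%), A73 -> A75 (+20%), A75 -> A76 (+25% to +33%)
low = compound([0.30, 0.20, 0.25])
high = compound([0.30, 0.20, 0.33])
print(f"Cortex-A76 vs Cortex-A72: {low:.2f}x to {high:.2f}x")  # 1.95x to 2.07x
```

In other words, taken at face value the quoted figures put an A76 core at roughly twice an A72 core.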

Compared to Intel SoCs, we have more “real” cores (8) but the same number of threads (8T) as ARM has not implemented SMT (Hyperthreading) as Intel and AMD have. Otherwise, we have a pretty modern process (8nm), decent clocks (in excess of 2.4GHz for big cores), but small L3/LLC cache (1MB) and low-bandwidth LP-DDR4X memory (just 2 channels of 16-bit).
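
The ~17GB/s memory figure follows directly from the transfer rate and bus width in the specification list. A minimal sketch of that arithmetic (the function name is hypothetical, and this is theoretical peak bandwidth, not a measured value):

```python
def peak_bandwidth_gbs(transfers_mts, channels, bus_bits):
    """Theoretical peak bandwidth in GB/s (10^9 bytes per second)."""
    bytes_per_transfer = channels * bus_bits / 8
    return transfers_mts * 1e6 * bytes_per_transfer / 1e9


# LP-DDR4X at 4266 MT/s over 2 channels of 16 bits each
bandwidth = peak_bandwidth_gbs(4266, 2, 16)
print(f"{bandwidth:.3f} GB/s")  # 17.064 GB/s
```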

ARM Advanced Instructions Support

SIMD: As in the x86 world, ARM supports SIMD instructions called “NEON” operating on 128-bit width registers – equivalent to SSE2. However, there are 32 of them – while x64 SSE/AVX only provide 16 – until AVX512 which also provides 32. In most algorithms we can use them in batches of 4 – effectively making them 512-bit!

Unlike newer cores, the older A7X cores do not support SVE (“Scalable Vector eXtensions” – the successor of NEON); current SVE implementations are still just 128-bit wide, although they do provide more flexibility, especially when implementing complex algorithms.

Crypto: Similar to x86, ARM does provide hardware-accelerated (HWA) encryption/decryption (AES, SM4) as well as hashing (SHA1, SHA2, SHA3, SM3). Unlike, say, the Raspberry Pi SoCs, Qualcomm naturally includes crypto extensions. Yay!

Virtualisation: The ARM cores do have hardware virtualisation – and here Qualcomm has the necessary firmware functionality to enable Hyper-V on Windows 11 to run corresponding ARM/Arm64 virtual machines. Or perhaps emulate x86 through QEMU? Both VMware and Proxmox include support should you wish to use a different virtualisation technology. Yay!

Security Extensions: ARM cores since ARMv6 (!) have included TrustZone secure virtualisation. AMD’s own recent CPUs all contain an ARM core (Cortex A5?) supporting TrustZone handling the security functionality (e.g. PSP / firmware-emulated TPM). See our article “Crypto-processor (TPM) Benchmarking: Discrete vs. internal AMD, Intel, Microsoft HV”.

TPM: in order to support Windows 11, Qualcomm does include TPM support. Thus you can use BitLocker disk encryption as well as other security-based virtualisation features (like Core Isolation / Memory Integrity). Yay!

Changes in Sandra to support ARM

As a relatively old piece of software (launched all the way back in 1997, yep we’re really old), Sandra contains a good amount of legacy but optimised code, with the benchmarks generally written in assembler (MASM, NASM and previously YASM) for x86/x64 using various SIMD instruction sets: SSE2, SSSE3, SSE4.1/4.2, AVX/FMA3, AVX2 and finally AVX512. All this had to be translated into more generic C/C++ code using a templated intrinsics implementation for both x86/x64 and ARM/Arm64.

As a result, some of the historic benchmarks in Sandra have substantially changed – with the scores changing to some extent. This cannot be helped as it ensures benchmarking fairness between x86/x64 and ARM/Arm64 going forward.

For this reason we recommend using the very latest version of Sandra and keeping up with updated versions that likely fix bugs and improve performance and stability.

CPU core performance Benchmarking

In this article we test CPU core performance; please see our other articles on:

CPU/SoC

Hardware Specifications

We are comparing the Arm64 processors with x86/x64 processors of similar vintage – all running current Windows 10/11, latest drivers.

Specifications	Microsoft SQ2	Raspberry Pi 4B	Intel Core i5 6300	Snapdragon 7c	Comments
Arch(itecture)	Kryo 495 Gold (Cortex A76) + Kryo 495 Silver (Cortex A55) 7nm	Cortex A72 16nm	Skylake-ULV (Gen6)	Kryo 468 Gold (Cortex A76) + Kryo 468 Silver (Cortex A55) 8nm	big+LITTLE cores vs. SMT
Launch Date	Q3 2020	2019	Q3 2015	2020	Older design but still recent
Cores (CU) / Threads (SP)	4C + 4c (8T)	4C / 4T	2C / 4T	2C + 6c / 8T	Same number of threads, but fewer big cores
Rated Speed (GHz)	1.8	1.5	2.4	1.8	Similar base clock
All/Single Turbo Speed (GHz)	3.15	2.0	3.0	2.4	Turbo could be higher
Rated/Turbo Power (W)	~5-7	~4-5	15-25	~7	Much lower TDP, 1/3x Intel
L1D / L1I Caches	4x 64kB \| 4x 32kB	4x 32kB 2-way \| 4x 48kB 3-way	2x 32kB \| 2x 32kB	2x 64kB \| 4x 32kB	Similar L1 caches
L2 Caches	4x 512kB \| 4x 128kB	1MB 16-way	2x 256kB	2x 256kB \| 4x 128kB	L2 could be higher
L3 Cache(s)	4MB	n/a	3MB	1MB	Small L3
Special Instruction Sets	v8.2-A, VFP4, AES, SHA, TZ, Neon	v8-A, VFP4, TZ, Neon	AVX2/FMA, AES, VT-d/x	v8.2-A, VFP4, AES, SHA, TZ, Neon	All 8.2 instructions
SIMD Width / Units	128-bit	128-bit	256-bit	128-bit	Neon is not wide enough
Price / RRP (USD)	$146	$55 (whole SBC)	$281	$99	Pi price is for whole SBC! (including memory)

Disclaimer

This is an independent review (critical appraisal) that has not been endorsed nor sponsored by any entity (e.g. Qualcomm, etc.). All trademarks acknowledged and used for identification only under fair use.

And please, don’t forget small ISVs like ourselves in these very challenging times. Please buy a copy of Sandra if you find our software useful. Your custom means everything to us!

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets, on both x64 and Arm64 platforms.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64/Arm64, latest drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations where supported.

Native Benchmarks		Microsoft SQ2 (4C 3.15GHz + 4c 1.8GHz) Arm64 Native	Raspberry Pi 4B (4C 2GHz) Arm64 Native	Intel Core i5 6300 (2C/4T 2.4-3GHz) x64 Native	Snapdragon 7c (2C 2.4GHz + 6c 1.8GHz) Arm64 Native	Comments

	Native Dhrystone Integer (GIPS)	100	32.6	45.57**		SQ is about 2.2x faster than the i5 ULV!
	Native Dhrystone Long (GIPS)	102	33.42	54.52**		A 64-bit integer workload is about 1.9x faster!
	Native FP32 (Float) Whetstone (GFLOPS)	87.86	25.1	39.05*		With floating-point, SQ is over 2.2x faster!
	Native FP64 (Double) Whetstone (GFLOPS)	79	23.76	32.17*		With FP64 SQ is almost 2.5x faster.
In these standard legacy tests, 7c does well but cannot keep up with the newer 8cx, which dominates everything. But it still beats the old SKL 2C/4T Core, which is no mean feat – considering that the SKL core did not evolve much for many years. It is a good showing for Microsoft/Qualcomm and Arm64 – this means native “standard” applications (non-vectorised workloads) perform well on ARM despite the huge TDP difference (5W vs 15-25W). For mobile (phone, laptop, tablet) ARM seems unbeatable. Note*: using SSE2-3 SIMD processing. Note**: using AVX2 processing.

	Native Integer (Int32) Multi-Media (Mpix/s)	95.12**	36.6**	192*	44.75** [1/5x]	7c is 1/5x SKL.
	Native Long (Int64) Multi-Media (Mpix/s)	60.99**	24**	68.83*	33.16** [1/2x]	With a 64-bit, 7c is 1/2x SKL.
	Native Quad-Int (Int128) Multi-Media (Mpix/s)	10.85**	3.28**	11.93*	6.89** [-43%]	Using 64-bit int to emulate Int128, 7c is 43% slower.
	Native Float/FP32 Multi-Media (Mpix/s)	187**	52.5**	156*	107** [-32%]	In this FP32 vectorised 7c is just 32% slower
	Native Double/FP64 Multi-Media (Mpix/s)	106**	29.73**	89.89*	58** [-36%]	Switching to FP64 7c is 36% slower.
	Native Quad-Float/FP128 Multi-Media (Mpix/s)	4.84**	1.31**	4.57*	2.26** [1/2x]	Using FP64 to mantissa extend FP128, 7c is 1/2x SKL.
With heavily vectorised SIMD workloads – 7c cannot really keep up with SKL nor SQ, but at least it is about 2x faster than a Pi 4B, which can still run Windows 10/11 decently. Floating-point performance is especially encouraging at about 30% lower than SKL for far less money and power. Let’s also note that Sandra itself has only recently been ported to Arm64 and is thus not as well optimised as after over 20 years (!) in x86 land. Note*: using AVX2/FMA3 256-bit SIMD processing. Note**: using NEON 128-bit SIMD processing.
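
The bracketed deltas in the 7c column are simple ratios against the SKL score. As a rough illustration of how such labels can be derived, here is a minimal sketch; the helper names and the choice to switch to a “1/Nx” label at half the baseline or below are assumptions, and results may differ from the published brackets by a point since the scores shown are rounded.

```python
def percent_delta(value, baseline):
    """Signed percentage difference of value relative to baseline."""
    return (value / baseline - 1.0) * 100.0


def delta_label(value, baseline):
    """Format a delta as '1/Nx' for large gaps, otherwise as a percentage."""
    ratio = value / baseline
    if ratio <= 0.5:
        return f"1/{round(1.0 / ratio)}x"
    return f"{round(percent_delta(value, baseline)):+d}%"


# 7c vs SKL, scores taken from the table above
print(delta_label(107, 156))      # FP32 Multi-Media
print(delta_label(33.16, 68.83))  # Int64 Multi-Media
print(delta_label(2.26, 4.57))    # FP128 Multi-Media
```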

	Crypto AES-256 (GB/s)	0.79	0.323	8*	0.41	We are working on adding AES acceleration.
	Crypto AES-128 (GB/s)	0.96	0.442	8.36*		No change with AES128.
	Crypto SHA2-256 (GB/s)	1.69**	0.595**	2***	0.65	We need SHA acceleration here
	Crypto SHA1 (GB/s)	3.1**	0.996**	4.02***		Less compute intensive SHA1.
	Crypto SHA2-512 (GB/s)					SHA2-512 is not accelerated by SHA HWA.
7c naturally provides AES HWA (hardware acceleration) which we have not yet enabled in Sandra. The relatively large performance delta vs. AES-enabled SKL Intel is thus expected and will be corrected. 7c also provides SHA HWA (hardware acceleration) – unlike SKL Intel – but this is again not yet enabled in Sandra. Multi-buffer NEON was only just released and thus could not help the 7c here. Note*: using AES HWA (hardware acceleration). Note**: using NEON multi-buffer hashing. Note***: using AVX2 multi-buffer hashing.
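
For a quick sanity check of hashing throughput on your own device, a rough single-threaded measurement can be made with Python’s standard `hashlib`. This is only a sketch and not Sandra’s multi-buffer method; whether the underlying library uses the SHA hardware instructions depends on how it was built, and the buffer size and repeat count are arbitrary assumptions.

```python
import hashlib
import time


def sha256_throughput_gbs(buffer_mb=8, repeats=4):
    """Hash a zero-filled buffer several times and return GB/s (10^9 bytes/s)."""
    data = bytes(buffer_mb * 1024 * 1024)
    start = time.perf_counter()
    for _ in range(repeats):
        hashlib.sha256(data).digest()
    elapsed = time.perf_counter() - start
    return len(data) * repeats / elapsed / 1e9


if __name__ == "__main__":
    print(f"SHA2-256: {sha256_throughput_gbs():.2f} GB/s (single thread)")
```

Since it uses one thread only, the result is not directly comparable with the multi-threaded figures in the table.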

	Inter-Module (CCX) Latency (Same Package) (ns)	45	85	42.3		Similar inter-core latency to Intel.
With big and LITTLE core clusters, the inter-core latencies vary greatly between big-2-big, LITTLE-2-LITTLE and big-2-LITTLE cores. Here we present the inter big-cores latencies that are comparable with Intel’s designs. Judicious thread scheduling is needed so that data transfers are efficient.

	Total Inter-Thread Bandwidth – Best Pairing (GB/s)	15**	2.5**	13.26*		Similar overall bandwidth to Intel.
As with latencies, inter-core bandwidths vary greatly between big-2-big, LITTLE-2-LITTLE and big-2-LITTLE cores; here we present the bandwidth between the big cores. Judicious thread scheduling is needed so that data transfers are efficient. Note*: using AVX2 256-bit wide transfers. Note**: using NEON 128-bit wide transfers.

	Black-Scholes float/FP32 (MOPT/s)	101.61	24.68			Black-scholes is un-vectorised and compute heavy.
	Black-Scholes double/FP64 (MOPT/s)	82.42	20.53	41.06	31.26 [-23%]	Using FP64, 7c is 23% slower than SKL.
	Binomial float/FP32 (kOPT/s)	40.22	12.8			Binomial uses thread shared data thus stresses the cache & memory system.
	Binomial double/FP64 (kOPT/s)	19.73	5.47	10.52	6.68 [-33%]	With FP64, 7c is about a third slower than SKL.
	Monte-Carlo float/FP32 (kOPT/s)	36.84	8.53			Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches.
	Monte-Carlo double/FP64 (kOPT/s)	18.95	3.48	14.9	5.55 [1/3x]	Switching to FP64, 7c is about 1/3x SKL.
With non-SIMD financial workloads, similar to what we’ve seen in legacy code (Dhrystone, Whetstone), 7c is about 30% slower than SKL – but we can see that Microsoft’s SQ with far more bandwidth is almost 3x (three times) faster! We can see how a powerful Arm64 design can make a difference.
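
For readers unfamiliar with the workload, the Black-Scholes test prices options with scalar floating-point maths (logarithm, exponential, square root and the normal distribution). The sketch below shows the textbook closed-form price of a European call; it is an illustration of the kind of arithmetic involved, not Sandra’s implementation, and the parameter names are assumptions.

```python
import math


def norm_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(spot, strike, years, rate, vol):
    """Closed-form Black-Scholes price of a European call option."""
    sqrt_t = math.sqrt(years)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * years) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    return spot * norm_cdf(d1) - strike * math.exp(-rate * years) * norm_cdf(d2)


# At-the-money call, 1 year, 5% rate, 20% volatility
print(f"{black_scholes_call(100.0, 100.0, 1.0, 0.05, 0.20):.4f}")  # 10.4506
```

Each option is independent of the others, which is why throughput (options per second) scales with core count and per-core floating-point speed rather than SIMD width when the code is not vectorised.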

	SGEMM (GFLOPS) float/FP32	57.38*	18.77*			In this tough vectorised algorithm 7c does ok
	DGEMM (GFLOPS) double/FP64	19.5*	5.87*	34.46**	9.96* [1/3x]	With FP64 vectorised code, 7c is 1/3x SKL.
	SFFT (GFLOPS) float/FP32	2.01*	0.657*			FFT is also heavily vectorised but memory dependent.
	DFFT (GFLOPS) double/FP64	1.41*	0.505*	4.94**	1.62* [1/3x]	With FP64 code, 7c is 1/3x SKL.
	SN-BODY (GFLOPS) float/FP32	79.61*	16.47			N-Body simulation is vectorised but with more memory accesses.
	DN-BODY (GFLOPS) double/FP64	27.36*	5.83	27.68**	7.66* [1/3x]	With FP64 7c is 1/3x also
With highly vectorised SIMD code (scientific workloads), it is clear that a lot of work is needed to optimise code for ARM to get it to match x86/x64 in performance. In some algorithms (GEMM, N-BODY) it is doing well – but overall more optimisations need to be made. Note*: using NEON 128-bit SIMD. Note**: using AVX2/FMA3 256-bit SIMD.
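
As a reminder of what the GFLOPS figures mean for GEMM: multiplying an m×k matrix by a k×n matrix takes about 2·m·n·k floating-point operations (one multiply and one add per inner step), and the score is that count divided by the elapsed time. A minimal, deliberately naive sketch follows; it is pure Python and therefore orders of magnitude slower than vectorised native code, and the function names are hypothetical.

```python
import time


def matmul(a, b):
    """Naive matrix multiply of two lists-of-lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]


def gemm_flops(m, n, k):
    """Approximate floating-point operation count for an m×k by k×n product."""
    return 2 * m * n * k


def measure_gflops(n=64):
    """Time one n×n product and convert the operation count to GFLOPS."""
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    start = time.perf_counter()
    matmul(a, b)
    elapsed = time.perf_counter() - start
    return gemm_flops(n, n, n) / elapsed / 1e9


if __name__ == "__main__":
    print(f"naive GEMM: {measure_gflops():.4f} GFLOPS")
```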

	Blur (3×3) Filter (MPix/s)	451** [-39%]	165**	427*		In this vectorised integer workload SQ is 39% slower.
	Sharpen (5×5) Filter (MPix/s)	195** [-23%]	56.81**	169*		Same algorithm but more shared data SQ is 23% slower.
	Motion-Blur (7×7) Filter (MPix/s)	109** [-18%]	28.08**	87.95*		Again same algorithm but even more data, SQ is 18% slower.
	Edge Detection (2*5×5) Sobel Filter (MPix/s)	156** [-29%]	39.71**	146*		Different algorithm but still vectorised workload SQ is 29% slower.
	Noise Removal (5×5) Median Filter (MPix/s)	9.43** [-54%]	3**	13.57*		Still vectorised code SQ is 1/2x slower.
	Oil Painting Quantise Filter (MPix/s)	9.31** [-22%]	3.57**	6.92*		In this tough filter, SQ is 22% slower.
	Diffusion Randomise (XorShift) Filter (MPix/s)	583** [1/2x]	174**	713*		With 64-bit integer workload, SQ is 1/2x slower.
	Marbling Perlin Noise 2D Filter (MPix/s)	108** [-40%]	35.26**	110*		In this final test (scatter/gather) SQ is 40% slower.
We know these benchmarks love SIMD, with AVX2/AVX512 always performing strongly – thus Intel with 256-bit wide AVX2 has the advantage against SQ’s 128-bit NEON. In general the difference is not as high as 50% but rather 20-40%, which is a good result. Again, as with other compute-heavy algorithms – these days such algorithms would be offloaded to the Cloud or locally to the GP-GPU – thus the CPU does not need to be a SIMD speed demon. Note*: using AVX2 256-bit SIMD. Note**: using NEON 128-bit SIMD.

	Aggregate Score (Points)	1,846	360	1,760		Across all benchmarks, SQ is about 5% faster.
Overall, SQ is about 5% faster than Intel’s SKL – with the SIMD performance holding the overall score back. However, let’s recall that non-SIMD performance is usually 2x higher than Intel’s. With optimisations we are confident the score will only improve – while Intel is pretty much fully optimised and unlikely to extract more performance.

	Price/RRP (USD)	$146	$55 (whole SBC)	$281		SQ is 1/2 price of Intel’s CPUs.

	Price Efficiency (Perf. vs. Cost) (Points/USD)	12.64	6.55	6.26		SQ is about 2x better value.
Going by RRP SoC prices, SQ ends up about 2x better value, as it delivers a similar score at about half the price of Intel’s SoCs! This should translate into much cheaper tablets or better specs, which is something we all want!
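
The price efficiency row is simply the aggregate score divided by the RRP. A short sketch reproducing the figures from the scores and prices listed above (the dictionary layout is just for illustration):

```python
# (aggregate score in points, RRP in USD), taken from the rows above
devices = {
    "Microsoft SQ2": (1846, 146),
    "Raspberry Pi 4B": (360, 55),
    "Intel Core i5 6300": (1760, 281),
}


def points_per_usd(score, price_usd):
    """Price efficiency: benchmark points per US dollar."""
    return score / price_usd


value = {name: round(points_per_usd(score, price), 2)
         for name, (score, price) in devices.items()}
for name, ratio in value.items():
    print(f"{name}: {ratio} points/USD")
```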

	Power/TDP (W)	5-7W	4W	15-25W	5-7W	Both 7c/SQ are low TDP

	Power Efficiency (Perf. vs. Power) (Points/W)	263	90	70.40		SQ is over 2.5x more power efficient.
SQ’s power usage is so low (~5W) that it is at least 2.5x more power efficient than Intel’s designs – and we’re even ignoring that in Turbo Intel consumes even more – thus the power efficiency is stellar.
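
Since TDP is quoted as a range, power efficiency is really a range too. The sketch below divides the aggregate score by both ends of each TDP range from the rows above; the figures in the table appear to correspond to the upper TDP value, which is an inference from the numbers rather than a documented method.

```python
def efficiency_range(score, tdp_min_w, tdp_max_w):
    """Return (worst, best) points per watt across a TDP range."""
    return score / tdp_max_w, score / tdp_min_w


# (aggregate score, minimum TDP in W, maximum TDP in W)
chips = {
    "Microsoft SQ2": (1846, 5, 7),
    "Raspberry Pi 4B": (360, 4, 4),
    "Intel Core i5 6300": (1760, 15, 25),
}

for name, (score, tdp_min, tdp_max) in chips.items():
    worst, best = efficiency_range(score, tdp_min, tdp_max)
    print(f"{name}: {worst:.1f} to {best:.1f} points/W")
```

Comparing like with like (both at their upper TDP), the SQ2 comes out well over 2.5x ahead of the Intel part.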

SiSoftware Official Ranker Scores

Is x86 Dead?

For tablet/mobile – it seems so… Apple knew something when they abandoned x86 for Arm…

ARM is moving very quickly, unlike x86, which despite introducing new technologies (e.g. hybrid aka big/LITTLE) seems to go one step forward, one step back (AVX512 killed). The 7c with just 2 big cores is a bit lacking – but Microsoft’s SQ in Surface Pro X shows what a relatively modern Arm64 SoC can do at far less power – and using the relatively old Cortex A76 cores, not the very latest X1/X2 ARMv9 designs.

The only weakness – SIMD vectorised performance – is not that far behind (50% aka 1/2x) and is already being addressed by SVE/SVE2 in the newest cores. SVE offers even wider SIMD registers (up to 2048-bit) – while Intel has recently canned AVX512 (512-bit) and is left with old AVX2/FMA3 (256-bit). Still, software for tablet/mobile won’t be using compute-heavy SIMD workloads anyway – with such workloads either offloaded to the Cloud or perhaps the integrated GP-GPU (Adreno here). But for performance desktops?

The power (TDP) difference is so stark (5W vs. 15-25W on Intel) that it should be possible to double the number of Arm64 cores (e.g. 8 big + 8 LITTLE) and still consume less power (e.g. 10-15W) than the competition. We really need a SoC like Apple’s M1 (or better) that can rival higher compute power devices.

Final Thoughts / Conclusions

Qualcomm Snapdragon 7c: OK for a cheap Arm64 tablet: 7/10

The Snapdragon 7c – despite being just 2 years old (2020) – is already obsolete, with many more powerful versions coming out all the time. With just 2 big cores, it was always going to be a low-end SoC – but it is still a big step-up if you are used to a Raspberry Pi 4B 😉

There are quite a few low-cost Arm64 Windows laptops/tablets from OEMs (Samsung Galaxy Book, Lenovo, ECS, etc.) that are similar in cost to Intel Atom devices and a bit more expensive than Chromebooks. There is also Microsoft with Surface Pro X – which has gone for the high(er) end 8cx version.

Thus the 7c can only keep up with older x86 Core 2C/4T systems or x86 Atom 4C systems – but can still be a choice considering that Windows 11 does not support many older but perfectly capable x86 Intel systems (e.g. Core 6th gen and older – aka Skylake). In addition Intel itself has canned driver support for these systems – thus users will be stuck with Windows 10 and older drivers.

The WoW (Windows-on-Windows) emulation of non-native applications (x86/x64) is somewhat more limited than, say, Apple’s. It is about 50% slower (1/2x) than native Arm64 and only supports 32-bit x86 applications, while most software today is provided as 64-bit x64. Microsoft really needs to improve this and quickly…

The power usage is so low that the power efficiency (performance vs. power) is much higher than the competition and thus allows for far longer battery life – which on a tablet is likely much more important than raw power. For low-level compute tasks (browsing, word processing, some spreadsheets, media consumption, etc.) – the 7c may just be sufficient.

Further Articles

Please see our other articles on:

CPU/SoC
