Intel ARC (“Alchemist”) A380 GP-GPU Graphics – OpenCL Performance

What is “Alchemist”?

It is the 2nd generation of Intel discrete graphics (DG2), based on the Xe architecture that originally powered the integrated graphics of mobile (Xe-LP) processors from 11th-generation Core (“TigerLake” TGL) onwards. Intel also launched a limited desktop DG1 product as well as a limited server/workstation Xe-HP compute version.

With DG2, Intel is back with a full stack covering the low-end, mid-range and high-end as well as workstation/server (data as per Tom’s Hardware, “Intel Arc Alchemist: Release Date, Specs, Everything We Know”):

Up to 512CU / 4,096 SP
- A300 series 128CU / 1,024SP
- A500 series 384CU / 3,072SP
- A700 series 512CU / 4,096SP
Up to 16GB GDDR6
- A300? series 6GB GDDR6 96-bit 14Gbit/s ~192GB/s
- A500? series 12GB GDDR6 192-bit 16Gbit/s ~384GB/s
- A700? series 16GB GDDR6 256-bit 16Gbit/s ~512GB/s
7nm TSMC process
OpenCL 3.0, DirectX 12 Ultimate support
FP16 half-floating point support
NO FP64 double floating-point support (!)
Tensors (“Matrix Engines” aka XMX) for Deep Learning
- FP16 and Int8 support (but no FP32/FP64 support)
- OpenCL extension – will need to be coded specifically to use it

The specifications look good and are competitive, although we seem to be missing native FP64 support, just as on the previous mobile/desktop Xe. We do gain tensors (matrix multipliers) and support for FP16/Int8 that should help low-precision neural networks. For high-precision workloads (FP64), it seems AMD is your only option, unless paying nVidia for professional graphics…
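The memory bandwidth figures in the list above follow from the bus width and the per-pin data rate. As a minimal sketch (the function name is ours, not from any vendor tool), the theoretical peak can be reproduced like this:

```python
def peak_bandwidth_gbs(bus_width_bits: int, data_rate_gbit_s: float) -> float:
    """Theoretical peak in GB/s: bytes moved per transfer times transfers per second."""
    return bus_width_bits / 8 * data_rate_gbit_s


# (bus width in bits, per-pin data rate in Gbit/s) as quoted in the list above
series = {
    "A300": (96, 14.0),
    "A500": (192, 16.0),
    "A700": (256, 16.0),
}

for name, (bits, rate) in series.items():
    print(f"{name}: {peak_bandwidth_gbs(bits, rate):.0f} GB/s")
```

The A500 and A700 lines reproduce the quoted ~384GB/s and ~512GB/s. For the A300 line, 96-bit at 14Gbit/s works out to 168GB/s, while the quoted ~192GB/s would correspond to 16Gbit/s, so one of those two pre-release figures is presumably off.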

ARC “Alchemist” A380 GP-GPU Performance Benchmarking

In this article we test GP-GPU core performance; please see our other articles on:

GP-GPU
- Intel DG1 (Iris Xe Max Gen12) Review & Benchmarks – GP-GPU Performance
- Intel Iris Plus G7 Gen11 IceLake ULV (i7-1065G7) Review & Benchmarks – GP-GPU Performance

Hardware Specifications

We are comparing the “entry-level” A380 GP-GPU with the competition, with a view to upgrading an inexpensive system.

Specifications	Intel Arc A380 (DG2)	nVidia GeForce GTX 1660 Ti	nVidia GeForce RTX 3050	AMD Radeon RX 6500 XT	Comments
Arch / Chipset	Xe2 (EV12+)	TU116 (Turing)	GA106 (Ampere)	Navi II 24 XT	Gen 12+ graphics – the latest.
Cores (CU) / Threads (SP)	128 / 1,024	12 / 1,536	20 / 1,280	16 / 1,024	Similar SP count to the competition
Tensor (TU) Cores	16	–	20	–	Now with extra tensors on top!
Speed (Min-Turbo)	~2.45GHz	1.5-1.78GHz	1.55-1.78GHz	2.3-2.6GHz	Turbo speed is pretty high.
Power (TDP)	~75W	~120W	~130W	~107W	TDP is lowest… on paper at least
ROP / TMU	32 / 64	48 / 96	32 / 80	32 / 64	ROPs and TMUs also increased.
Shared Memory	64kB	48kB	96kB	64kB	Standard shared memory size.
Constant Memory	2.4GB	64kB	64kB	2GB	Using global as const sadly like AMD.
Global Memory	6GB GDDR6 96-bit	6GB GDDR6 192-bit	8GB GDDR6 128-bit	4GB GDDR6 64-bit	Quite narrow bus but decent memory
Memory Bandwidth	~192GB/s	~288GB/s	~224GB/s	~144GB/s	Somewhat low bandwidth
L1 Caches	64kB	64kB	128kB	128kB	L1 could be larger
L2 / L3 Cache	1MB	1.5MB	2MB	1MB / 16MB	L2 seems quite small
Maximum Work-group Size	1,024 x 1,024	1,024 x 1,024	1,024 x 1,024	1,024 x 1,024	Same workgroup size
FP64/double ratio	No!	Yes 1/32x	Yes 1/64x	Yes 1/16x	No FP64 support?
FP16/half ratio	Yes, 2x	Yes, 2x	Yes, 2x	Yes, 2x	Same 2x ratio
Price / RRP (USD)	~$199?	~$280	~$249	~$199	Keen price

Disclaimer

This is an independent review (critical appraisal) that has not been endorsed nor sponsored by any entity (e.g. Intel, etc.). All trademarks acknowledged and used for identification only under fair use.

The review contains only public information, none of it provided under NDA or embargo. At publication time, the products had not been directly tested by SiSoftware but had been submitted to the public Benchmark Ranker; thus the accuracy of the benchmark scores cannot be verified. However, they appear consistent and pass current validation checks.

And please, don’t forget small ISVs like ourselves in these very challenging times. Please buy a copy of Sandra if you find our software useful. Your custom means everything to us!

Native OpenCL Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from both Intel and the competition (the nVidia results use CUDA).

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel graphics drivers. Turbo / Boost was enabled on all configurations.

Processing Benchmarks		Intel Arc A380 (DG2) OpenCL	nVidia GeForce GTX 1660 Ti (Turing) CUDA	nVidia GeForce RTX 3050 (Ampere) CUDA	AMD Radeon RX 6500 XT (Navi II) OpenCL	Comments

	Mandel FP16/Half (Mpix/s)	12,948 [-17%]	15,401	15,612	11,057	ARC is 17% slower than nVidia.
	Mandel FP32/Single (Mpix/s)	6,389 [-44%]	8,896	11,507	7,347	Standard FP32 is 44% slower, almost 1/2 the speed.
	Mandel FP64/Double (Mpix/s)	312* [+22%]	289	256	591	Without native support, ARC is 22% faster!
	Mandel FP128/Quad (Mpix/s)	33.95 [3.5x]	10.94	9.65		Emulated FP128 is over 3x faster!
Starting off, we see reasonable FP16 performance (ahead of AMD, though behind both nVidia cards) but FP32 performance is the lowest of them all. Despite having no native support for FP64, Sandra’s emulation through FP32 is faster than the 1/32 or 1/64 rate of nVidia’s GPUs. AMD is the only one with good FP64 performance.
* Emulated FP64 through FP32, no native support (!)
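To illustrate what emulating FP64 through FP32 can look like, here is a minimal sketch of the generic “double-float” technique, where a value is kept as an unevaluated sum of two FP32 numbers. This is an assumption about the general approach, not Sandra’s actual kernel; the helper names (`f32`, `two_sum`, `split`, `df_add`) are ours, and FP32 rounding is simulated on the CPU with `struct`.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def two_sum(a: float, b: float):
    """Knuth's error-free addition: s is the rounded FP32 sum, e the rounding error."""
    s = f32(a + b)
    bb = f32(s - a)
    e = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, e


def split(value: float):
    """Represent a value as a (hi, lo) pair of FP32 numbers."""
    hi = f32(value)
    lo = f32(value - hi)
    return hi, lo


def df_add(x, y):
    """Add two (hi, lo) pairs using only FP32-rounded operations."""
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    hi = f32(s + e)
    lo = f32(e - f32(hi - s))
    return hi, lo


one = split(1.0)
tiny = split(1e-9)

plain = f32(f32(1.0) + f32(1e-9))  # the small term is lost in plain FP32
hi, lo = df_add(one, tiny)
print(plain, hi + lo)
```

Each emulated addition costs several native FP32 operations, and the pair carries roughly twice the FP32 significand rather than a full FP64 one, which is the usual trade-off with this kind of emulation.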

	Crypto AES-256 (GB/s)	6.83 [1/5x]	18.19	30.62	24.03	AES does not suit ARC; it runs at about 1/5 the speed of nVidia!
	Crypto AES-128 (GB/s)	9.84	25.1			Nothing much changes here sadly.

	Crypto SHA2-256 (GB/s)	64.16 [+5%]	99.06	60.87	70.39	ARC is competitive here but not fastest.
	Crypto SHA1 (GB/s)	74.19	134.6			With SHA1 nothing much changes.
Unlike what we’ve seen when we tested the mobile/desktop Xe versions, ARC does not seem to do well in integer workloads, with AES in particular being very slow (some fixes needed, hello Intel?) while hashing is competitive but nothing special. nVidia (1660) still rules here, though AMD usually does well in hashing workloads due to the OpenCL extensions that we use for accelerating hashing. We need more!

	Black-Scholes half/FP16 (MOPT/s)	7,363				B/S benefits from FP16 reduction.
	Black-Scholes float/FP32 (MOPT/s)	4,007 [-40%]	8,349	6,632	3,815	With FP32, ARC is 40% slower.
	Binomial half/FP16 (kOPT/s)	1,157				Binomial uses thread shared data thus stresses the memory system.
	Binomial float/FP32 (kOPT/s)	1,116 [-48%]	1,553	2,128	1,285	With FP32, ARC is 48% slower
	Monte-Carlo half/FP16 (kOPT/s)	3,340				Monte-Carlo also uses thread shared data but read-only.
	Monte-Carlo float/FP32 (kOPT/s)	3,427 [-32%]	4,253	5,039	3,276	With FP32 ARC is 32% slower.
For financial FP32/FP16 workloads, ARC is 30-50% slower than the competition, not something we would expect. Unlike what we’ve seen with fractals, FP16 does not always bring big gains (except on Black-Scholes). There is no point testing or mentioning the lack of native FP64 again: you are not going to be running 64-bit financial workloads on this GP-GPU. For a discrete GP-GPU this is a bit unexpected, but then again it’s low-end.
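For readers unfamiliar with the first of these workloads, the sketch below prices one European option with the standard closed-form Black-Scholes formula. It is an illustration of the kind of arithmetic involved, not the benchmark’s kernel; the function names and the sample inputs are our own.

```python
import math


def norm_cdf(x: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes(spot, strike, rate, vol, years):
    """Return (call, put) prices of a European option."""
    sqrt_t = math.sqrt(years)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * years) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    discount = math.exp(-rate * years)
    call = spot * norm_cdf(d1) - strike * discount * norm_cdf(d2)
    put = strike * discount * norm_cdf(-d2) - spot * norm_cdf(-d1)
    return call, put


# Hypothetical inputs: at-the-money option, 5% rate, 20% volatility, 1 year
call, put = black_scholes(100.0, 100.0, 0.05, 0.2, 1.0)
print(f"call={call:.4f} put={put:.4f}")
```

Every option is priced independently from a handful of log/exp/erf evaluations, so a result in MOPT/s is simply millions of such evaluations per second. Binomial and Monte-Carlo pricing are iterative and, as the table notes, lean on shared data instead.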

	HGEMM (GFLOPS) half/FP16	2,987	1,247			ARC does well against nVidia.
	SGEMM (GFLOPS) float/FP32	2,454 [-55%]	3,390	5,405	1,942	With FP32, ARC is 55% slower.
	HFFT (GFLOPS) half/FP16	218	438			ARC is 1/2 nVidia here.
	SFFT (GFLOPS) float/FP32	135 [-46%]	285	248	137	With FP32, ARC is 46% slower.
	HNBODY (GFLOPS) half/FP16	5,728	4,110			ARC is 39% faster than nVidia with FP16.
	SNBODY (GFLOPS) float/FP32	3,268 [-27%]	3,007	4,462	2,554	With FP32, ARC is 27% slower
On scientific algorithms (FP32 and FP16), ARC does not do much better – it is again between 25-55% slower than the nVidia competition but at least seems to beat AMD. Hopefully deploying the tensors will improve performance and allow it to gain ground. Shall we mention lack of FP64 again? No we won’t.
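As a reminder of what a GEMM figure in GFLOPS means, the sketch below uses the conventional count of 2·M·N·K floating-point operations per matrix multiply (one multiply and one add per inner-product term). Whether Sandra counts operations exactly this way is an assumption on our part; the naive pure-Python multiply is only there to have something to time.

```python
import time


def matmul(a, b):
    """Naive matrix multiply of two lists-of-lists."""
    b_columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_columns] for row in a]


def gemm_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """Conventional GEMM rate: 2*M*N*K operations divided by the elapsed time."""
    return 2.0 * m * n * k / seconds / 1e9


n = 64
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i - j) for j in range(n)] for i in range(n)]

start = time.perf_counter()
c = matmul(a, b)
elapsed = time.perf_counter() - start
print(f"{gemm_gflops(n, n, n, elapsed):.4f} GFLOPS (pure Python on the CPU)")
```

Under this convention, a 1,000 x 1,000 x 1,000 multiply finishing in one second would be 2 GFLOPS, so the SGEMM scores in the table correspond to thousands of times that rate.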

	Blur (3×3) Filter half/FP16 (MPix/s)	7,351	10,508			FP16 performance is better but not 2x.
	Blur (3×3) Filter single/FP32 (MPix/s)	5,608 [-44%]	8,051	10,012	9,350	In this 3×3 convolution, ARC is at about 1/2 the speed.
	Sharpen (5×5) Filter half/FP16 (MPix/s)	3,748	3,915			Here FP16 is over 2x faster.
	Sharpen (5×5) Filter single/FP32 (MPix/s)	1,526 [-56%]	2,161	3,482	3,629	Same algorithm but more shared data, 56% slower.
	Motion Blur (7×7) Filter half/FP16 (MPix/s)	1,982	3,814			FP16 is about 30% faster.
	Motion Blur (7×7) Filter single/FP32 (MPix/s)	1,509 [-57%]	2,231	3,519	3,207	With even more data ARC is 57% slower.
	Edge Detection (2*5×5) Sobel Filter half/FP16 (MPix/s)	3,421	3,761			FP16 is over 2x faster but not fastest.
	Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s)	1,530 [-55%]	2,256	3,412	3,625	Still convolution but with 2 filters, 55% slower.
	Noise Removal (5×5) Median Filter half/FP16 (MPix/s)	36.12	51.51			FP16 is over 2x faster!
	Noise Removal (5×5) Median Filter single/FP32 (MPix/s)	16.72 [1/5x]	26.02	75.59	37.64	Different algorithm, ARC is 1/5 the speed!
	Oil Painting Quantise Filter half/FP16 (MPix/s)	61.76	24.91			FP16 is about half the speed of FP32 here.
	Oil Painting Quantise Filter single/FP32 (MPix/s)	126 [3.2x]	29.45	38.6	37.64	Without major processing, ARC is 3x faster here.
	Diffusion Randomise (XorShift) Filter half/FP16 (MPix/s)	16,216	8,311			FP16 is not much faster.
	Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s)	16,147 [-24%]	19,772	21,347	14,779	This algorithm is 64-bit int heavy; ARC is 24% slower.
	Marbling Perlin Noise 2D Filter half/FP16 (MPix/s)	2,599	622			FP16 is somehow slower.
	Marbling Perlin Noise 2D Filter single/FP32 (MPix/s)	3,967 [6x]	779	665	6,600	One of the most complex filters, ARC is 6x faster than nVidia.
For image processing tasks (FP16, FP32), ARC does not improve much: nVidia has nothing to worry about, and AMD is also ahead in most of the FP32 filters. (There is no FP64 processing here; few picture filters would require that kind of precision 😉) FP16 performance is encouraging, though again we have some questionable results; optimisations and fixes may be needed to take full advantage of the improved processing speed. (Hello, Intel?)
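Most of the filters above are convolutions, where each output pixel is a weighted sum of its neighbours. The following is a minimal CPU sketch of that idea with a box-blur kernel and clamped borders; it is our own illustration and makes no claim about how the benchmark’s GPU kernels are written.

```python
def convolve(image, kernel):
    """Apply a square kernel to a 2D list of floats; border pixels are clamped."""
    height, width = len(image), len(image[0])
    radius = len(kernel) // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for ky in range(-radius, radius + 1):
                for kx in range(-radius, radius + 1):
                    sy = min(max(y + ky, 0), height - 1)
                    sx = min(max(x + kx, 0), width - 1)
                    acc += image[sy][sx] * kernel[ky + radius][kx + radius]
            out[y][x] = acc
    return out


def box_kernel(size):
    """Uniform averaging kernel of size x size."""
    return [[1.0 / (size * size)] * size for _ in range(size)]


# Hypothetical 8x8 checkerboard test image
image = [[float((x + y) % 2) for x in range(8)] for y in range(8)]
blurred = convolve(image, box_kernel(3))

for size in (3, 5, 7):
    print(f"{size}x{size} kernel: {size * size} neighbour reads per output pixel")
```

The neighbour count grows from 9 to 25 to 49 as the kernel widens, which is why the larger filters put more pressure on shared memory and caches, in line with the comments in the table.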

	Internal Memory Bandwidth (GB/s)	137 [-29%]	238	194	114	ARC has 30% less bandwidth.
	Upload Bandwidth (GB/s)	5.23 [-55%]	12	11.71	2.88	Uploads seem to be very slow
	Download Bandwidth (GB/s)	5.21 [-57%]	12	12.25	3.06	Downloads are also slow.
ARC seems to have the lowest bandwidth (except AMD) and uploads/downloads over the PCIe4 bus are also very slow for some reason. nVidia just dominates everything here. It seems the OpenCL code-path requires some more workarounds.
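One way to put the internal bandwidth results in context is to compare them with the theoretical peak from the specifications table. The sketch below does just that, using only figures quoted in this article; the dictionary layout and function name are ours.

```python
# (measured internal bandwidth, theoretical peak) in GB/s, both from the tables above
cards = {
    "Arc A380": (137, 192),
    "GTX 1660 Ti": (238, 288),
    "RTX 3050": (194, 224),
    "RX 6500 XT": (114, 144),
}


def efficiency(measured: float, peak: float) -> float:
    """Fraction of the theoretical peak bandwidth actually achieved."""
    return measured / peak


for name, (measured, peak) in cards.items():
    print(f"{name}: {efficiency(measured, peak):.0%} of peak")
```

By this measure ARC reaches the smallest share of its paper bandwidth of the four cards, which would be consistent with the code-path needing further tuning.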

	Aggregate Score (Points)	7,661 [-42%]	11,870	13,230	8,350	Across all benchmarks, ARC is 42% slower.
Despite some good individual results, ARC still seems to need some driver work/optimisations to reach its full potential. It ends up 42% slower than nVidia’s RTX 3050 and about 35% slower than the 1660 Ti, both of which have the advantage of rock-solid CUDA against variable OpenCL performance.
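The bracketed percentages in the tables are relative differences between ARC and a reference card. A minimal sketch of the calculation, applied to the aggregate scores above (the function name is ours):

```python
def relative_diff(value: float, reference: float) -> float:
    """Percentage by which value is above (+) or below (-) the reference."""
    return (value - reference) / reference * 100.0


aggregate = {
    "Arc A380": 7661,
    "GTX 1660 Ti": 11870,
    "RTX 3050": 13230,
    "RX 6500 XT": 8350,
}

arc = aggregate["Arc A380"]
for name, score in aggregate.items():
    if name != "Arc A380":
        print(f"vs {name}: {relative_diff(arc, score):+.0f}%")
```

With these scores the bracketed -42% matches the RTX 3050 column; against the GTX 1660 Ti the gap works out to roughly -35%, and against the RX 6500 XT to roughly -8%.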

	Price/RRP (USD)	~$199 [-20%]	~$280	~$249	~$199	The price is good at 20% lower than nVidia.

	Price Efficiency (Perf. vs. Cost) (Points/USD)	38.50 [-28%]	42.39	53.13	41.96	ARC ends up 28% less price efficient.
Going by RRP prices, ARC ends up the least value of them all: 28% less “bang-per-buck” than nVidia’s RTX 3050 and about 9% less than the 1660 Ti; it is lower than even AMD’s Navi II. Yes, nVidia still rules despite the higher price: you do get much better overall performance, but you need to pony up more money. You get what you pay for, right?

	Power/TDP (W)	~75W [-42%]	~120W	~130W	~107W	ARC’s TDP is 42% lower than nVidia’s.

	Power Efficiency (Perf. vs. Power) (Points/W)	102 [=]	99	102	78.04	Somehow ARC ends up absolutely tied with nVidia.
Going by TDP (which may or may not reflect reality), ARC ends up absolutely tied with nVidia’s RTX 3050, which is no mean feat. nVidia’s 1660 Ti ends up a whisker less efficient and AMD brings up the rear.
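Both efficiency metrics are simple ratios of the aggregate score to the RRP and to the TDP. The sketch below recomputes them from the figures in the tables above; the `Card` class is just a convenient container of our own.

```python
from dataclasses import dataclass


@dataclass
class Card:
    name: str
    score: int      # aggregate score in points
    price_usd: int  # approximate RRP
    tdp_w: int      # approximate TDP

    @property
    def points_per_usd(self) -> float:
        return self.score / self.price_usd

    @property
    def points_per_watt(self) -> float:
        return self.score / self.tdp_w


cards = [
    Card("Arc A380", 7661, 199, 75),
    Card("GTX 1660 Ti", 11870, 280, 120),
    Card("RTX 3050", 13230, 249, 130),
    Card("RX 6500 XT", 8350, 199, 107),
]

for card in cards:
    print(f"{card.name}: {card.points_per_usd:.2f} points/USD, "
          f"{card.points_per_watt:.2f} points/W")
```

Since both inputs are paper figures (RRP and TDP), street prices or measured power draw could shift these rankings noticeably.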


Final Thoughts / Conclusions

Summary: OK for the price, nothing special: 7/10

Intel is back doing discrete graphics – yay! There are more options at the low end – yay! No need to pair discrete AMD or nVidia GP-GPU with Intel CPU/Mainboard – err… not so fast…

nVidia has nothing to worry about – here we’ve only tested the low-end – and despite being more expensive it is definitely worth it. CUDA performance is the most consistent and even the “old” Turing has no problem dispatching both Intel ARC and AMD “Navi II” competition. Again, OpenCL support still leaves a lot to be desired (both Intel and AMD) and DirectX is really for games, not optimised for compute. Still, the A380 is competitive and should improve with drivers and optimisations – should Intel decide to support ISVs like ourselves – but your 2nd choice should be AMD as Navi has been around much longer and the drivers are mature.

As with the integrated graphics versions, we lack FP64 support; at this (entry) level it does not matter, though for the top-end (Xe-HP) the lack of it would be quite disastrous. Fortunately, nVidia has hobbled its consumer cards’ FP64 at 1/64x rate which makes it pretty much unusable… AMD is still your choice here if you don’t have the money (who does these days?) and need native FP64 support.

Tensor (matrix multiplier) support should improve performance (GEMM, CNN/RNN neural-networks, image processing) – but is limited to FP16/Int8 and won’t help standard FP32 performance. We will likely have to wait for updated tensors in future versions – just as nVidia has done with “Ampere”.

Perhaps it was too much to expect an nVidia/AMD killer, but all in all it is a decent effort. YMMV (Your Mileage May Vary).

