Intel ARC (“Alchemist”) A380 GP-GPU Graphics – OpenCL Performance

What is “Alchemist”?

It is the 2nd generation of Intel discrete graphics (DG2), based on the Xe architecture that originally powered the integrated graphics (Xe-LP) of mobile processors from Gen 11 (“TigerLake” TGL) onwards. Intel also launched a limited desktop DG1 product as well as a limited server/workstation Xe-HP compute version.

With DG2 – Intel is back with a full stack, from low-end through mid-range to high-end – as well as workstation/server (data as per Tom’s Hardware: Intel Arc Alchemist: Release Date, Specs, Everything We Know):

  • Up to 512CU / 4,096 SP
    • A300 series 128CU / 1,024SP
    • A500 series 384CU / 3,072SP
    • A700 series 512CU / 4,096SP
  • Up to 16GB GDDR6
    • A300? series 6GB GDDR6 96-bit 14Gbit/s ~192GB/s
    • A500? series 12GB GDDR6 192-bit 16Gbit/s ~384GB/s
    • A700? series 16GB GDDR6 256-bit 16Gbit/s ~512GB/s
  • 7nm TSMC process
  • OpenCL 3.0, DirectX 12 Ultimate support
  • FP16 half-floating point support
  • NO FP64 double floating-point support (!)
  • Tensors (“Matrix Engines” aka XMX) for Deep Learning
    • FP16 and Int8 support (but no FP32/FP64 support)
    • OpenCL extension – will need to be coded specifically to use it
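
The bandwidth figures in the list above follow from bus width and per-pin data rate. The snippet below is a minimal sketch of that arithmetic; the function name is ours, and the inputs are simply the numbers quoted in the list.

```python
def peak_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    Each pin of the bus moves `data_rate_gbps` gigabits per second;
    dividing by 8 converts bits to bytes.
    """
    return bus_width_bits * data_rate_gbps / 8


# (bus width in bits, per-pin data rate in Gbit/s) as quoted in the list
configs = {
    "A300 series": (96, 14),
    "A500 series": (192, 16),
    "A700 series": (256, 16),
}

for name, (width, rate) in configs.items():
    print(f"{name}: {peak_bandwidth_gbs(width, rate):.0f} GB/s")
```

The A500 and A700 entries reproduce the quoted ~384GB/s and ~512GB/s. For the A300 entry the formula gives 168GB/s at 14Gbit/s; the quoted ~192GB/s would correspond to 16Gbit/s on a 96-bit bus, so one of the two pre-launch figures is likely off.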

The specifications look good and are competitive, although we seem to be missing native FP64 support, just as on the previous mobile/desktop Xe. We do gain tensors (matrix multipliers) and support for FP16/Int8 that should help low-precision neural networks. For high-precision workloads (FP64), it seems AMD is your only option – unless paying nVidia for professional graphics…

ARC “Alchemist” A380 GP-GPU Performance Benchmarking

In this article we test GP-GPU core performance.

Hardware Specifications

We are comparing the “entry-level” range A380 GP-GPU with the competition with a view to upgrading an inexpensive system.

| Specifications | Intel Arc A380 (DG2) | nVidia GeForce GTX 1660 Ti | nVidia GeForce RTX 3050 | AMD Radeon RX 6500 XT | Comments |
|---|---|---|---|---|---|
| Arch / Chipset | Xe2 (EV12+) | TU116 (Turing) | GA106 (Ampere) | Navi II 24 XT | Gen 12+ graphics – the latest. |
| Cores (CU) / Threads (SP) | 128 / 1,024 | 12 / 1,536 | 20 / 1,280 | 16 / 1,024 | Similar SPs count as competition |
| Tensor (TU) Cores | 16 | - | 20 | - | Now with extra tensors on top! |
| Speed (Min-Turbo) | ~2.45GHz | 1.5-1.78GHz | 1.55-1.78GHz | 2.3-2.6GHz | Turbo speed is pretty high. |
| Power (TDP) | ~75W | ~120W | ~130W | ~107W | TDP is lowest… on paper at least |
| ROP / TMU | 32 / 64 | 48 / 96 | 32 / 80 | 32 / 64 | ROPs and TMUs also increased. |
| Shared Memory | 64kB | 48kB | 96kB | 64kB | Standard shared memory size. |
| Constant Memory | 2.4GB | 64kB | 64kB | 2GB | Using global as const sadly like AMD. |
| Global Memory | 6GB GDDR6 96-bit | 6GB GDDR6 192-bit | 8GB GDDR6 128-bit | 4GB GDDR6 64-bit | Quite narrow bus but decent memory |
| Memory Bandwidth | ~192GB/s | ~288GB/s | ~224GB/s | ~144GB/s | Somewhat low bandwidth |
| L1 Caches | 64kB | 64kB | 128kB | 128kB | L1 could be larger |
| L2 / L3 Cache | 1MB | 1.5MB | 2MB | 1MB / 16MB | L2 seems quite small |
| Maximum Work-group Size | 1,024 x 1,024 | 1,024 x 1,024 | 1,024 x 1,024 | 1,024 x 1,024 | Same workgroup size |
| FP64/double ratio | No! | Yes 1/32x | Yes 1/64x | Yes 1/16x | No FP64 support? |
| FP16/half ratio | Yes, 2x | Yes, 2x | Yes, 2x | Yes, 2x | Same 2x ratio |
| Price / RRP (USD) | ~$199? | ~$280 | ~$249 | ~$199 | Keen price |

Disclaimer

This is an independent review (critical appraisal) that has not been endorsed nor sponsored by any entity (e.g. Intel, etc.). All trademarks acknowledged and used for identification only under fair use.

The review contains only public information; none of it was provided under NDA or embargo. At publication time, the products have not been directly tested by SiSoftware but submitted to the public Benchmark Ranker; thus the accuracy of the benchmark scores cannot be verified. However, they appear consistent and pass current validation checks.

And please, don’t forget small ISVs like ourselves in these very challenging times. Please buy a copy of Sandra if you find our software useful. Your custom means everything to us!

Native OpenCL Performance

We are testing native GP-GPU performance (OpenCL on Intel and AMD, CUDA on nVidia) using the latest SDK / libraries / drivers from both Intel and the competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel graphics drivers. Turbo / Boost was enabled on all configurations.

| Processing Benchmarks | Intel Arc A380 (DG2) OpenCL | nVidia GeForce GTX 1660 Ti (Turing) CUDA | nVidia GeForce RTX 3050 (Ampere) CUDA | AMD Radeon RX 6500 XT (Navi II) OpenCL | Comments |
|---|---|---|---|---|---|
| GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) | 12,948 [-17%] | 15,401 | 15,612 | 11,057 | ARC is 17% slower than nVidia. |
| GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) | 6,389 [-44%] | 8,896 | 11,507 | 7,347 | With standard FP32, ARC is almost half the speed. |
| GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) | 312* [+22%] | 289 | 256 | 591 | Without native support, ARC is 22% faster! |
| GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) | 33.95* [+3.5x] | 10.94 | 9.65 | - | Emulated FP128 is over 3x faster! |

Starting off, we see competitive FP16 performance (beating AMD, though not nVidia) but FP32 performance is the lowest of them all.
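
The bracketed figures next to the ARC scores are relative differences against a reference card. The sketch below recomputes them for the Mandel rows; it assumes the reference is the RTX 3050 column, which is our inference from the numbers rather than something the tables state.

```python
def relative_delta(score: float, reference: float) -> float:
    """Relative difference of `score` versus `reference`, in percent."""
    return (score / reference - 1.0) * 100.0


# (Arc A380 score, RTX 3050 score) taken from the Mandel rows
mandel = {
    "FP16": (12948, 15612),
    "FP32": (6389, 11507),
    "FP64": (312, 256),
}

for name, (arc, rtx3050) in mandel.items():
    print(f"Mandel {name}: {relative_delta(arc, rtx3050):+.0f}%")
```

With these inputs the three results round to -17%, -44% and +22%, matching the bracketed values in the table.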

Despite having no native support for FP64 – Sandra’s FP32 emulation is faster than the 1/32 or 1/64 ratio of nVidia’s GPUs. AMD is the only one with good FP64 performance.

* Emulated FP64 through FP32, no native support (!)
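
For readers wondering how higher precision can be emulated on FP32-only hardware, one common approach is "float-float" arithmetic, where each value is kept as an unevaluated sum of two FP32 numbers. The sketch below illustrates the idea for addition only. It is not Sandra's kernel (which we have not shown here), and the helper names are ours; Python floats are FP64, so every intermediate result is rounded to FP32 explicitly.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def two_sum(a: float, b: float):
    """Error-free FP32 addition: s is the rounded sum, e the rounding error."""
    s = f32(a + b)
    v = f32(s - a)
    e = f32(f32(a - f32(s - v)) + f32(b - v))
    return s, e


def split(x: float):
    """Represent a value as a (hi, lo) pair of FP32 numbers."""
    hi = f32(x)
    lo = f32(x - hi)
    return hi, lo


def ff_add(x, y):
    """Add two (hi, lo) pairs using only FP32-rounded operations."""
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    hi = f32(s + e)
    lo = f32(e - f32(hi - s))  # fast two-sum, valid because |s| >= |e|
    return hi, lo


plain = f32(f32(1.0) + f32(1e-10))   # plain FP32 drops the small term
hi, lo = ff_add(split(1.0), split(1e-10))

print(plain)
print(hi + lo)                       # the pair keeps the small term
```

Each emulated addition costs around ten FP32 operations, which is why emulated throughput sits far below native FP32 throughput yet can still compete with hardware running FP64 at 1/32 or 1/64 rate.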

| Processing Benchmarks | Intel Arc A380 (DG2) OpenCL | nVidia GeForce GTX 1660 Ti (Turing) CUDA | nVidia GeForce RTX 3050 (Ampere) CUDA | AMD Radeon RX 6500 XT (Navi II) OpenCL | Comments |
|---|---|---|---|---|---|
| GPGPU Crypto Benchmark Crypto AES-256 (GB/s) | 6.83 [1/5x] | 18.19 | 30.62 | 24.03 | AES does not suit ARC, it is 1/5x speed of nVidia! |
| GPGPU Crypto Benchmark Crypto AES-128 (GB/s) | 9.84 | 25.1 | - | - | Nothing much changes here sadly. |
| GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) | 64.16 [+5%] | 99.06 | 60.87 | 70.39 | ARC is competitive here but not fastest. |
| GPGPU Crypto Benchmark Crypto SHA1 (GB/s) | 74.19 | 134.6 | - | - | With SHA1 nothing much changes. |

Unlike what we’ve seen when we tested the mobile/desktop Xe versions – ARC does not seem to do well in integer workloads, with AES in particular being very slow (some fixes needed, hello Intel?) while hashing is competitive but nothing special.

nVidia (1660) still rules here though AMD usually does well in hashing workloads due to OpenCL extensions that we use for accelerating hashing. We need more!

| Processing Benchmarks | Intel Arc A380 (DG2) OpenCL | nVidia GeForce GTX 1660 Ti (Turing) CUDA | nVidia GeForce RTX 3050 (Ampere) CUDA | AMD Radeon RX 6500 XT (Navi II) OpenCL | Comments |
|---|---|---|---|---|---|
| GPGPU Finance Benchmark Black-Scholes half/FP16 (MOPT/s) | 7,363 | - | - | - | B/S benefits from FP16 reduction. |
| GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) | 4,007 [-40%] | 8,349 | 6,632 | 3,815 | With FP32, ARC is 40% slower. |
| GPGPU Finance Benchmark Binomial half/FP16 (kOPT/s) | 1,157 | - | - | - | Binomial uses thread shared data thus stresses the memory system. |
| GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) | 1,116 [-48%] | 1,553 | 2,128 | 1,285 | With FP32, ARC is 48% slower. |
| GPGPU Finance Benchmark Monte-Carlo half/FP16 (kOPT/s) | 3,340 | - | - | - | Monte-Carlo also uses thread shared data but read-only. |
| GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) | 3,427 [-32%] | 4,253 | 5,039 | 3,276 | With FP32, ARC is 32% slower. |

For financial FP32/FP16 workloads, ARC is 30-50% slower than the competition, not something we would expect. Unlike what we’ve seen with fractals, FP16 does not always bring big gains (except on Black-Scholes).

No point to test or mention lack of native FP64 again. You are not going to be running 64-bit financial workloads on this GP-GPU. For a discrete GP-GPU this is a bit unexpected but then again it’s low-end.

| Processing Benchmarks | Intel Arc A380 (DG2) OpenCL | nVidia GeForce GTX 1660 Ti (Turing) CUDA | nVidia GeForce RTX 3050 (Ampere) CUDA | AMD Radeon RX 6500 XT (Navi II) OpenCL | Comments |
|---|---|---|---|---|---|
| GPGPU Science Benchmark HGEMM (GFLOPS) half/FP16 | 2,987 | 1,247 | - | - | ARC does well against nVidia. |
| GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 | 2,454 [-55%] | 3,390 | 5,405 | 1,942 | With FP32, ARC is 55% slower. |
| GPGPU Science Benchmark HFFT (GFLOPS) half/FP16 | 218 | 438 | - | - | ARC is 1/2 nVidia here. |
| GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 | 135 [-56%] | 285 | 248 | 137 | With FP32, ARC is 56% slower. |
| GPGPU Science Benchmark HNBODY (GFLOPS) half/FP16 | 5,728 | 4,110 | - | - | ARC is 25% faster with FP16. |
| GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 | 3,268 [-27%] | 3,007 | 4,462 | 2,554 | With FP32, ARC is 27% slower. |

On scientific algorithms (FP32 and FP16), ARC does not do much better – it is again between 25-55% slower than the nVidia competition but at least seems to beat AMD. Hopefully deploying the tensors will improve performance and allow it to gain ground.
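
To put the GEMM numbers in context, it helps to compare them with the theoretical peak of the A380. The sketch below assumes one fused multiply-add (2 FLOPs) per shader per clock, a sustained ~2.45GHz clock, and the 2x FP16 rate listed in the specifications; these are simplifying assumptions, and real sustained clocks will differ.

```python
def peak_gflops(shaders: int, clock_ghz: float, flops_per_clock: int = 2) -> float:
    """Theoretical peak in GFLOPS: shaders x FLOPs per clock x clock (GHz)."""
    return shaders * flops_per_clock * clock_ghz


# Arc A380 figures from the specifications table: 1,024 SP at ~2.45GHz
a380_peak_fp32 = peak_gflops(1024, 2.45)
a380_peak_fp16 = 2 * a380_peak_fp32          # assumes the listed 2x FP16 rate

sgemm_measured = 2454                         # GFLOPS, from the science table
hgemm_measured = 2987

print(f"FP32 peak {a380_peak_fp32:.0f} GFLOPS, SGEMM reaches "
      f"{sgemm_measured / a380_peak_fp32:.0%}")
print(f"FP16 peak {a380_peak_fp16:.0f} GFLOPS, HGEMM reaches "
      f"{hgemm_measured / a380_peak_fp16:.0%}")
```

Under these assumptions SGEMM reaches roughly half of the theoretical FP32 peak and HGEMM under a third of the FP16 peak, which is consistent with the view that drivers and kernels still leave performance on the table.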

Shall we mention lack of FP64 again? No we won’t.

| Processing Benchmarks | Intel Arc A380 (DG2) OpenCL | nVidia GeForce GTX 1660 Ti (Turing) CUDA | nVidia GeForce RTX 3050 (Ampere) CUDA | AMD Radeon RX 6500 XT (Navi II) OpenCL | Comments |
|---|---|---|---|---|---|
| GPGPU Image Processing Blur (3×3) Filter half/FP16 (MPix/s) | 7,351 | 10,508 | - | - | FP16 performance is better but not 2x. |
| GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) | 5,608 [-56%] | 8,051 | 10,012 | 9,350 | In this 3×3 convolution, ARC is 1/2x speed. |
| GPGPU Image Processing Sharpen (5×5) Filter half/FP16 (MPix/s) | 3,748 | 3,915 | - | - | Again FP16 is 2x faster. |
| GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) | 1,526 [-66%] | 2,161 | 3,482 | 3,629 | Same algorithm but more shared data, 66% slower. |
| GPGPU Image Processing Motion Blur (7×7) Filter half/FP16 (MPix/s) | 1,982 | 3,814 | - | - | FP16 is 50% faster. |
| GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) | 1,509 [-57%] | 2,231 | 3,519 | 3,207 | With even more data ARC is 57% slower. |
| GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter half/FP16 (MPix/s) | 3,421 | 3,761 | - | - | FP16 is almost 2x faster but not fastest. |
| GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) | 1,530 [-55%] | 2,256 | 3,412 | 3,625 | Still convolution but with 2 filters – 55% slower. |
| GPGPU Image Processing Noise Removal (5×5) Median Filter half/FP16 (MPix/s) | 36.12 | 51.51 | - | - | FP16 is over 2x faster! |
| GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) | 16.72 [1/5x] | 26.02 | 75.59 | 37.64 | Different algorithm, ARC is 1/5 the speed! |
| GPGPU Image Processing Oil Painting Quantise Filter half/FP16 (MPix/s) | 61.76 | 24.91 | - | - | FP16 is just 4% faster. |
| GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) | 126 [3.2x] | 29.45 | 38.6 | 37.64 | Without major processing, ARC is 3x faster here. |
| GPGPU Image Processing Diffusion Randomise (XorShift) Filter half/FP16 (MPix/s) | 16,216 | 8,311 | - | - | FP16 is not much faster. |
| GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) | 16,147 [-33%] | 19,772 | 21,347 | 14,779 | This algorithm is 64-bit integer heavy; ARC is 33% slower. |
| GPGPU Image Processing Marbling Perlin Noise 2D Filter half/FP16 (MPix/s) | 2,599 | 622 | - | - | FP16 is somehow slower. |
| GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) | 3,967 [6x] | 779 | 665 | 6,600 | One of the most complex filters, ARC is 6x faster. |

For image processing tasks (FP16, FP32), ARC does not improve much – it also trails AMD in most FP32 filters – and nVidia has nothing to worry about. (No FP64 processing here; few picture filters would require that kind of precision 😉)

FP16 performance is encouraging, though again we have some questionable results – again optimisations and fixes may be needed to take full advantage of improved processing speed. (Hello, Intel?)

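| Processing Benchmarks | Intel Arc A380 (DG2) OpenCL | nVidia GeForce GTX 1660 Ti (Turing) CUDA | nVidia GeForce RTX 3050 (Ampere) CUDA | AMD Radeon RX 6500 XT (Navi II) OpenCL | Comments |
|---|---|---|---|---|---|
| GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) | 137 [-29%] | 238 | 194 | 114 | ARC has 30% less bandwidth. |
| GPGPU Memory Bandwidth Upload Bandwidth (GB/s) | 5.23 [-55%] | 12 | 11.71 | 2.88 | Uploads seem to be very slow. |
| GPGPU Memory Bandwidth Download Bandwidth (GB/s) | 5.21 [-57%] | 12 | 12.25 | 3.06 | Downloads are also slow. |

ARC has the lowest internal bandwidth of the group except for AMD, and uploads/downloads through the PCIe4 bus are also very slow for some reason. nVidia just dominates everything here. It seems the OpenCL code-path requires some more workarounds.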
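
As a rough yardstick for the upload/download figures, the theoretical one-direction bandwidth of a PCIe 4.0 link can be worked out from the lane count. This is a minimal sketch: the link width each card actually negotiated is not stated in the results, so the lane counts below are hypothetical.

```python
def pcie_bandwidth_gbs(lanes: int, gt_per_s: float = 16.0) -> float:
    """Theoretical one-direction PCIe bandwidth in GB/s.

    Defaults to the PCIe 4.0 rate of 16 GT/s per lane; the 128/130 factor
    is the line-encoding overhead used by PCIe 3.0 and later.
    """
    return lanes * gt_per_s * (128 / 130) / 8


measured_upload = 5.23  # GB/s, Arc A380 upload figure from the table

for lanes in (4, 8, 16):  # hypothetical link widths
    peak = pcie_bandwidth_gbs(lanes)
    print(f"x{lanes}: {peak:.2f} GB/s peak, "
          f"measured upload is {measured_upload / peak:.0%} of that")
```

Even against a narrow x4 link the measured ~5.2GB/s is well below the theoretical ceiling, which supports the suspicion that the transfer path rather than the bus itself is the limit.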
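
| Processing Benchmarks | Intel Arc A380 (DG2) OpenCL | nVidia GeForce GTX 1660 Ti (Turing) CUDA | nVidia GeForce RTX 3050 (Ampere) CUDA | AMD Radeon RX 6500 XT (Navi II) OpenCL | Comments |
|---|---|---|---|---|---|
| Aggregate Score (Points) | 7,661 [-42%] | 11,870 | 13,230 | 8,350 | Across all benchmarks, ARC is 42% slower. |

Despite some good individual results, ARC still seems to need some driver work/optimisations to reach its full potential. It ends up about 42% slower than nVidia’s RTX 3050 and about 35% slower than the 1660 Ti, both of which have the advantage of rock-solid CUDA against variable OpenCL performance.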
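
| Processing Benchmarks | Intel Arc A380 (DG2) OpenCL | nVidia GeForce GTX 1660 Ti (Turing) CUDA | nVidia GeForce RTX 3050 (Ampere) CUDA | AMD Radeon RX 6500 XT (Navi II) OpenCL | Comments |
|---|---|---|---|---|---|
| Price/RRP (USD) | ~$199 [-20%] | ~$280 | ~$249 | ~$199 | The price is good at 20% lower than nVidia. |
| Price Efficiency (Perf. vs. Cost) (Points/USD) | 38.50 [-28%] | 42.39 | 53.13 | 41.96 | ARC ends up 28% less price effective. |

Going by RRP prices, ARC ends up the least value out of them all – 28% less “bang-per-buck” than nVidia’s RTX 3050; it is lower than even AMD’s Navi II.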

Yes, nVidia still rules despite the higher price – you do get much better overall performance – but you need to pony up more money. You get what you pay for, right?

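| Processing Benchmarks | Intel Arc A380 (DG2) OpenCL | nVidia GeForce GTX 1660 Ti (Turing) CUDA | nVidia GeForce RTX 3050 (Ampere) CUDA | AMD Radeon RX 6500 XT (Navi II) OpenCL | Comments |
|---|---|---|---|---|---|
| Power/TDP (W) | ~75W [-42%] | ~120W | ~130W | ~107W | ARC’s TDP is 42% lower than nVidia’s. |
| Power Efficiency (Perf. vs. Power) (Points/W) | 102 [=] | 99 | 102 | 78.04 | Somehow ARC ends up absolutely tied with nVidia. |

Going by TDP (which may or may not reflect reality) – ARC ends up absolutely tied with nVidia’s RTX 3050 – no mean feat. nVidia’s 1660 Ti ends up a whisker less efficient and AMD brings up the rear.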
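
Both efficiency metrics are simple ratios of the aggregate score, so they are easy to recompute. The sketch below does so from the aggregate points, RRP and TDP figures quoted in the tables; the dictionary layout and names are ours.

```python
# (aggregate points, RRP in USD, TDP in W) as quoted in the tables
cards = {
    "Intel Arc A380": (7661, 199, 75),
    "GTX 1660 Ti": (11870, 280, 120),
    "RTX 3050": (13230, 249, 130),
    "RX 6500 XT": (8350, 199, 107),
}

# Price efficiency = points per USD; power efficiency = points per watt of TDP
per_usd = {name: points / usd for name, (points, usd, watts) in cards.items()}
per_watt = {name: points / watts for name, (points, usd, watts) in cards.items()}

for name in cards:
    print(f"{name}: {per_usd[name]:.2f} points/USD, "
          f"{per_watt[name]:.2f} points/W")
```

Since TDP is a nominal figure rather than measured power draw, the points-per-watt column is best read as a paper comparison only.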

Final Thoughts / Conclusions

Summary: OK for the price, nothing special: 7/10

Intel is back doing discrete graphics – yay! There are more options at the low end – yay! No need to pair discrete AMD or nVidia GP-GPU with Intel CPU/Mainboard – err… not so fast…

nVidia has nothing to worry about – here we’ve only tested the low-end – and despite being more expensive it is definitely worth it. CUDA performance is the most consistent and even the “old” Turing has no problem dispatching both Intel ARC and AMD “Navi II” competition. Again, OpenCL support still leaves a lot to be desired (both Intel and AMD) and DirectX is really for games, not optimised for compute. Still, the A380 is competitive and should improve with drivers and optimisations – should Intel decide to support ISVs like ourselves – but your 2nd choice should be AMD as Navi has been around much longer and the drivers are mature.

As with integrated graphics versions, we lack FP64 support though at this (entry) level it does not matter – though for the top-end (Xe-HP) lack of it would be quite disastrous. Fortunately, nVidia has hobbled its consumer cards’ FP64 at 1/64x rate which makes it pretty much unusable… AMD is still your choice here if you don’t have the money (who does these days?) and need FP64 native support.

Tensor (matrix multiplier) support should improve performance (GEMM, CNN/RNN neural-networks, image processing) – but is limited to FP16/Int8 and won’t help standard FP32 performance. We will likely have to wait for updated tensors in future versions – just as nVidia has done with “Ampere”.

Perhaps it was too much to expect an nVidia/AMD killer – but all in all it is a decent effort. YMMV (Your Mileage May Vary).
