AMD A4 “Mullins” APU GPGPU (Radeon R4): Time does not stand still…

What is “Mullins”?

“Mullins” (ML) is the next generation A4 “APU” SoC from AMD (v2 2015) replacing the current A4 “Kaveri” (KV) SoC which was AMD’s major foray into tablets/netbooks replacing the older “Brazos” E-Series APUs. While still at a default 15W TDP, it can be “powered down” for lower TDP where required – similar to what Intel has done with the ULV Core versions

While Kabini was a major update both CPU and GPU vs. Brazos, Mullins is a minor drop-in update adding just a few features while waiting for the next generation to take over:

  • Turbo: Where possible within power envelope Mullins can now Turbo to higher clocks.
  • Clock: Model replacements (e.g. A4-6000 vs. 5000) are clocked faster.
  • GPU: Core remains the same (GCN)

In this article we test (GP)GPU graphics unit performance; please see our other articles on:

Previous and next articles regarding AMD GPGPU performance:

Hardware Specifications

We are comparing the internal GPUs of the new AMD APU with the old version as well as its competition from Intel.

Graphics Unit CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comment
Graphics Core B-GT EV8 B-GT2Y EV8 GCN GCN There is no change in GPU core in Mullins, it appears to be a re-brand fromm 83XX to R4. But time does not stand still, so while Kabini went against BayTrail’s “crippled” EV7 (IvyBridge) GPU – Mullins must battle the “beefed-up” brand-new EV8 (Broadwell) GPU. We shall see if the old GCN core is enough…
APU / Processor Atom X7 Z8700 (CherryTrail) Core M 5Y10 (Broadwell-Y) A4-5000 (Kabini) A4-6000? (Mullins) The series has changed but not much else, not even the CPU core.
Cores (CU) / Shaders (SP) / Type 16C / 128SP (2×4 SIMD) 24C / 192SP (2×4 SIMD) 2C / 128SP 2C / 128SP [=] We still have 2 GCN Compute Units but now they go against 16 EV8 units rather than 4 EV7 units. You can see just how much Intel has improved the Atom GPGPU from generation to generation while AMD has not. Will this cost them dearly?
Speed (Min / Max / Turbo) MHz 200 – 600 200 – 800 266 – 496 266 – 500 [=] Nope, clock has not changed either in Mullins.
Power (TDP) W 2.4 (under 4) 4.5 15 15 [=] As before, Intel’s designs have a crushing advantage over AMD’s: both Kabini and Mullins are rated at least 3x (three times) higher power than Core M and as much as 5-6x more than new Atom. Powered-down versions (6W?) would still consume more while performing worse.
DirectX / OpenGL / OpenCL Support 11.1 (12?) / 4.3 / 1.2 11.1 (12?) / 4.3 / 2.0 11.2 (12?) / 4.5 / 1.2 11.2 (12?) / 4.5 / 2.0 GCN supports DirectX 11.2 (not a big deal) and also OpenCL 4.5 (vs 4.3 on Intel but including Compute) and OpenCL 2.0 (same). All designs should benefit from Windows 10’s DirectX 12. So while AMD supports newer versions of standards there’s not much in it.
FP16 / FP64 Support No / No (OpenCL), Yes (DirectX, OpenGL) No / No (OpenCL), Yes (DirectX, OpenGL) No / Yes No / Yes Sadly even AMD does not support FP16 (why?) but does support FP64 (double-float) in all interfaces – while Atom/Core GPU only in DirectX and OpenGL. Few people would elect to run heavy FP64 compute on these GPUs but it’s good to know it’s there…
Threads per CU 256 (256x256x256) 512 (512x512x512) 256 (256x256x256) 256 (256x256x256) GCN has traditionally not supported large number of threads-per-CPU (256) and here’s no different, with Intel’s GPU now supporting twice as many (512) – but whether this will make a difference remains to be seen.

GPGPU Performance

We are testing vectorised, crypto (including hash), financial and scientific GPGPU performance of the GPUs in OpenCL, DirectX ComputeShader and OpenGL ComputeShader (if supported).

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest AMD and Intel drivers (July 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comments
AMD Mullins: GPGPU Vectorised
GPGPU Arithmetic Single/Float/FP32 Vectorised OpenCL (Mpix/s) 160.5 181.7 165.3 163.9 [-1%] Straight off the bat, we see no change in score in Mullins; however, Atom has cought up – scoring within a whisker and Core M faster still (+13%). Not what AMD is used to seeing for sure.
GPGPU Arithmetic Half/Float/FP16 Vectorised OpenCL (Mpix/s) 160 180 165 163.9 [-1%] As FP16 is not supported by any of the GPUs, unsurprisingly the results don’t change.
GPGPU Arithmetic Double/FP64 Vectorised OpenCL (Mpix/s) 10.1 (emulated) 11.6 (emulated) 13.4 14 [+4%] We see a tiny 4% improvement in Mullins but due to native FP64 support it is almost 40% faster than both Intel GPUs.
GPGPU Arithmetic Quad/FP128 Vectorised OpenCL (Mpix/s) 1.08 (emulated) 1.32 (emulated) 0.731 (emulated) 0.763 [+4%] (emulated) No GPU supports FP128, but GCN can emulate it using FP64 while EV8 needs to use more complex FP32 maths. Again we see a 4% improvement in Mullins, but despite FP64 support both Intel GPUs are much faster. Sometimes FP64/FP32 ratio is so low that it’s not worth using FP64 and emulation can be faster (e.g nVidia).
AMD Mullins: GPGPU Crypto
GPGPU Crypto Benchmark AES256 Crypto OpenCL (MB/s) 825 770 998 1024 [+3%] In this tough integer workload that uses shared memory (as cache), Mullins only sees a 3% improvement. GCN shows its power being 25% faster than Intel’s GPUs – TDP notwhitstanding.
GPGPU Crypto Benchmark AES128 Crypto OpenCL (MB/s) 1106 ? 1280 1423 [+11%] With less rounds, Mullins is now 11% faster – finally a good improvement and again 28% faster than Atom’s GPU.
AMD Mullins: GPGPU Hash
GPGPU Crypto Benchmark SHA2-512 (int64) Hash OpenCL (MB/s) 309 ? 59 282 [+4.7x] This 64-bit integer compute-heavy wokload seems to have triggered a driver bug in Kabini since Mullins is almost 5x (five times) faster – perhaps 64-bit integer operations were emulated using int32 rather than native? Surprisingly Atom’s EV8 is faster (+9%) – not something we’d expect to see.
GPGPU Crypto Benchmark SHA2-256 (int32) Hash OpenCL (MB/s) 1187 1331 1618 1638 [+1%] In this integer compute-heavy workload, Mullins is just 1% faster (within margin of error) – which again proves GPU has not changed at all vs. older Kabini. At least it’s faster than both Intel GPUs, 38% faster than Atom’s.
GPGPU Crypto Benchmark SHA1 (int32) Hash OpenCL (MB/s) 2764 ? 2611 3256 [+24%] SHA1 is less compute-heavy but here we see a 24% Mullins improvement, again likely a driver “fix”. This allows it to beat Atom’s GPU 17% – showing that driver optimisations can make a big difference.
AMD Mullins: GPGPU Financial
GPGPU Finance Benchmark Black-Scholes FP32 OpenCL (MOPT/s) 299.8 280.5 248.3 326.7 [+31%] Starting with the financial tests, Mullins flies off with a 31% improvement over the old Kabini, with is just as well as both Intel GPUs are coming strong – it’s 9% faster than Atom’s GPU. One thing’s for sure, Intel’s EV8 GPU is no slouch.
GPGPU Finance Benchmark Black-Scholes FP64 OpenCL (MOPT/s) n/a (no FP64) n/a (no FP64) 21 21.2 [+1%] AMD’s GCN supports native FP64, but here Mullins is just 1% faster than Kabini (within margin of error), unable to replicate the FP32 improvement we saw.
GPGPU Finance Benchmark Binomial FP32 OpenCL (kOPT/s) 28 36.5 32.3 30.9 [-4%] Binomial is far more complex than Black-Scholes, involving many shared-memory operations (reduction) – and here Mullins somehow manages to be slower (-4%) – likely due to driver differences. Both Intel GPUs are coming strong, with Core M’s GPU 20% faster. Considering how fast the GCN shared memory is – we expected better.
GPGPU Finance Benchmark Binomial FP64 OpenCL (kOPT/s) n/a (no FP64) n/a (no FP64) 1.85 1.87 [+1%] Switching to FP64 on AMD’s GPUs, Mullins is now 1% faster (within margin of error). Luckily Intel’s GPUs do not support FP64.
GPGPU Finance Benchmark Monte-Carlo FP32 OpenCL (kOPT/s) 61.9 54.9 32.9 46.3 [+40%] Monte-Carlo is also more complex than Black-Scholes, also involving shared memory – but read-only; Mullins is 40% faster here (again driver change) – but surprisingly cannot match either Intel GPUs, with Atom’s GPU 32% faster! Again, we see just how much Intel has improved the GPU in Atom – perhaps too much!
GPGPU Finance Benchmark Monte-Carlo FP64 OpenCL (kOPT/s) n/a (no FP64) n/a (no FP64) 5.39 5.59 [+3%] Switching to FP64 we now see a little 3% improvement for Mullins, but better than the 1% we saw before…
AMD Mullins: GPGPU Scientific
GPGPU Science Benchmark SGEMM FP32 OpenCL (GFLOPS) 45 44.1 43.5 41.5 [-5%] GEMM is quite a tough algorithm for our GPUs and Mullins manages to be 5% slower than Kabini – agin this allows Intel’s GPUs to win, with Atom’s GPU just 8% faster – but a win is a win. Mullins’s GPU is starting to look underpowered considering the much higher TDP.
GPGPU Science Benchmark DGEMM FP64 OpenCL (GFLOPS) n/a (no FP64) n/a (no FP64) 4.11 3.73 [-9%] Swithing to FP64, Mullins now manages to be 5% slower than Kabini – thankfully Intel’s FPUs don’t support FP64…
GPGPU Science Benchmark SFFT FP32 OpenCL (GFLOPS) 9 8.94 7.89 9.5 [+20%] FFT involves many kernels processing data in a pipeline – and Mullins now manages to be 20% faster than Kabini – again, just as well as Intel’s GPUs are hot on its tail – and it is just 5% faster than Atom’s GPU!
GPGPU Science Benchmark DFFT FP64 OpenCL (GFLOPS) n/a (no FP64) n/a (no FP64) 2.2 3 [+36%] Switching to FP64, Mullins is now 36% faster than Kabini – again likely a driver improvement.
GPGPU Science Benchmark N-Body FP32 OpenCL (GFLOPS) 65 50 58 63 [+9%] In our last test we see Mullins is 9% faster – but not enough to beat Atom’s GPU which is 1% faster but faster still. Anybody expected that?
GPGPU Science Benchmark N-Body FP64 OpenCL (GFLOPS) n/a (no FP64) n/a (no FP64) 4.75 4.74 [=] Switching to FP64, Mullins scores exactly the same as Kabini.

Firstly, Mullins’s GPU scores are unchanged from Kabini; due to driver optimisations/fixes (as well as kernel optimisations) sometimes Mullins is faster but that’s not due to any hardware changes. If you were expecting more, you are to be disappointed.

Intel’s EV8 GPUs in the new Atom (CherryTrail) as well as Core M (Broadwell) now can keep up with it and even beat it in some tests. The crushing GPGPU advantage AMD’s APUs used to have is long gone. Considering the the TDP differences (4-5x higher) the Mullins’s GPU looks underpowered – the number of cores should at least been doubled to maintain its advantage.

Unless Atom (CherryTrail) is more expensive – there’s really no reason to choose Mullins, the power advantage of Atom is hard to be denied. The only positive for AMD is that Core M looks uncompetitive vs. Atom itself, but then again Intel’s 15W ULV designs are far more powerful.

Transcoding Performance

We are testing memory performance of GPUs using their hardware transcoders using popular media formats: H.264/MP4, AVC1, M.265/HEVC.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance. Lower values (ns, clocks) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers (June 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comments
H.264/MP4 Decoder/Encoder QuickSync H264 (hardware accelerated) QuickSync H264 (hardware accelerated) AMD h264Encoder (hardware accelerated) AMD h264Encoder (hardware accelerated) Both are using their own hardware-accelerated transcoders for H264.
AMD Mullins: Transcoding H264
Transocde Benchmark H264 > H264 Transcoding (MB/s) 5 ? 2 2.14 [+7%] We see a small but useful 7% bandwidth improvement in Mullins vs. Kabini, but even Atom is over 2x (twice) as fast.
Transocde Benchmark WMV > H264 Transcoding (MB/s) 4.75 ? 2 2.07 [+3.5%] When just using the H264 encoder we only see a small 3.5% bandwidth improvement in Mullins. Again, Atom is over 2x as fast.

We see a minor 3.5-7% improvement in Mullins, but the new Atom blows it out of the water – it is over twice as fast transcoding H.264! Poor Mullins/Kaveri are not even a good fit for HTPC (NUC/Brix) boxes…

GPGPU Memory Performance

We are testing memory performance of GPUs using OpenCL, DirectX ComputeShader and OpenGL ComputeShader (if supported), including transfer (up/down) to/from system memory and latency.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance. Lower values (ns, clocks) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers (June 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comments
Memory Configuration 4GB DDR3 1.6GHz 64/128-bit (shared with CPU) 4GB DDR3 1.6GHz 128-bit (shared with CPU) 4GB DDR3 1.6GHz 64-bit (shared with CPU) 4GB DDR3 1.6GHz 64-bit (shared with CPU) Except Core M, all APUs have a single memory controller, though Atom can also be configured with 2 channels.
Constant (kB) / Shared (kB) Memory 64 / 64 64 / 64 64 / 32 64 / 32 Surprisingly AMD’s GCN has 1/2 the shared memory of Intel’s EV8 (32 vs. 64) but considering the low number of threads-per-CU (256) only kernels making very heavy use of shared memory would be affected, still better more than less.
L1 / L2 / L3 Caches (kB) 256kB? L2 256kB? L2 16kB? L1 / 256kB? L2 16kB? L1 / 256kB? L2 Caches sizes are always pretty “hush hush” but since core has not changed, we would expect the same cache sizes – with GCN also sporting a L1 data cache.
AMD Mullins: GPGPU Memory BW
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 11 10.1 8.8 5.5 [-38%] OpenCL memory performance has surprisingly taken a bit hit in Mullins, most likely a driver bug. We shall see whether DirectX Compute is similarly affected.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 2.09 3.91 4.1 2.88 [-30%] Upload bandwidth is similarly affected, we measure a 30% decrease.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 2.29 3.79 3.18 2.9 [-9%] Upload bandwidth is the least affected, just 9% lower.
AMD Mullins: GPGPU Memory Latency
GPGPU Memory Latency Global Memory (In-Page Random) Latency (ns) 829 274 1973 693 [-1/3x] Even though the core is unchanged, latency is 1/3 of Kabini. Since we don’t see a comparative increase in performance, this again points to a driver issue.
GPGPU Memory Latency Global Memory (Full Random) Latency (ns) 1669 ? ? 817 Going out-of-page does not increase latency much.
GPGPU Memory Latency Global Memory (Sequential) Latency (ns) 279 ? ? 377 Sequential access brings the latency down to about 1/2, showing the prefetchers do a good job.
GPGPU Memory Latency Constant Memory Latency (ns) 209 ? 629 401 [-33%] The L1 (16kB) cache does not cover the whole constant memory (64kb) – and is not lower than global memory. There is no advantage to using constant memory.
GPGPU Memory Latency Shared Memory Latency (ns) 201 ? 20 16 [-20%] Shared memory is a little big faster (20% lower). We see that shared memory latency is much lower than constant/global lantency (16 vs. 401) – denoting dedicated shared memory. On Intel’s EV8 GPU there is no change (201 vs. 209) – which would indicate global memory used as shared memory.
GPGPU Memory Latency Texture Memory (In-Page Random) Latency (ns) 1234 ? 2369 691 [-70%] We see a massive latency reduction – again likely a driver optimisation/fix.
GPGPU Memory Latency Texture Memory (Sequential) Latency (ns) 353 ? ? 353 Sequential access brings the latency down to a quarter (1/4x) – showing the power of the prefetchers.

The optimisations in newer drivers make a big difference – though the same could apply to the previous gen (Kabini). The dedicated shared memory – compared to Intel’s GPUs – likely help GCN achieve its performance.

Shader Performance

We are testing shader performance of the GPUs in DirectX and OpenGL.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest AMD and Intel drivers (Jul 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comments
AMD Mullins: Video Shader
Video Shader Benchmark Single/Float/FP32 Vectorised DirectX (Mpix/s) 127.6 ? 128.8 129.6 [=] Starting with DirectX FP32, we see no change in Mullins – not even the DirectX driver has changed.
Video Shader Benchmark Single/Float/FP32 Vectorised OpenGL (Mpix/s) 121.8 172 124 124.4 [=] OpenGL does not change matters, Mullins scores exactly the same as its predecessor. But here we see Core M pulling ahead, an unexpected change.
Video Shader Benchmark Half/Float/FP16 Vectorised DirectX (Mpix/s) 109.5 ? 124 124 [=] As FP16 is not supported by any of the GPUs and promoted to FP32 the results don’t change.
Video Shader Benchmark Half/Float/FP16 Vectorised OpenGL (Mpix/s) 121.8 170 124 124 [=] As FP16 is not supported by any of the GPUs and promoted to FP32 the results don’t change either.
Video Shader Benchmark Double/FP64 Vectorised DirectX (Mpix/s) 18 ? 8.9 8.91 [=] Unlike OpenCL driver, DirectX Intel driver does support FP64 – which allows Atom’s GPU to be at least 2x (twice) as fast as Mullins/Kebini.
Video Shader Benchmark Double/FP64 Vectorised OpenGL (Mpix/s) 26 46 12 12 [=] As above, Intel OpenGL driver does support FP64 also – so all GPUs run native FP64 code again. This allows Atom’s GPU to be over 2x faster than Mullins/Kabini again – while Core M’s GPU is almost 4x (four times!) faster.
Video Shader Benchmark Quad/FP128 Vectorised DirectX (Mpix/s) 1.34 (emulated) ? 1.6 (emulated) 1.66 (emulated) [+3%] Here we’re emulating (mantissa extending) FP128 using FP64: EV8 stumbles a bit allowing Mullins/Kabini to be a little bit faster despite what we saw in FP64 test.
Video Shader Benchmark Quad/FP128 Vectorised OpenGL (Mpix/s) 1.1 (emulated) 3.4 (emulated) 0.738 (emulated) 0.738 (emulated) [=] OpenGL does change the results a bit, Atom’s GPU is now faster (+50%) while Core M’s GPU is far faster (+5x). Heavy shaders seem to take their toll on GCN’s GPU.

Unlike GPGPU, here Mullins scores exactly the same as Kabini – neither the DirectX nor OpenGL driver seem to make a difference. But what is different is that Intel’s GPUs support FP64 natively in both DirectX/OpenGL – making it much faster 3-5x than AMD’s GCN. If OpenCL driver were to support it – AMD woud be in trouble!

Shader Memory Performance

We are testing memory performance of GPUs using DirectX and OpenGL, including transfer (up/down) to/from system memory.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest AMD and Intel drivers (Jul 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comments
AMD Mullins: Video Bandwidth
Video Memory Benchmark Internal Memory Bandwidth (GB/s) 11.18 12.46 8 9.7 [+21%] DirectX bandwidth does not seem to be affected by the OpenCL “bug”, here we see Mullins having 21% more bandwidth than Kaveri using the very same memory. Perhaps the memory controller has seen some some improvements after all.
Video Memory Benchmark Upload Bandwidth (GB/s) 2.83 5.29 3 3.61 [+20%] While all APUs don’t need to transfer memory over PCIe like dedicated GPUs, they still don’t support “zero copy” – thus memory transfers are not free. Again Mullins does well with 20% more bandwidth.
Video Memory Benchmark Download Bandwidth (GB/s) 2.1 1.23 3 3.34 [+11%] Download bandwidth improves by 11% only, but better than nothing.

Unlike OpenCL, we see DirectX bandwidth increased by 11-20% – while using the same memory. Hopefuly AMD will “fix” the OpenCL issue which should help kernel performance no end.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Mullins’s GPU is unchanged from its predecessor (Kabini) but a few driver optimisations/fixes allow it to score better is many tests by a small margin – however these would also apply to the older devices. There isn’t really more to be said – nothing has really changed.

But time does not stand still – and now Intel’s EV8 GPU that powers the new Atom (CherryTrail) as well as Core M (Broadwell) is hot on its heels and even manages to beat it in some tests – not something we’re used to seeing in AMD’s APUs. Mullins’s GPU is looking underpowered.

If we now remember that Mullins’s TDP is 15W vs. Atom at 2.6-4W or Core M at 4.6W – it’s really not looking good for AMD: it’s CPU performance is unlikely to be much better than Atom’s (we shall see in CPU AMD A4 “Mullins” performance) – and at 3-5x (three to five times) more power woefully power inefficient.

Let’s hope that the next generation APUs (aka Mullins’ replacement) perform better.

Tagged , , , . Bookmark the permalink.

Comments are closed.