FP16 GPGPU Image Processing Performance & Quality

What is FP16 (“half”)?

FP16 (aka “half” floating-point) is the IEEE 754 half-precision (16-bit) floating-point format that has recently gained compute support in GPGPUs (e.g. Intel EV9+ Skylake GPU, nVidia Pascal), while CPU support is still limited to SIMD conversion instructions only (F16C). It was added to allow mobile devices (phones, tablets) to deliver higher performance (and thus save power on fixed workloads) in exchange for a small drop in quality on normal 8-bpc (24-bpp) images and video.

However, normal laptops and tablets with integrated graphics can benefit from FP16 support in the same way, due to their relatively low graphics compute power and the need to save power given the limited battery capacity of thin-and-light form factors.
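To make the "small drop in quality" claim concrete, here is a minimal sketch (not the benchmark's own code) that emulates FP16 storage on the CPU with Python's standard `struct` module, whose `e` format is IEEE 754 binary16. The helper names `to_half` and `roundtrip_8bit` are illustrative assumptions, as is the choice to normalise 8-bit levels to the range 0 to 1.

```python
import struct

def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def roundtrip_8bit(level: int) -> int:
    """Normalise an 8-bit level to [0, 1], store it as FP16, convert back."""
    return round(to_half(level / 255.0) * 255.0)

# Which of the 256 possible channel values change when stored as FP16?
mismatches = [v for v in range(256) if roundtrip_8bit(v) != v]
print("levels changed by FP16 storage:", len(mismatches))

# FP16 keeps 11 significant bits, so just below 1.0 the step is 2**-11.
print("FP16 step just below 1.0:", 1.0 - to_half(1.0 - 2.0 ** -11))
```

FP16's relative rounding error is at most 2^-11, which is roughly 0.125 of an 8-bit level, so merely storing 8-bpc data as FP16 should not change any level. Visible differences can only come from rounding that accumulates during the arithmetic of a filter.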

In this article we’re investigating the performance differences vs. standard FP32 (aka “single”) and the resulting quality difference (if any) for mobile GPGPUs (Intel’s EV9/9.5 SKL/KBL). See the previous articles for general performance comparison:

Intel Graphics GPGPU Performance

Image Processing Performance & Quality

We are testing the GPGPU performance of the GPUs in OpenCL and DirectX/OpenGL ComputeShader.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers (April 2017). Turbo / Dynamic Overclocking was enabled on all configurations.

Image Filter	FP32/Single	FP16/Half	Comments
Blur (3×3) Filter OpenCL (MPix/s)	481	967 [+2x]	We see a textbook 2x performance increase with no visible drop in quality.
Sharpen (5×5) Filter OpenCL (MPix/s)	107	331 [+3.1x]	Using FP16 yields over 3x the performance; a few more pixels change, though there is no visible difference.
Motion-Blur (7×7) Filter OpenCL (MPix/s)	112	325 [+2.9x]	Again an almost 3x performance increase with no visible quality difference. Result!
Edge Detection (2*5×5) Sobel OpenCL (MPix/s)	107	323 [+3.0x]	Again a roughly 3x performance increase with no visible quality difference.
Noise Removal (5×5) Median OpenCL (MPix/s)	5.41	5.67 [+4%]	No image difference at all, but also almost no performance increase: a measly 4%.
Oil Painting Quantise OpenCL (MPix/s)	4.7	13.48 [+2.86x]	We’re back to an almost 2.9x performance increase, with a few more differences than we’ve seen so far, though quality seems acceptable.
Diffusion Randomise OpenCL (MPix/s)	1188	1210 [+2%]	Because random-number generation uses 64-bit integer processing, the performance difference is minimal, and the picture quality is not acceptable.
Marbling Perlin Noise 2D OpenCL (MPix/s)	470	508 [+8%]	Again, due to Perlin noise generation, we see almost no performance gain but a big drop in image quality: not worth it.
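The "changed pixels" comparisons in the table can be imitated on the CPU. The following is a rough sketch under stated assumptions, not the benchmark's OpenCL kernels: it runs a 3×3 box blur twice, once in Python's double precision (standing in for the FP32 reference) and once with every intermediate rounded to FP16 via `struct`, then counts the output pixels that differ. The test pattern, the clamped borders and the function names are all hypothetical.

```python
import struct

def to_half(x: float) -> float:
    """Round to the nearest IEEE 754 binary16 value (emulates FP16)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def box_blur_3x3(img, half=False):
    """3x3 box blur on a 2D list of 8-bit levels; borders are clamped.

    With half=True every intermediate is rounded to FP16, mimicking a
    kernel that loads, accumulates and stores in half precision.
    """
    q = to_half if half else (lambda v: v)
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc = q(acc + q(img[yy][xx] / 255.0))
            out[y][x] = round(q(acc / 9.0) * 255.0)
    return out

# Hypothetical 32x32 test pattern standing in for a real photograph.
image = [[(x * 37 + y * 91) % 256 for x in range(32)] for y in range(32)]
ref = box_blur_3x3(image, half=False)
fp16 = box_blur_3x3(image, half=True)
diffs = [abs(a - b) for ra, rb in zip(ref, fp16) for a, b in zip(ra, rb)]
print("changed pixels:", sum(d > 0 for d in diffs), "of", len(diffs))
print("largest change (8-bit levels):", max(diffs))
```

With only nine taps the accumulated rounding error stays below one 8-bit level, so any changed pixel differs by at most one level. That is consistent with differences that can be counted but not seen.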

Other Image Processing Related Algorithms

Algorithm	FP16/Half	FP32/Single	FP64/Double	Comments
GEMM OpenCL (GFLOPS)	178 [+50%]	118	35	Dropping to FP16 gives us 50% more performance: not as good as 2x but still a significant increase.
FFT OpenCL (GFLOPS)	34 [+70%]	20	5.4	With FFT we are now 70% faster, closer to the 2x (100%) promised.
N-Body OpenCL (GFLOPS)	297 [+49%]	199	35	Again we drop to “just” 50% faster with FP16, but that is still a great performance improvement.
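For reference, the bracketed gains can be recomputed from the published GFLOPS figures. This small sketch copies the numbers from the table; the `gain_percent` helper is an illustrative name, and small deviations from the bracketed values are expected because the table rounds them.

```python
# GFLOPS figures copied from the table above: (FP16, FP32, FP64).
results = {
    "GEMM": (178.0, 118.0, 35.0),
    "FFT": (34.0, 20.0, 5.4),
    "N-Body": (297.0, 199.0, 35.0),
}

def gain_percent(new: float, base: float) -> float:
    """Percentage throughput gain of `new` over `base`."""
    return (new / base - 1.0) * 100.0

for name, (fp16, fp32, fp64) in results.items():
    print(f"{name}: FP16 vs FP32 {gain_percent(fp16, fp32):+.0f}%, "
          f"FP32 vs FP64 {fp32 / fp64:.1f}x")
```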

Final Thoughts / Conclusions

For many image processing filters (Blur, Sharpen, Motion-Blur, Sobel/Edge-Detection, Oil Painting) we see a huge 2-3x performance increase, matching or exceeding the 2x we had hoped for, with little or no image quality degradation. Median/De-Noise is the exception: it shows no quality loss but also gains only 4%. Thus FP16 support is very useful and should be used when supported.

However, for complex filters (Diffusion, Marble/Perlin Noise) the drop in quality is not acceptable given the minor performance increase (2-8%); promoting more data items from FP16 to FP32 to improve quality would reduce performance further, making the whole endeavour pointless.
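The article does not say which intermediate values lose precision in these filters. One plausible contributor, offered here purely as an assumption, is that quantities such as pixel or noise-lattice coordinates need more than FP16's 11 significant bits. The sketch below uses the standard `struct` and `math` modules to show how coarse FP16 becomes at typical image coordinates; `half_spacing` is a hypothetical helper valid for normal (non-denormal) FP16 magnitudes.

```python
import math
import struct

def to_half(x: float) -> float:
    """Round to the nearest IEEE 754 binary16 value (emulates FP16)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def half_spacing(x: float) -> float:
    """Gap between adjacent FP16 values near a positive, normal-range x."""
    _, exponent = math.frexp(x)        # x = m * 2**exponent, 0.5 <= m < 1
    return 2.0 ** (exponent - 11)      # 10 stored fraction bits in FP16

# Assumed sample coordinates with a fractional part, as a noise
# function might see them when sampling between lattice points.
for coord in (100.25, 1000.25, 3000.25):
    print(f"{coord} -> {to_half(coord)} (FP16 spacing {half_spacing(coord)})")
```

Above 2048, FP16 cannot even represent every integer, so any such value would have to stay in FP32. That is in line with the observation above that the data items which must be promoted for quality also erase the performance gain.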

For those algorithms that do benefit from FP16, the performance improvement is very much worth it, so FP16 support is very useful indeed.
