J. Shen, J. Fang, H. Sips, and A. Varbanescu. Parallel, Distributed and Network-Based Processing (PDP),
2013 21st Euromicro International Conference on, pages 38--45. Washington, DC, USA, IEEE Computer Society, (February 2013)
Abstract
With its design concept of cross-platform portability, OpenCL
can be used not only on GPUs (for which it is quite popular),
but also on CPUs. Whether porting GPU programs to CPUs, or
simply writing new code for CPUs, using OpenCL brings up the
performance issue, usually raised in one of two forms: ``OpenCL
is not performance portable!'' or ``Why using OpenCL for CPUs
after all?!''. We argue that both issues can be addressed by a
thorough study of the factors that impact the performance of
OpenCL on CPUs. This analysis is the focus of this paper.
Specifically, starting from the two main architectural
mismatches between many-core CPUs and the OpenCL platform
(parallelism granularity and the memory model), we
identify eight such performance ``traps'' that lead to
performance degradation in OpenCL for CPUs. Using multiple code
examples, from both synthetic and real-life benchmarks, we
quantify the impact of these traps, showing how avoiding them
can give up to 10 times better performance. Furthermore, we
point out that the solutions we provide for avoiding these traps
are simple and generic code transformations, which can be easily
adopted by either programmers or automated tools. Therefore, we
conclude that a certain degree of OpenCL inter-platform
performance portability, while indeed not a given, can be
achieved by simple and generic code transformations.
%0 Conference Paper
%1 Shen2013-yf
%A Shen, Jie
%A Fang, Jianbin
%A Sips, H
%A Varbanescu, A L
%B Parallel, Distributed and Network-Based Processing (PDP),
2013 21st Euromicro International Conference on
%C Washington, DC, USA
%D 2013
%I IEEE Computer Society
%K Benchmark_testing Data_transfer Expose GPU_program Hardware Kernel Many-core_CPUs OpenCL OpenCL_interplatform Parallel_processing Performance_portability cross-platform_portability electronic_data_interchange generic_code_transformations graphics_processing_units main_architectural_mismatches many-core_CPU memory_model multiple_code_examples multiprocessing_systems open_systems parallel_architectures parallelism_granularity performance_degradation performance_evaluation performance_traps program_compilers real-life_benchmarks
%P 38--45
%T Performance Traps in OpenCL for CPUs
%X With its design concept of cross-platform portability, OpenCL
can be used not only on GPUs (for which it is quite popular),
but also on CPUs. Whether porting GPU programs to CPUs, or
simply writing new code for CPUs, using OpenCL brings up the
performance issue, usually raised in one of two forms: ``OpenCL
is not performance portable!'' or ``Why using OpenCL for CPUs
after all?!''. We argue that both issues can be addressed by a
thorough study of the factors that impact the performance of
OpenCL on CPUs. This analysis is the focus of this paper.
Specifically, starting from the two main architectural
mismatches between many-core CPUs and the OpenCL platform
(parallelism granularity and the memory model), we
identify eight such performance ``traps'' that lead to
performance degradation in OpenCL for CPUs. Using multiple code
examples, from both synthetic and real-life benchmarks, we
quantify the impact of these traps, showing how avoiding them
can give up to 10 times better performance. Furthermore, we
point out that the solutions we provide for avoiding these traps
are simple and generic code transformations, which can be easily
adopted by either programmers or automated tools. Therefore, we
conclude that a certain degree of OpenCL inter-platform
performance portability, while indeed not a given, can be
achieved by simple and generic code transformations.
@inproceedings{Shen2013-yf,
abstract = {With its design concept of cross-platform portability, OpenCL
can be used not only on GPUs (for which it is quite popular),
but also on CPUs. Whether porting GPU programs to CPUs, or
simply writing new code for CPUs, using OpenCL brings up the
performance issue, usually raised in one of two forms: ``OpenCL
is not performance portable!'' or ``Why using OpenCL for CPUs
after all?!''. We argue that both issues can be addressed by a
thorough study of the factors that impact the performance of
OpenCL on CPUs. This analysis is the focus of this paper.
Specifically, starting from the two main architectural
mismatches between many-core CPUs and the OpenCL platform
(parallelism granularity and the memory model), we
identify eight such performance ``traps'' that lead to
performance degradation in OpenCL for CPUs. Using multiple code
examples, from both synthetic and real-life benchmarks, we
quantify the impact of these traps, showing how avoiding them
can give up to 10 times better performance. Furthermore, we
point out that the solutions we provide for avoiding these traps
are simple and generic code transformations, which can be easily
adopted by either programmers or automated tools. Therefore, we
conclude that a certain degree of OpenCL inter-platform
performance portability, while indeed not a given, can be
achieved by simple and generic code transformations.},
added-at = {2015-04-10T18:02:47.000+0200},
address = {Washington, DC, USA},
author = {Shen, Jie and Fang, Jianbin and Sips, H and Varbanescu, A L},
biburl = {https://www.bibsonomy.org/bibtex/23ab90e89ccaf536c8739c312f015852b/christophv},
booktitle = {Parallel, Distributed and {Network-Based} Processing ({PDP}),
2013 21st Euromicro International Conference on},
interhash = {05b223ae0c798f987d39e69cc8a52c25},
intrahash = {3ab90e89ccaf536c8739c312f015852b},
keywords = {Benchmark_testing Data_transfer Expose GPU_program Hardware Kernel Many-core_CPUs OpenCL OpenCL_interplatform Parallel_processing Performance_portability cross-platform_portability electronic_data_interchange generic_code_transformations graphics_processing_units main_architectural_mismatches many-core_CPU memory_model multiple_code_examples multiprocessing_systems open_systems parallel_architectures parallelism_granularity performance_degradation performance_evaluation performance_traps program_compilers real-life_benchmarks},
month = feb,
pages = {38--45},
publisher = {IEEE Computer Society},
series = {PDP '13},
timestamp = {2016-01-04T14:22:08.000+0100},
title = {Performance Traps in {OpenCL} for {CPUs}},
year = 2013
}