Performance Traps in OpenCL for CPUs

J. Shen, J. Fang, H. Sips, and A. Varbanescu. Parallel, Distributed and Network-Based Processing (PDP), 2013 21st Euromicro International Conference on, pages 38--45. Washington, DC, USA, IEEE Computer Society, (February 2013)

Abstract

With its design concept of cross-platform portability, OpenCL can be used not only on GPUs (for which it is quite popular), but also on CPUs. Whether porting GPU programs to CPUs, or simply writing new code for CPUs, using OpenCL brings up the performance issue, usually raised in one of two forms: "OpenCL is not performance portable!" or "Why using OpenCL for CPUs after all?!". We argue that both issues can be addressed by a thorough study of the factors that impact the performance of OpenCL on CPUs. This analysis is the focus of this paper. Specifically, starting from the two main architectural mismatches between many-core CPUs and the OpenCL platform (parallelism granularity and the memory model), we identify eight such performance "traps" that lead to performance degradation in OpenCL for CPUs. Using multiple code examples, from both synthetic and real-life benchmarks, we quantify the impact of these traps, showing how avoiding them can give up to 10 times better performance. Furthermore, we point out that the solutions we provide for avoiding these traps are simple and generic code transformations, which can be easily adopted by either programmers or automated tools. Therefore, we conclude that a certain degree of OpenCL inter-platform performance portability, while indeed not a given, can be achieved by simple and generic code transformations.

Links and resources

BibTeX key: Shen2013-yf
entry type: inproceedings
address: Washington, DC, USA
booktitle: Parallel, Distributed and Network-Based Processing (PDP), 2013 21st Euromicro International Conference on
year: 2013
month: feb
pages: 38--45
publisher: IEEE Computer Society
series: PDP '13

Cite this publication

%0 Conference Paper
%1 Shen2013-yf
%A Shen, Jie
%A Fang, Jianbin
%A Sips, H
%A Varbanescu, A L
%B Parallel, Distributed and Network-Based Processing (PDP), 2013 21st Euromicro International Conference on
%C Washington, DC, USA
%D 2013
%I IEEE Computer Society
%K Benchmark_testing Data_transfer Expose GPU_program Hardware Kernel Many-core_CPUs OpenCL OpenCL_interplatform Parallel_processing Performance_portability cross-platform_portability electronic_data_interchange generic_code_transformations graphics_processing_units main_architectural_mismatches many-core_CPU memory_model multiple_code_examples multiprocessing_systems open_systems parallel_architectures parallelism_granularity performance_degradation performance_evaluation performance_traps program_compilers real-life_benchmarks
%P 38--45
%T Performance Traps in OpenCL for CPUs
%X With its design concept of cross-platform portability, OpenCL can be used not only on GPUs (for which it is quite popular), but also on CPUs. Whether porting GPU programs to CPUs, or simply writing new code for CPUs, using OpenCL brings up the performance issue, usually raised in one of two forms: "OpenCL is not performance portable!" or "Why using OpenCL for CPUs after all?!". We argue that both issues can be addressed by a thorough study of the factors that impact the performance of OpenCL on CPUs. This analysis is the focus of this paper. Specifically, starting from the two main architectural mismatches between many-core CPUs and the OpenCL platform (parallelism granularity and the memory model), we identify eight such performance "traps" that lead to performance degradation in OpenCL for CPUs. Using multiple code examples, from both synthetic and real-life benchmarks, we quantify the impact of these traps, showing how avoiding them can give up to 10 times better performance. Furthermore, we point out that the solutions we provide for avoiding these traps are simple and generic code transformations, which can be easily adopted by either programmers or automated tools. Therefore, we conclude that a certain degree of OpenCL inter-platform performance portability, while indeed not a given, can be achieved by simple and generic code transformations.

@inproceedings{Shen2013-yf,
  abstract = {With its design concept of cross-platform portability, OpenCL can be used not only on GPUs (for which it is quite popular), but also on CPUs. Whether porting GPU programs to CPUs, or simply writing new code for CPUs, using OpenCL brings up the performance issue, usually raised in one of two forms: ``OpenCL is not performance portable!'' or ``Why using OpenCL for CPUs after all?!''. We argue that both issues can be addressed by a thorough study of the factors that impact the performance of OpenCL on CPUs. This analysis is the focus of this paper. Specifically, starting from the two main architectural mismatches between many-core CPUs and the OpenCL platform (parallelism granularity and the memory model), we identify eight such performance ``traps'' that lead to performance degradation in OpenCL for CPUs. Using multiple code examples, from both synthetic and real-life benchmarks, we quantify the impact of these traps, showing how avoiding them can give up to 10 times better performance. Furthermore, we point out that the solutions we provide for avoiding these traps are simple and generic code transformations, which can be easily adopted by either programmers or automated tools. Therefore, we conclude that a certain degree of OpenCL inter-platform performance portability, while indeed not a given, can be achieved by simple and generic code transformations.},
  added-at = {2015-04-10T18:02:47.000+0200},
  address = {Washington, DC, USA},
  author = {Shen, Jie and Fang, Jianbin and Sips, H and Varbanescu, A L},
  biburl = {https://www.bibsonomy.org/bibtex/23ab90e89ccaf536c8739c312f015852b/christophv},
  booktitle = {Parallel, Distributed and {Network-Based} Processing ({PDP}), 2013 21st Euromicro International Conference on},
  interhash = {05b223ae0c798f987d39e69cc8a52c25},
  intrahash = {3ab90e89ccaf536c8739c312f015852b},
  keywords = {Benchmark_testing Data_transfer Expose GPU_program Hardware Kernel Many-core_CPUs OpenCL OpenCL_interplatform Parallel_processing Performance_portability cross-platform_portability electronic_data_interchange generic_code_transformations graphics_processing_units main_architectural_mismatches many-core_CPU memory_model multiple_code_examples multiprocessing_systems open_systems parallel_architectures parallelism_granularity performance_degradation performance_evaluation performance_traps program_compilers real-life_benchmarks},
  month = feb,
  pages = {38--45},
  publisher = {IEEE Computer Society},
  series = {PDP '13},
  timestamp = {2016-01-04T14:22:08.000+0100},
  title = {Performance Traps in {OpenCL} for {CPUs}},
  year = 2013
}
