Abstract
The runtime of a Lattice QCD simulation is dominated by a small kernel that
computes the product of a sparse matrix, known as the "Dslash" operator, with
a vector. This kernel is therefore frequently optimized for various HPC
architectures. In this contribution we compare the performance of the Intel
Xeon Phi to current Kepler-based NVIDIA Tesla GPUs running a conjugate gradient
solver. By exposing more parallelism to the accelerator through inverting
multiple vectors at the same time, we obtain a performance of 250 GFlop/s on both
architectures. This more than doubles the performance of the inversions. We
give a short overview of both architectures and discuss some details of the
implementation as well as the effort required to achieve this performance.
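The gain from inverting multiple vectors simultaneously comes from data reuse: the matrix is loaded from memory once per application, but each entry is used for every right-hand side, raising the arithmetic intensity of an otherwise memory-bound kernel. Below is a minimal CPU-side sketch of this multiple-right-hand-sides pattern using a hypothetical generic CSR matrix; CsrMatrix, spmv_multi_rhs, and NRHS are illustrative assumptions, not the paper's actual Dslash implementation:

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t NRHS = 4;  // number of right-hand sides handled at once (assumption)

// Generic sparse matrix in CSR format; a stand-in for the Dslash operator.
struct CsrMatrix {
    std::size_t rows;
    std::vector<std::size_t> rowPtr;  // rows + 1 offsets into colIdx/val
    std::vector<std::size_t> colIdx;  // column index of each nonzero
    std::vector<float>       val;     // value of each nonzero
};

// y[r][k] = sum_j A(r,j) * x[j][k] for all NRHS vectors k at once.
// Each matrix entry is loaded once from memory but used NRHS times,
// which is what raises the arithmetic intensity of the kernel.
void spmv_multi_rhs(const CsrMatrix& A,
                    const std::vector<std::array<float, NRHS>>& x,
                    std::vector<std::array<float, NRHS>>& y)
{
    y.assign(A.rows, {});
    for (std::size_t r = 0; r < A.rows; ++r) {
        std::array<float, NRHS> acc{};                 // one accumulator per vector
        for (std::size_t n = A.rowPtr[r]; n < A.rowPtr[r + 1]; ++n) {
            const float a = A.val[n];                  // loaded once ...
            const std::array<float, NRHS>& xs = x[A.colIdx[n]];
            for (std::size_t k = 0; k < NRHS; ++k)
                acc[k] += a * xs[k];                   // ... reused NRHS times
        }
        y[r] = acc;
    }
}
```

Blocking the right-hand sides into a std::array keeps the NRHS values of each source element contiguous in memory, a layout that maps naturally onto both the wide SIMD units of the Xeon Phi and the SIMT execution model of the GPU.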