Abstract
As the complexity of machines and architectures has increased,
performance tuning has become more challenging, leading to the
failure of general compilers to generate the best possible
optimized code. Expert performance programmers can often
hand-write code that outperforms compiler-optimized low-level
code by an order of magnitude. At the same time, the complexity
of programs has also increased, with modern programs built on a
variety of abstraction layers to manage complexity, yet these
layers hinder efforts at optimization. In fact, it is common to
lose one or two additional orders of magnitude in performance
when going from a low-level language such as Fortran or C to a
high-level language like Python, Ruby, or Matlab. General-purpose
compilers are limited by the inability of program analysis to
determine programmer intent, as well as the lack of detailed
performance models that always determine the best executable code
for a given computation and architecture. The latter problem can
be mitigated through auto-tuning, which generates many code
variants for a particular problem and empirically determines
which performs best on a given architecture. This thesis
addresses the problem of how to write programs at a high level
while obtaining the performance of code written by performance
experts at the low level. To do so, we build domain-specific
embedded languages that generate low-level parallel code from a
high-level language, and then use auto-tuning to determine the
best performing low-level code. Such DSELs avoid analysis by
restricting the domain while ensuring programmers specify
high-level intent, and by performing empirical auto-tuning
instead of modeling machine parameters. As a result, programmers
write in high-level languages with portions of their code using
DSELs, yet obtain performance equivalent to the best
hand-optimized low-level code, across many architectures. We
present a methodology for building such auto-tuned DSELs, as well
as a software infrastructure and example DSELs using the
infrastructure, including a DSEL for structured grid computations
and two DSELs for graph algorithms. The structured grid DSEL
obtains over 80% of peak performance for a variety of benchmark
kernels across different architectures, while the graph algorithm
DSELs mitigate all performance loss due to using a high-level
language. Overall, the methodology, infrastructure, and example
DSELs point to a promising new direction for obtaining high
performance while programming in a high-level language.
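
To make the auto-tuning idea concrete, the following is a minimal sketch of the empirical selection step described above: several variants of the same computation are timed on the current machine and the fastest one is kept. It is an illustration only, not the infrastructure presented in the thesis; the function names (`autotune`, `sum_squares_*`), the toy computation, and the timing parameters are all assumptions chosen for brevity, and a real DSEL would generate low-level parallel variants rather than Python functions.

```python
import timeit


# Three hypothetical variants of one computation: sum of i*i for 0 <= i < n.
def sum_squares_loop(n):
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_squares_generator(n):
    return sum(i * i for i in range(n))


def sum_squares_closed_form(n):
    # Closed form of the same sum: (n - 1) * n * (2n - 1) / 6.
    return (n - 1) * n * (2 * n - 1) // 6


VARIANTS = {
    "loop": sum_squares_loop,
    "generator": sum_squares_generator,
    "closed_form": sum_squares_closed_form,
}


def autotune(variants, arg, repeats=3, number=10):
    """Time every variant on this machine and return (best_name, timings).

    Variants must agree on the result; a mismatch raises ValueError so that
    a fast but incorrect variant can never be selected.
    """
    reference = None
    timings = {}
    for name, fn in variants.items():
        result = fn(arg)
        if reference is None:
            reference = result
        elif result != reference:
            raise ValueError(f"variant {name!r} disagrees with the reference")
        # Take the minimum over several repeats to reduce timing noise.
        timings[name] = min(
            timeit.repeat(lambda: fn(arg), repeat=repeats, number=number)
        )
    best_name = min(timings, key=timings.get)
    return best_name, timings


if __name__ == "__main__":
    best, timings = autotune(VARIANTS, 100_000)
    for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
        print(f"{name:12s} {seconds:.6f} s")
    print("selected variant:", best)
```

Which variant wins depends on the machine and the input size, which is the point of measuring rather than relying on a performance model.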