Abstract
Investigations into near-memory hardware accelerators for deep neural
networks have primarily focused on inference, while the potential of
accelerating training has received relatively little attention so far. Based on
an in-depth analysis of the key computational patterns in state-of-the-art
gradient-based training methods, we propose an efficient near-memory
acceleration engine called NTX that can be used to train state-of-the-art deep
convolutional neural networks at scale. Our main contributions are: (i)
identifying the requirements for efficient data address generation and
developing an accelerator offloading scheme that reduces overhead by 7x over
previously published results; and (ii) supporting a rich set of operations that
allows efficient calculation of the back-propagation phase. The low control
overhead allows up to 8 NTX engines to be controlled by a simple processor.
Evaluations in a near-memory computing scenario where the accelerator is placed
on the logic base die of a Hybrid Memory Cube demonstrate a 2.6x energy
efficiency improvement over contemporary GPUs at 4.4x less silicon area, and an
average compute performance of 1.01 Tflop/s for training large state-of-the-art
networks with full floating-point precision. The architecture is scalable and
paves the way towards efficient deep learning in a distributed near-memory
setting.