Abstract
The impact of gradient noise on training deep models is widely acknowledged
but not well understood. In this context, we study the distribution of
gradients during training. We introduce a method, Gradient Clustering, that
minimizes the variance of the average mini-batch gradient via stratified
sampling. We prove that the variance of the average mini-batch gradient is
minimized if the mini-batch elements are sampled from a weighted clustering in
the gradient space. We measure gradient variance on common deep learning
benchmarks and observe that, contrary to common assumptions, gradient variance
increases during training and smaller learning rates coincide with higher
variance. In addition, we introduce normalized gradient variance as a
statistic that correlates better with the speed of convergence than gradient
variance does.
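A minimal sketch of the stratified-sampling idea described above, assuming per-example gradients are available as rows of a NumPy array. The function name stratified_minibatch, the use of k-means for the clustering step, and the proportional per-cluster allocation are illustrative assumptions, not the paper's exact algorithm.

import numpy as np
from sklearn.cluster import KMeans

def stratified_minibatch(grads, batch_size, n_clusters=8, rng=None):
    # Cluster examples in gradient space, then sample each cluster
    # (stratum) in proportion to its size. This is plain proportional
    # allocation; the paper's weighted clustering may allocate differently.
    rng = rng or np.random.default_rng()
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(grads)
    indices = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        if len(members) == 0:
            continue
        n_c = max(1, round(batch_size * len(members) / len(grads)))
        indices.extend(rng.choice(members, size=min(n_c, len(members)),
                                  replace=False))
    return np.asarray(indices[:batch_size])

# Example: 1000 examples with 50-dimensional gradients.
grads = np.random.randn(1000, 50)
batch_indices = stratified_minibatch(grads, batch_size=64)

Averaging gradients over such a batch keeps every region of the gradient distribution represented, which is the standard mechanism by which stratified sampling can reduce the variance of the mini-batch mean relative to uniform sampling.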
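The normalized statistic mentioned last could plausibly be computed as the total gradient variance divided by the squared norm of the mean gradient; this definition is an assumption for illustration, and the paper's exact normalization may differ.

def normalized_gradient_variance(grads):
    # E||g - E[g]||^2 / ||E[g]||^2: a scale-invariant variant of the
    # raw gradient variance (assumed definition, not the paper's).
    mean_g = grads.mean(axis=0)
    total_var = ((grads - mean_g) ** 2).sum(axis=1).mean()
    return total_var / (np.dot(mean_g, mean_g) + 1e-12)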