Abstract
We consider learning two-layer neural networks using stochastic gradient
descent. The mean-field description of this learning dynamics approximates the
evolution of the network weights by an evolution in the space of probability
distributions in $\mathbb{R}^D$ (where $D$ is the number of parameters associated with
each neuron). This evolution can be defined through a partial differential
equation or, equivalently, as the gradient flow in the Wasserstein space of
probability distributions. Earlier work shows that, under some regularity
assumptions, the mean-field description is accurate as soon as the number of
hidden units is much larger than the dimension $D$. In this paper we establish
stronger and more general approximation guarantees. First of all, we show that
the number of hidden units only needs to be larger than a quantity dependent on
the regularity properties of the data, and independent of the dimension $D$. Next,
we generalize this analysis to the case of unbounded activation functions,
which was not covered by earlier bounds. We extend our results to noisy
stochastic gradient descent.
Finally, we show that kernel ridge regression can be recovered as a special
limit of the mean-field analysis.
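
For concreteness, in the standard mean-field formulation (the notation below is not introduced in this abstract and serves only as an illustrative sketch), a network with $N$ hidden units computes $\hat f(x; \boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \sigma_*(x; \theta_i)$ with per-neuron parameters $\theta_i \in \mathbb{R}^D$, and the mean-field dynamics evolves a distribution $\rho_t$ over single-neuron parameters according to
$$
\partial_t \rho_t = \nabla_\theta \cdot \big( \rho_t \, \nabla_\theta \Psi(\theta; \rho_t) \big),
\qquad
\Psi(\theta; \rho) = V(\theta) + \int U(\theta, \theta') \, \rho(\mathrm{d}\theta'),
$$
up to a time reparametrization set by the step size, where the potentials $V$ and $U$ encode the data distribution and the activation $\sigma_*$. This PDE is the Wasserstein-$2$ gradient flow of the population risk viewed as a functional of $\rho$; noisy SGD adds a diffusion (Laplacian) term to the right-hand side.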