Article

On the Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks

U. Şimşekli, M. Gürbüzbalaban, T. Nguyen, G. Richard, and L. Sagun.
(2019). cite arxiv:1912.00018. Comment: 32 pages. arXiv admin note: substantial text overlap with arXiv:1901.06053.

Abstract

The gradient noise (GN) in the stochastic gradient descent (SGD) algorithm is often considered to be Gaussian in the large data regime by assuming that the classical central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion. We argue that the Gaussianity assumption might fail to hold in deep learning settings and hence render the Brownian motion-based analyses inappropriate. Inspired by non-Gaussian natural phenomena, we consider the GN in a more general context and invoke the generalized CLT, which suggests that the GN converges to a heavy-tailed $\alpha$-stable random vector, where tail-index $\alpha$ determines the heavy-tailedness of the distribution. Accordingly, we propose to analyze SGD as a discretization of an SDE driven by a Lévy motion. Such SDEs can incur `jumps', which force the SDE and its discretization transition from narrow minima to wider minima, as proven by existing metastability theory and the extensions that we proved recently. In this study, under the $\alpha$-stable GN assumption, we further establish an explicit connection between the convergence rate of SGD to a local minimum and the tail-index $\alpha$. To validate the $\alpha$-stable assumption, we conduct experiments on common deep learning scenarios and show that in all settings, the GN is highly non-Gaussian and admits heavy-tails. We investigate the tail behavior in varying network architectures and sizes, loss functions, and datasets. Our results open up a different perspective and shed more light on the belief that SGD prefers wide minima.
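The contrast the abstract draws between Gaussian and heavy-tailed $\alpha$-stable noise can be illustrated numerically. The following is a minimal sketch, not the authors' code: it assumes a standard symmetric $\alpha$-stable law sampled with the Chambers-Mallows-Stuck method, and an Euler discretization of a one-dimensional Lévy-driven SDE in which each increment is scaled by `step ** (1 / alpha)`. The function names, the quadratic test objective in the usage note, and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def sample_symmetric_alpha_stable(alpha, size, rng):
    """Draw standard symmetric alpha-stable samples (Chambers-Mallows-Stuck).

    alpha = 2 gives a Gaussian with variance 2; alpha = 1 gives a Cauchy law.
    """
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)  # uniform angle
    w = rng.exponential(1.0, size)                # unit-rate exponential
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))


def tail_fraction(alpha, threshold=10.0, n=100_000, seed=0):
    """Fraction of samples whose magnitude exceeds `threshold`."""
    rng = np.random.default_rng(seed)
    x = sample_symmetric_alpha_stable(alpha, n, rng)
    return float(np.mean(np.abs(x) > threshold))


def levy_driven_descent(alpha, grad, x0=0.0, step=0.01, sigma=0.5,
                        n_steps=1000, seed=0):
    """Euler scheme for dX = -grad(X) dt + sigma dL^alpha (illustrative only).

    An alpha-stable Levy increment over a time step h is distributed as
    h ** (1 / alpha) times a standard alpha-stable variable.
    """
    rng = np.random.default_rng(seed)
    noise = sample_symmetric_alpha_stable(alpha, n_steps, rng)
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        x[k + 1] = x[k] - step * grad(x[k]) + sigma * step ** (1.0 / alpha) * noise[k]
    return x


if __name__ == "__main__":
    for a in (2.0, 1.5):
        path = levy_driven_descent(a, grad=lambda x: x)  # f(x) = x^2 / 2
        print(a, tail_fraction(a), np.max(np.abs(np.diff(path))))
```

Running the script prints, for each tail index, the fraction of noise samples beyond the threshold and the largest single-step move of the toy trajectory. Under these assumptions the smaller tail index should produce many more extreme samples, which is the kind of 'jump' behavior the abstract refers to.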

BibTeX key: simsekli2019heavytailed
Entry type: article
Year: 2019
URL: http://arxiv.org/abs/1912.00018
Note: cite arxiv:1912.00018. Comment: 32 pages. arXiv admin note: substantial text overlap with arXiv:1901.06053


Cite this publication

%0 Journal Article
%1 simsekli2019heavytailed
%A Şimşekli, Umut
%A Gürbüzbalaban, Mert
%A Nguyen, Thanh Huy
%A Richard, Gaël
%A Sagun, Levent
%D 2019
%K bounds deep-learning optimization readings
%T On the Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks
%U http://arxiv.org/abs/1912.00018
%X The gradient noise (GN) in the stochastic gradient descent (SGD) algorithm is often considered to be Gaussian in the large data regime by assuming that the classical central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion. We argue that the Gaussianity assumption might fail to hold in deep learning settings and hence render the Brownian motion-based analyses inappropriate. Inspired by non-Gaussian natural phenomena, we consider the GN in a more general context and invoke the generalized CLT, which suggests that the GN converges to a heavy-tailed $\alpha$-stable random vector, where tail-index $\alpha$ determines the heavy-tailedness of the distribution. Accordingly, we propose to analyze SGD as a discretization of an SDE driven by a Lévy motion. Such SDEs can incur `jumps', which force the SDE and its discretization transition from narrow minima to wider minima, as proven by existing metastability theory and the extensions that we proved recently. In this study, under the $\alpha$-stable GN assumption, we further establish an explicit connection between the convergence rate of SGD to a local minimum and the tail-index $\alpha$. To validate the $\alpha$-stable assumption, we conduct experiments on common deep learning scenarios and show that in all settings, the GN is highly non-Gaussian and admits heavy-tails. We investigate the tail behavior in varying network architectures and sizes, loss functions, and datasets. Our results open up a different perspective and shed more light on the belief that SGD prefers wide minima.

@article{simsekli2019heavytailed,
  abstract = {The gradient noise (GN) in the stochastic gradient descent (SGD) algorithm is often considered to be Gaussian in the large data regime by assuming that the \emph{classical} central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion. We argue that the Gaussianity assumption might fail to hold in deep learning settings and hence render the Brownian motion-based analyses inappropriate. Inspired by non-Gaussian natural phenomena, we consider the GN in a more general context and invoke the \emph{generalized} CLT, which suggests that the GN converges to a \emph{heavy-tailed} $\alpha$-stable random vector, where \emph{tail-index} $\alpha$ determines the heavy-tailedness of the distribution. Accordingly, we propose to analyze SGD as a discretization of an SDE driven by a L\'{e}vy motion. Such SDEs can incur `jumps', which force the SDE and its discretization \emph{transition} from narrow minima to wider minima, as proven by existing metastability theory and the extensions that we proved recently. In this study, under the $\alpha$-stable GN assumption, we further establish an explicit connection between the convergence rate of SGD to a local minimum and the tail-index $\alpha$. To validate the $\alpha$-stable assumption, we conduct experiments on common deep learning scenarios and show that in all settings, the GN is highly non-Gaussian and admits heavy-tails. We investigate the tail behavior in varying network architectures and sizes, loss functions, and datasets. Our results open up a different perspective and shed more light on the belief that SGD prefers wide minima.},
  added-at = {2019-12-04T20:33:05.000+0100},
  author = {Şimşekli, Umut and Gürbüzbalaban, Mert and Nguyen, Thanh Huy and Richard, Gaël and Sagun, Levent},
  biburl = {https://www.bibsonomy.org/bibtex/2173cc7ef345aecbecd65d097c2ac6d2b/kirk86},
  description = {[1912.00018] On the Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks},
  interhash = {7d689bd45962e2853fe90a5ff685f637},
  intrahash = {173cc7ef345aecbecd65d097c2ac6d2b},
  keywords = {bounds deep-learning optimization readings},
  note = {cite arxiv:1912.00018Comment: 32 pages. arXiv admin note: substantial text overlap with arXiv:1901.06053},
  timestamp = {2019-12-04T20:33:05.000+0100},
  title = {On the Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks},
  url = {http://arxiv.org/abs/1912.00018},
  year = 2019
}
