Abstract
Tuning hyperparameters of learning algorithms is hard because gradients are
usually unavailable. We compute exact gradients of cross-validation performance
with respect to all hyperparameters by chaining derivatives backwards through
the entire training procedure. These gradients allow us to optimize thousands
of hyperparameters, including step-size and momentum schedules, weight
initialization distributions, richly parameterized regularization schemes, and
neural network architectures. We compute hyperparameter gradients by exactly
reversing the dynamics of stochastic gradient descent with momentum.
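Below is a minimal sketch of the core idea of differentiating validation performance with respect to hyperparameters through an unrolled training run. It uses JAX autodiff over a short SGD-with-momentum loop rather than the paper's memory-efficient exact reversal of the momentum dynamics, and all names (`train_then_validate`, the toy regression data, the step counts) are illustrative assumptions, not the authors' code.

```python
# Hedged sketch: hyperparameter gradients by chaining derivatives backwards
# through an unrolled training loop (plain reverse-mode autodiff, NOT the
# paper's memory-efficient reversal of SGD-with-momentum dynamics).
import jax
import jax.numpy as jnp

def train_loss(w, x, y):
    # Toy regression loss on "training" data.
    return jnp.mean((x @ w - y) ** 2)

def val_loss(w, x_val, y_val):
    # Held-out loss whose gradient w.r.t. hyperparameters we want.
    return jnp.mean((x_val @ w - y_val) ** 2)

def train_then_validate(hypers, w0, data):
    # hypers = (log step size, momentum); unroll a short SGD-with-momentum run.
    log_lr, mom = hypers
    lr = jnp.exp(log_lr)
    x_tr, y_tr, x_val, y_val = data
    w, v = w0, jnp.zeros_like(w0)
    for _ in range(50):  # fixed, small number of training steps
        g = jax.grad(train_loss)(w, x_tr, y_tr)
        v = mom * v - (1.0 - mom) * g
        w = w + lr * v
    return val_loss(w, x_val, y_val)

# Synthetic data for illustration only.
key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
w_true = jnp.arange(1.0, 6.0)
x_tr = jax.random.normal(k1, (100, 5))
y_tr = x_tr @ w_true
x_val = jax.random.normal(k2, (50, 5))
y_val = x_val @ w_true
w0 = 0.1 * jax.random.normal(k3, (5,))
data = (x_tr, y_tr, x_val, y_val)

hypers = (jnp.log(0.05), 0.9)
# Gradient of validation loss w.r.t. (log step size, momentum),
# obtained by backpropagating through the entire training procedure.
hyper_grad = jax.grad(train_then_validate)(hypers, w0, data)
print(hyper_grad)
```

In this naive form every intermediate weight vector is kept in memory for the backward pass; the paper's contribution is to avoid that cost by running the momentum dynamics exactly in reverse, which is what makes optimizing thousands of hyperparameters feasible.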