Abstract
This paper establishes rates of universal approximation for the shallow
neural tangent kernel (NTK): network weights are only allowed microscopic
changes from random initialization, which entails that activations are mostly
unchanged, and the network is nearly equivalent to its linearization.
Concretely, the paper has two main contributions: a generic scheme to
approximate functions with the NTK by sampling from transport mappings between
the initial weights and their desired values, and the construction of transport
mappings via Fourier transforms. Regarding the first contribution, the proof
scheme provides another perspective on how the NTK regime arises from
rescaling: redundancy in the weights due to resampling allows individual
weights to be scaled down. Regarding the second contribution, the most notable
transport mapping asserts that roughly $1 / \delta^10d$ nodes are sufficient
to approximate continuous functions, where $\delta$ depends on the continuity
properties of the target function. By contrast, nearly the same proof yields a
bound of $1 / \delta^{2d}$ for shallow ReLU networks; this gap suggests a
tantalizing direction for future work, separating shallow ReLU networks and
their linearization.
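
The following numpy sketch is not taken from the paper; it is a minimal illustration, under simplifying assumptions, of the sampling idea described above: per-node weight movements are sampled from a transport mapping, the $1/\sqrt{m}$ rescaling makes each individual movement microscopic, and the perturbed network stays close to its linearization while the linearized (NTK) value converges to the target induced by the mapping. The toy transport mapping $T(w) = w$, the input dimension, and the $1/\sqrt{m}$ output scaling are illustrative choices, not the paper's Fourier-based construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Transport mapping: assigns to each initial weight w a desired movement.
# Toy choice (an assumption for illustration, not the paper's Fourier-based
# construction): T(w) = w, whose induced target at a unit-norm input x is
#   F(x) = E_w[ 1[<w, x> > 0] * <T(w), x> ] = E[(<w, x>)_+] = 1 / sqrt(2*pi).
def T(w):
    return w

d = 2
x = np.array([1.0, 0.0])                 # unit-norm test input
target = 1.0 / np.sqrt(2.0 * np.pi)

def relu_net(x, W, a, m):
    # Shallow ReLU network f(x) = (1/sqrt(m)) * sum_j a_j * relu(<w_j, x>).
    return (a * np.maximum(W @ x, 0.0)).sum() / np.sqrt(m)

for m in [100, 1_000, 10_000, 100_000]:
    W0 = rng.standard_normal((m, d))     # random initialization
    a = rng.choice([-1.0, 1.0], size=m)  # fixed output signs
    # Sample the per-node perturbations from the transport mapping; the
    # 1/sqrt(m) rescaling is what makes each individual movement microscopic.
    V = a[:, None] * T(W0) / np.sqrt(m)
    # Linearized (NTK) prediction at x:
    #   (1/sqrt(m)) * sum_j a_j * 1[<w_j, x> > 0] * <v_j, x>
    active = (W0 @ x > 0).astype(float)
    f_lin = (a * active * (V @ x)).sum() / np.sqrt(m)
    # The nonlinear network, after moving its weights, tracks f_lin because
    # almost no activation pattern flips under such small movements.
    f_net = relu_net(x, W0 + V, a, m) - relu_net(x, W0, a, m)
    print(f"m={m:>7}  NTK value={f_lin:+.4f}  target={target:+.4f}  "
          f"|net - lin|={abs(f_net - f_lin):.5f}  "
          f"max node movement={np.abs(V).max():.4f}")
```

As the width $m$ grows, the printed NTK value approaches the closed-form target $1/\sqrt{2\pi}$, the gap between the network and its linearization shrinks, and the largest single-weight movement tends to zero, matching the rescaling perspective in the abstract.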