Zusammenfassung
We present an attention-based model for recognizing multiple objects in
images. The proposed model is a deep recurrent neural network trained with
reinforcement learning to attend to the most relevant regions of the input
image. We show that the model learns to both localize and recognize multiple
objects despite being given only class labels during training. We evaluate the
model on the challenging task of transcribing house number sequences from
Google Street View images and show that it is both more accurate than the
state-of-the-art convolutional networks and uses fewer parameters and less
computation.
Nutzer