Training data-efficient image transformers & distillation through attention

H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jegou.
Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 10347--10357. PMLR, 18--24 Jul 2021.

Abstract

Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. These high-performing vision transformers are pre-trained with hundreds of millions of images using a large infrastructure, thereby limiting their adoption. In this work, we produce competitive convolution-free transformers trained on ImageNet only using a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop) on ImageNet with no external data. We also introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention, typically from a convnet teacher. The learned transformers are competitive (85.2% top-1 acc.) with the state of the art on ImageNet, and similarly when transferred to other tasks. We will share our code and models.

BibTeX key: pmlr-v139-touvron21a
entry type: inproceedings
booktitle: Proceedings of the 38th International Conference on Machine Learning
year: 2021
month: 18--24 Jul
pages: 10347--10357
publisher: PMLR
series: Proceedings of Machine Learning Research
volume: 139
pdf: http://proceedings.mlr.press/v139/touvron21a/touvron21a.pdf
Document: https://proceedings.mlr.press/v139/touvron21a.html

Cite this publication

@inproceedings{pmlr-v139-touvron21a,
  abstract = {Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. These high-performing vision transformers are pre-trained with hundreds of millions of images using a large infrastructure, thereby limiting their adoption. In this work, we produce competitive convolution-free transformers trained on ImageNet only using a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop) on ImageNet with no external data. We also introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention, typically from a convnet teacher. The learned transformers are competitive (85.2% top-1 acc.) with the state of the art on ImageNet, and similarly when transferred to other tasks. We will share our code and models.},
  added-at = {2022-07-11T20:00:04.000+0200},
  author = {Touvron, Hugo and Cord, Matthieu and Douze, Matthijs and Massa, Francisco and Sablayrolles, Alexandre and Jegou, Herve},
  biburl = {https://www.bibsonomy.org/bibtex/2e432cd6b9882f5169202f17023c759dd/simonh},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  description = {Training data-efficient image transformers & distillation through attention},
  editor = {Meila, Marina and Zhang, Tong},
  interhash = {d0078fd7ff7fc87997d38aab0b6b536e},
  intrahash = {e432cd6b9882f5169202f17023c759dd},
  keywords = {},
  month = {18--24 Jul},
  pages = {10347--10357},
  pdf = {http://proceedings.mlr.press/v139/touvron21a/touvron21a.pdf},
  publisher = {PMLR},
  series = {Proceedings of Machine Learning Research},
  timestamp = {2022-07-12T10:09:24.000+0200},
  title = {Training data-efficient image transformers & distillation through attention},
  url = {https://proceedings.mlr.press/v139/touvron21a.html},
  volume = 139,
  year = 2021
}
