An Empirical Study of Training Self-Supervised Vision Transformers
X. Chen, S. Xie, and K. He. (2021). arXiv:2104.02057. Comment: Camera-ready, ICCV 2021, Oral. Code: https://github.com/facebookresearch/moco-v3.
Abstract
This paper does not describe a novel method. Instead, it studies a
straightforward, incremental, yet must-know baseline given the recent progress
in computer vision: self-supervised learning for Vision Transformers (ViT).
While the training recipes for standard convolutional networks are by now
highly mature and robust, the recipes for ViT have yet to be established,
especially in self-supervised scenarios, where training becomes more
challenging. In this
work, we go back to basics and investigate the effects of several fundamental
components for training self-supervised ViT. We observe that instability is a
major issue that degrades accuracy, yet it can be masked by apparently good
results. We reveal that these results are in fact partial failures, and that
they can be improved when training is made more stable. We benchmark ViT
results in MoCo
v3 and several other self-supervised frameworks, with ablations in various
aspects. We discuss the currently positive evidence as well as challenges and
open questions. We hope that this work will provide useful data points and
experience for future research.
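
For context, MoCo v3 (the framework benchmarked in the paper) trains two
encoders on two augmented crops of each image: a base encoder that receives
gradients and a momentum encoder updated as an exponential moving average of
it, with a symmetrized InfoNCE loss computed over the batch and no memory
queue. The sketch below is a minimal, illustrative rendition of that recipe
in PyTorch, not the authors' implementation (see the linked repository for
the reference code); the tiny stand-in backbone, batch shapes, and
hyperparameter values are placeholders.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def contrastive_loss(q, k, tau=0.2):
    # InfoNCE over the batch: the positive for each query is the
    # same-index key; every other key in the batch is a negative.
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / tau                      # [N, N] similarities
    labels = torch.arange(q.size(0), device=q.device)
    # The 2*tau scaling follows the released MoCo v3 code.
    return F.cross_entropy(logits, labels) * (2 * tau)

# Stand-in backbone; the paper uses a ViT plus projection/prediction MLPs.
base_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
momentum_encoder = copy.deepcopy(base_encoder)
for p in momentum_encoder.parameters():
    p.requires_grad = False       # updated by EMA, not by gradients

def training_step(x1, x2, m=0.99):
    # Queries come from the base encoder, keys from the momentum encoder.
    q1, q2 = base_encoder(x1), base_encoder(x2)
    with torch.no_grad():
        k1, k2 = momentum_encoder(x1), momentum_encoder(x2)
    # Symmetrized loss: each crop serves once as query, once as key.
    loss = contrastive_loss(q1, k2) + contrastive_loss(q2, k1)
    # Exponential-moving-average update of the momentum encoder.
    with torch.no_grad():
        for pb, pm in zip(base_encoder.parameters(),
                          momentum_encoder.parameters()):
            pm.mul_(m).add_(pb.detach(), alpha=1 - m)
    return loss

x1, x2 = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)  # fake crops
print(training_step(x1, x2).item())

The stabilization the paper reports for ViT is notably simple: freeze the
patch-projection layer at its random initialization (i.e. set
requires_grad = False on that layer alone), so the rest of the network trains
on top of a fixed random patch embedding.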
Description
[2104.02057] An Empirical Study of Training Self-Supervised Vision Transformers
%0 Generic
%1 chen2021empirical
%A Chen, Xinlei
%A Xie, Saining
%A He, Kaiming
%D 2021
%K cs.CV
%T An Empirical Study of Training Self-Supervised Vision Transformers
%U http://arxiv.org/abs/2104.02057
@misc{chen2021empirical,
added-at = {2021-08-22T09:58:05.000+0200},
author = {Chen, Xinlei and Xie, Saining and He, Kaiming},
biburl = {https://www.bibsonomy.org/bibtex/29260ec554cae78b7a43e435af5d2cea7/aerover},
description = {[2104.02057] An Empirical Study of Training Self-Supervised Vision Transformers},
interhash = {3d455b6046e89525ea61bc14e48bc739},
intrahash = {9260ec554cae78b7a43e435af5d2cea7},
keywords = {cs.CV},
note = {arXiv:2104.02057. Camera-ready, ICCV 2021, Oral. Code: https://github.com/facebookresearch/moco-v3},
timestamp = {2021-08-22T09:58:05.000+0200},
title = {An Empirical Study of Training Self-Supervised Vision Transformers},
url = {http://arxiv.org/abs/2104.02057},
year = 2021
}