A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." arXiv:2010.11929 (2020). ICLR 2021. Fine-tuning code and pre-trained models: https://github.com/google-research/vision_transformer.
R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. "Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization." arXiv:1610.02391 (2016). ICCV 2017; extended version in IJCV 2019.
K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. "Momentum Contrast for Unsupervised Visual Representation Learning." arXiv:1911.05722 (2019). CVPR 2020. Code: https://github.com/facebookresearch/moco.
T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. "A Simple Framework for Contrastive Learning of Visual Representations." arXiv:2002.05709 (2020). ICML 2020. Code and pretrained models: https://github.com/google-research/simclr.
X. Wang, R. Girshick, A. Gupta, and K. He. "Non-local Neural Networks." arXiv:1711.07971 (2017). CVPR 2018. Code: https://github.com/facebookresearch/video-nonlocal-net.
P. Dhariwal and A. Nichol. "Diffusion Models Beat GANs on Image Synthesis." arXiv:2105.05233 (2021).