Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language
A. Baevski, A. Babu, W.-N. Hsu, and M. Auli. https://ai.facebook.com/blog/ai-self-supervised-learning-data2vec/, 2022. arXiv:2212.07525.
Abstract
Current self-supervised learning algorithms are often modality-specific and
require large amounts of computational resources. To address these issues, we
increase the training efficiency of data2vec, a learning objective that
generalizes across several modalities. We do not encode masked tokens, use a
fast convolutional decoder and amortize the effort to build teacher
representations. data2vec 2.0 benefits from the rich contextualized target
representations introduced in data2vec which enable a fast self-supervised
learner. Experiments on ImageNet-1K image classification show that data2vec 2.0
matches the accuracy of Masked Autoencoders in 16.4x lower pre-training time,
on Librispeech speech recognition it performs as well as wav2vec 2.0 in 10.6x
less time, and on GLUE natural language understanding it matches a retrained
RoBERTa model in half the time. Trading some speed for accuracy results in
ImageNet-1K top-1 accuracy of 86.8% with a ViT-L model trained for 150 epochs.
@misc{baevski2022efficient,
abstract = {Current self-supervised learning algorithms are often modality-specific and
require large amounts of computational resources. To address these issues, we
increase the training efficiency of data2vec, a learning objective that
generalizes across several modalities. We do not encode masked tokens, use a
fast convolutional decoder and amortize the effort to build teacher
representations. data2vec 2.0 benefits from the rich contextualized target
representations introduced in data2vec which enable a fast self-supervised
learner. Experiments on ImageNet-1K image classification show that data2vec 2.0
matches the accuracy of Masked Autoencoders in 16.4x lower pre-training time,
on Librispeech speech recognition it performs as well as wav2vec 2.0 in 10.6x
less time, and on GLUE natural language understanding it matches a retrained
RoBERTa model in half the time. Trading some speed for accuracy results in
ImageNet-1K top-1 accuracy of 86.8\% with a ViT-L model trained for 150 epochs.},
author = {Baevski, Alexei and Babu, Arun and Hsu, Wei-Ning and Auli, Michael},
biburl = {https://www.bibsonomy.org/bibtex/2103d4d889843ad70057e3e919966c8c8/martinr},
description = {Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language},
howpublished = {https://ai.facebook.com/blog/ai-self-supervised-learning-data2vec/},
keywords = {data2vec dmir-readinggroup},
note = {arXiv:2212.07525},
title = {Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language},
url = {http://arxiv.org/abs/2212.07525},
year = 2022
}