Abstract
Learning visual representations of medical images is core to medical image
understanding, but its progress has been held back by the small size of
hand-labeled datasets. Existing work commonly relies on transferring weights
from ImageNet pretraining, which is suboptimal due to drastically different
image characteristics, or on rule-based label extraction from the textual report
data paired with medical images, which is inaccurate and hard to generalize. We
propose an alternative unsupervised strategy to learn medical visual
representations directly from the naturally occurring pairing of images and
textual data. Our method of pretraining medical image encoders with the paired
text data via a bidirectional contrastive objective between the two modalities
is domain-agnostic and requires no additional expert input. We test our method
by transferring our pretrained weights to 4 medical image classification tasks
and 2 zero-shot retrieval tasks, and show that our method leads to image
representations that considerably outperform strong baselines in most settings.
Notably, in all 4 classification tasks, our method requires only 10% as much
labeled training data as an ImageNet-initialized counterpart to achieve better
or comparable performance, demonstrating superior data efficiency.
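For concreteness, below is a minimal PyTorch sketch of one way such a bidirectional contrastive objective can be written: an InfoNCE-style loss applied in both the image-to-text and text-to-image directions over a minibatch of paired embeddings. The function name, the temperature value, and the weighting `lam` between the two directions are illustrative assumptions, not details stated in this abstract.

```python
import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(image_emb, text_emb, temperature=0.1, lam=0.5):
    """Symmetric InfoNCE loss between paired image and text embeddings.

    image_emb, text_emb: (batch, dim) outputs of the image and text
    encoders' projection heads; row i of each tensor comes from the same
    image-report pair, so the diagonal holds the positive pairs and every
    other in-batch pairing serves as a negative.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Cosine similarities between every image and every text in the batch.
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)

    # Image-to-text and text-to-image directions of the objective.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return lam * loss_i2t + (1.0 - lam) * loss_t2i
```

The same cosine-similarity matrix also suggests how zero-shot retrieval can work at test time: given a query in one modality, candidates from the other modality are ranked by similarity in the shared embedding space, with no task-specific training.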