Abstract
We present the Moments in Time Dataset, a large-scale human-annotated
collection of one million short videos corresponding to dynamic events
unfolding within three seconds. Modeling the spatial-audio-temporal dynamics
even for actions occurring in 3-second videos poses many challenges: meaningful
events do not include only people, but also objects, animals, and natural
phenomena; visual and auditory events can be symmetrical in time or not
("opening" means "closing" in reverse order), and transient or sustained. We
describe the annotation process of our dataset (each video is tagged with one
action or activity label among 339 different classes), analyze its scale and
diversity in comparison to other large-scale video datasets for action
recognition, and report results of several baseline models addressing
separately and jointly three modalities: spatial, temporal and auditory. The
Moments in Time dataset, designed to have a large coverage and diversity of
events in both visual and auditory modalities, can serve as a new challenge to
develop models that scale to the level of complexity and abstract reasoning
that a human processes on a daily basis.