Abstract
Temporal segmentation of long videos is an important problem that has
largely been tackled through supervised learning, often requiring large amounts
of annotated training data. In this paper, we address the problem of
self-supervised temporal segmentation of long videos, alleviating the need
for any supervision. We introduce a self-supervised, predictive learning
framework that draws inspiration from cognitive psychology to segment long,
visually complex videos into individual, stable segments that share the same
semantics. We also introduce a new adaptive learning paradigm that helps reduce
the effect of catastrophic forgetting in recurrent neural networks. Extensive
experiments on three publicly available datasets - Breakfast Actions, 50
Salads, and INRIA Instructional Videos - show the efficacy of the
proposed approach. We show that it outperforms weakly-supervised and other
unsupervised learning approaches by up to 24\% and achieves performance
competitive with fully supervised approaches. We also show that the proposed
approach learns highly discriminative features that improve action recognition
when used in a representation learning paradigm.