Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos.

D. He, X. Zhao, J. Huang, F. Li, X. Liu, and S. Wen.
(2019). cite arxiv:1901.06829. Comment: AAAI 2019.

Abstract

The task of video grounding, which temporally localizes a natural language description in a video, plays an important role in understanding videos. Existing studies have adopted strategies of sliding window over the entire video or exhaustively ranking all possible clip-sentence pairs in a pre-segmented video, which inevitably suffer from exhaustively enumerated candidates. To alleviate this problem, we formulate this task as a problem of sequential decision making by learning an agent which regulates the temporal grounding boundaries progressively based on its policy. Specifically, we propose a reinforcement learning based framework improved by multi-task learning and it shows steady performance gains by considering additional supervised boundary information during training. Our proposed framework achieves state-of-the-art performance on ActivityNet'18 DenseCaption dataset and Charades-STA dataset while observing only 10 or less clips per video.

BibTeX key: journals/corr/abs-1901-06829
entry type: misc
year: 2019
url: http://arxiv.org/abs/1901.06829
Note: cite arxiv:1901.06829. Comment: AAAI 2019
