Learning Visually-Grounded Words and Syntax for a Scene Description Task

D. Roy.
(2002) Computer Speech and Language, In review.

Abstract

A spoken language generation system has been developed that learns to describe objects in computer-generated visual scenes. The system is trained by a `show-and-tell' procedure in which visual scenes are paired with natural language descriptions. Learning algorithms acquire probabilistic structures which encode the visual semantics of phrase structure, word classes, and individual words. Using these structures, a planning algorithm integrates syntactic, semantic, and contextual constraints to generate natural and unambiguous descriptions of objects in novel scenes. The system generates syntactically well-formed compound adjective noun phrases, as well as relative spatial clauses. The acquired linguistic structures generalize from training data, enabling the production of novel word sequences which were never observed during training. The output of the generation system is synthesized using word-based concatenative synthesis drawing from the original training speech corpus. In evaluations of semantic comprehension by human judges, the performance of automatically generated spoken descriptions was comparable to human-generated descriptions. This work is motivated by our long-term goal of developing spoken language processing systems which ground semantics in machine perception and action.

BibTeX key: Roy2002
entry type: article
year: 2002
Document: citeseer.ist.psu.edu/548334.html
note: Computer Speech and Language, In review.


Cite this publication

@article{Roy2002,
  abstract = {A spoken language generation system has been developed that learns to describe objects in computer-generated visual scenes. The system is trained by a `show-and-tell' procedure in which visual scenes are paired with natural language descriptions. Learning algorithms acquire probabilistic structures which encode the visual semantics of phrase structure, word classes, and individual words. Using these structures, a planning algorithm integrates syntactic, semantic, and contextual constraints to generate natural and unambiguous descriptions of objects in novel scenes. The system generates syntactically well-formed compound adjective noun phrases, as well as relative spatial clauses. The acquired linguistic structures generalize from training data, enabling the production of novel word sequences which were never observed during training. The output of the generation system is synthesized using word-based concatenative synthesis drawing from the original training speech corpus. In evaluations of semantic comprehension by human judges, the performance of automatically generated spoken descriptions was comparable to human-generated descriptions. This work is motivated by our long-term goal of developing spoken language processing systems which ground semantics in machine perception and action.},
  added-at = {2007-01-22T14:00:43.000+0100},
  author = {Roy, Deb K.},
  biburl = {https://www.bibsonomy.org/bibtex/22692e9d34027345b0308f7bd974b37b5/tmalsburg},
  description = {Learning Visually-Grounded Words and Syntax for a Scene Description Task - Roy (ResearchIndex)},
  interhash = {77da93c2b02067a604d6c20651e662b1},
  intrahash = {2692e9d34027345b0308f7bd974b37b5},
  keywords = {grounding vision semantics machinelearning multimodality},
  note = {Computer Speech and Language, In review.},
  timestamp = {2007-01-22T14:00:43.000+0100},
  title = {Learning Visually-Grounded Words and Syntax for a Scene Description Task},
  url = {citeseer.ist.psu.edu/548334.html},
  year = 2002
}
