Abstract
A spoken language generation system has been developed that learns
to describe objects in computer-generated visual scenes. The system is
trained by a "show-and-tell" procedure in which visual scenes are paired
with natural language descriptions. Learning algorithms acquire probabilistic structures which encode the visual semantics of phrase structure,
word classes, and individual words. Using these structures, a planning
algorithm integrates syntactic, semantic, and contextual constraints to
generate natural and unambiguous descriptions of objects in novel scenes.
The system generates syntactically well-formed compound adjective noun
phrases, as well as relative spatial clauses. The acquired linguistic structures generalize from training data, enabling the production of novel word
sequences which were never observed during training. The output of the
generation system is synthesized using word-based concatenative synthesis drawing from the original training speech corpus. In evaluations of
semantic comprehension by human judges, automatically generated spoken descriptions performed comparably to human-generated
descriptions. This work is motivated by our long-term goal of developing
spoken language processing systems that ground semantics in machine
perception and action.