Inproceedings

Taking into Account the User's Focus of Attention with the Help of Audio-Visual Information: Towards less Artificial Human-Machine-Communication

Proceedings of AVSP'07: International Conference on Auditory-Visual Speech Processing, Kasteel Groenendaal, Hilvarenbeek, The Netherlands, 2007.

Abstract

In the German SmartWeb project, the user interacts with the web via a PDA in order to get information on, for example, points of interest. To overcome the tedious use of devices such as push-to-talk, but still be able to tell whether the user is addressing the system or talking to herself or to a third person, we developed a module that monitors speech and video in parallel. Our database (3.2 hours of speech, 2086 turns) has been recorded in a real-life setting, indoors as well as outdoors, with unfavourable acoustic and light conditions. With acoustic features, we classify up to four different types of addressing (talking to the system: On-Talk; reading from the display: Read Off-Talk; paraphrasing information presented on the screen: Paraphrasing Off-Talk; talking to a third person or to oneself: Spontaneous Off-Talk). With a camera integrated in the PDA, we record the user's face and decide whether she is looking at the PDA or somewhere else. We use three different types of turn features based on classification scores of frame-based face detection and word-based analysis: 13 acoustic-prosodic features, 18 linguistic features, and 9 video features. The classification rate with acoustics only is up to 62 percent for the four-class problem, and up to 77 percent for the most important two-class problem (is the user focussing on interaction with the system or not). With video only, it is 45 percent and 71 percent, respectively. By combining the two modalities and additionally using linguistic information, classification performance for the two-class problem rises to 85 percent so far.
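The abstract describes combining acoustic-prosodic, linguistic, and video turn features to decide whether the user is focussing on interaction with the system. The paper itself does not provide code, and the fusion method is not specified here; the sketch below is only a hypothetical late-fusion baseline in Python (scikit-learn), where random placeholder data stands in for the 13 acoustic-prosodic, 18 linguistic, and 9 video features per turn, and per-modality classifier scores are averaged for the two-class On-Focus/Off-Focus decision.

```python
# Illustrative sketch only: random placeholder data, not the SmartWeb features,
# and a simple averaging fusion assumed for demonstration purposes.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_turns = 2086  # number of turns in the database described in the abstract

# Placeholder per-turn feature matrices (real features would come from prosodic
# analysis, word-based linguistic analysis, and frame-based face detection).
X_acoustic = rng.normal(size=(n_turns, 13))
X_linguistic = rng.normal(size=(n_turns, 18))
X_video = rng.normal(size=(n_turns, 9))
# Two-class target: 1 = user focusses on the system (On-Talk), 0 = any Off-Talk.
y = rng.integers(0, 2, size=n_turns)

idx_train, idx_test = train_test_split(np.arange(n_turns), test_size=0.3,
                                        random_state=0)

def modality_scores(X):
    """Train a per-modality classifier and return its On-Focus probabilities."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[idx_train], y[idx_train])
    return clf.predict_proba(X[idx_test])[:, 1]

# Late fusion: average the per-modality scores and threshold at 0.5.
scores = np.mean(
    [modality_scores(X) for X in (X_acoustic, X_linguistic, X_video)], axis=0
)
y_pred = (scores >= 0.5).astype(int)
accuracy = (y_pred == y[idx_test]).mean()
print(f"fused two-class accuracy on placeholder data: {accuracy:.2f}")
```

On random data this of course yields chance-level accuracy; with real turn features, such score-level fusion is one straightforward way to let the video channel complement the acoustic and linguistic channels.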
