Audio-visual unit selection for the synthesis of photo-realistic talking-heads

E. Cosatto, G. Potamianos, and H. P. Graf. IEEE International Conference on Multimedia and Expo (ICME), pages 619-622. New York, NY, USA (August 2000)
DOI: 10.1109/ICME.2000.871439

Abstract

This paper investigates audio-visual unit selection for the synthesis of photo-realistic, speech-synchronized talking-head animations. These animations are synthesized from recorded video samples of a subject speaking in front of a camera, resulting in a photo-realistic appearance. The lip-synchronization is obtained by optimally selecting and concatenating variable-length video units of the mouth area. Synthesizing a new speech animation from these recorded units starts with audio speech and its phonetic annotation from a text-to-speech synthesizer. Then, optimal image units are selected from the recorded set using a Viterbi search through a graph of candidate image units. Costs are attached to the nodes and arcs of the graph that are computed from similarities in both the acoustic and visual domain. While acoustic similarities are computed by simple phonetic matching, visual similarities are estimated using a hierarchical metric that uses high-level features (position and sizes of facial parts) and low-level features (projection of the image pixels on principal components of the database). This method preserves coarticulation and temporal coherence, producing smooth, lip-synched animations. Once the database has been prepared, this system can produce animations from ASCII text fully automatically.
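The selection step described in the abstract can be illustrated with a small dynamic-programming sketch. This is not the authors' implementation: the `Unit` fields, the 0/1 phonetic node cost, the weights `w_high` and `w_low`, and the weighted-sum visual distance are all assumptions made for illustration. The paper's hierarchical metric and its variable-length units are simplified here to one fixed candidate list per target phoneme.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Unit:
    """Hypothetical candidate mouth-image unit (field names are illustrative)."""
    phoneme: str  # phonetic label of the recorded unit
    high: tuple   # high-level features, e.g. mouth width and height
    low: tuple    # low-level features, e.g. PCA projection coefficients


def node_cost(target_phoneme, unit):
    """Acoustic cost by simple phonetic matching: 0 on a match, 1 otherwise."""
    return 0.0 if unit.phoneme == target_phoneme else 1.0


def arc_cost(prev, nxt, w_high=1.0, w_low=0.5):
    """Visual cost between consecutive units.

    Assumption: a weighted sum of Euclidean distances stands in for the
    hierarchical metric named in the abstract.
    """
    return w_high * dist(prev.high, nxt.high) + w_low * dist(prev.low, nxt.low)


def viterbi_select(targets, candidates):
    """Return (total_cost, path) minimizing the sum of node and arc costs.

    targets: list of target phonemes.
    candidates[t]: list of Unit objects considered at step t.
    """
    # Each entry holds the cheapest cost and path ending in one candidate.
    best = [(node_cost(targets[0], u), [u]) for u in candidates[0]]
    for t in range(1, len(targets)):
        new_best = []
        for u in candidates[t]:
            # Cheapest way to reach u from any candidate of the previous step.
            cost, path = min(
                ((c + arc_cost(p[-1], u), p) for c, p in best),
                key=lambda cp: cp[0],
            )
            new_best.append((cost + node_cost(targets[t], u), path + [u]))
        best = new_best
    return min(best, key=lambda cp: cp[0])
```

Calling `viterbi_select(["a", "b"], candidates)` with two candidate lists returns the cheapest unit sequence together with its total cost. Because the arc cost penalizes visual jumps between neighbouring units, a path of visually similar units is preferred over one that only matches the phonemes.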

Links and resources

BibTeX key: Cosatto2000a
entry type: inproceedings
address: New York, NY, USA
booktitle: IEEE International Conference on Multimedia and Expo (ICME)
year: 2000
month: aug
pages: 619-622
owner: schabus
file: :pdfs/cosatto_icme_2000.pdf:PDF
DOI: 10.1109/ICME.2000.871439

Cite this publication

%0 Conference Paper
%1 Cosatto2000a
%A Cosatto, Eric
%A Potamianos, Gerasimos
%A Graf, Hans P.
%B IEEE International Conference on Multimedia and Expo (ICME)
%C New York, NY, USA
%D 2000
%K algorithm animation;multimedia animations;temporal area;phonetic based coherence;text-to-speech computer computing;realistic databases;Mouth;Spatial databases;Speech databases;Viterbi features;lip-synchronization;low-level features;mouth image images;speech matching;photo-realistic metric;high-level processing;Viterbi samples;sample search;acoustic selection;candidate signal similarities;audio-visual synthesis;Synthesizers;Visual synthesis;speech-synchronized synthesis;video synthesizer;variable-length talking-head talking-heads;recorded unit units;Animation;Cameras;Costs;Image units;coarticulation;computer video vision;hierarchical
%P 619-622
%R 10.1109/ICME.2000.871439
%T Audio-visual unit selection for the synthesis of photo-realistic talking-heads
%X This paper investigates audio-visual unit selection for the synthesis of photo-realistic, speech-synchronized talking-head animations. These animations are synthesized from recorded video samples of a subject speaking in front of a camera, resulting in a photo-realistic appearance. The lip-synchronization is obtained by optimally selecting and concatenating variable-length video units of the mouth area. Synthesizing a new speech animation from these recorded units starts with audio speech and its phonetic annotation from a text-to-speech synthesizer. Then, optimal image units are selected from the recorded set using a Viterbi search through a graph of candidate image units. Costs are attached to the nodes and arcs of the graph that are computed from similarities in both the acoustic and visual domain. While acoustic similarities are computed by simple phonetic matching, visual similarities are estimated using a hierarchical metric that uses high-level features (position and sizes of facial parts) and low-level features (projection of the image pixels on principal components of the database). This method preserves coarticulation and temporal coherence, producing smooth, lip-synched animations. Once the database has been prepared, this system can produce animations from ASCII text fully automatically.

@inproceedings{Cosatto2000a,
  abstract = {This paper investigates audio-visual unit selection for the synthesis of photo-realistic, speech-synchronized talking-head animations. These animations are synthesized from recorded video samples of a subject speaking in front of a camera, resulting in a photo-realistic appearance. The lip-synchronization is obtained by optimally selecting and concatenating variable-length video units of the mouth area. Synthesizing a new speech animation from these recorded units starts with audio speech and its phonetic annotation from a text-to-speech synthesizer. Then, optimal image units are selected from the recorded set using a Viterbi search through a graph of candidate image units. Costs are attached to the nodes and arcs of the graph that are computed from similarities in both the acoustic and visual domain. While acoustic similarities are computed by simple phonetic matching, visual similarities are estimated using a hierarchical metric that uses high-level features (position and sizes of facial parts) and low-level features (projection of the image pixels on principal components of the database). This method preserves coarticulation and temporal coherence, producing smooth, lip-synched animations. Once the database has been prepared, this system can produce animations from ASCII text fully automatically.},
  added-at = {2021-02-01T10:51:23.000+0100},
  address = {New York, NY, USA},
  author = {Cosatto, Eric and Potamianos, Gerasimos and Graf, Hans P.},
  biburl = {https://www.bibsonomy.org/bibtex/2ab6a6e40165b8a073b776e11ec7ff0b7/m-toman},
  booktitle = {IEEE International Conference on Multimedia and Expo (ICME)},
  doi = {10.1109/ICME.2000.871439},
  file = {:pdfs/cosatto_icme_2000.pdf:PDF},
  interhash = {bbd0ed2ed22aa41ab905977ffe74c679},
  intrahash = {ab6a6e40165b8a073b776e11ec7ff0b7},
  keywords = {algorithm animation;multimedia animations;temporal area;phonetic based coherence;text-to-speech computer computing;realistic databases;Mouth;Spatial databases;Speech databases;Viterbi features;lip-synchronization;low-level features;mouth image images;speech matching;photo-realistic metric;high-level processing;Viterbi samples;sample search;acoustic selection;candidate signal similarities;audio-visual synthesis;Synthesizers;Visual synthesis;speech-synchronized synthesis;video synthesizer;variable-length talking-head talking-heads;recorded unit units;Animation;Cameras;Costs;Image units;coarticulation;computer video vision;hierarchical},
  month = aug,
  owner = {schabus},
  pages = {619-622},
  timestamp = {2021-02-01T10:51:23.000+0100},
  title = {Audio-visual unit selection for the synthesis of photo-realistic talking-heads},
  year = 2000
}
