Speech and Text Driven 3D Face Synthesis

Arman Savran, Levent Arslan, Lale Akarun

 

Automatic synthesis of 3D talking faces from speech and text has many applications in human-computer interaction, entertainment, and aids for the hearing impaired. For example, it enables user interface agents for internet applications such as e-learning and interactive TV, as well as avatars in virtual chat rooms and videophones. In the computer game and movie industries, automatic lip-synching for facial animation shortens production time significantly. Applications that provide lip-reading support for the hearing impaired can also be built.

However, automatic generation of visual speech is very challenging: we are all experts at perceiving facial expressions and therefore cannot tolerate any unnatural-looking detail. Accurate synthesis of speech animation is thus the primary concern of this project. Besides speech articulation, the face also carries other visual signals, such as emotions, facial gestures, and involuntary motions like eye blinking and head motion. Producing these expressions realistically is a further research topic in this project.


Visual Speech
Our approach to visual speech synthesis is based on data captured from a speaker. Because the captured data contains the diverse coarticulation effects of natural speech, the synthesized articulation is realistic and natural. The general framework of the procedure is depicted in Figure 1.

 


Figure 1. General framework for visual speech synthesis

To gather training data, we developed a stereo-based 3D facial motion capture system that runs at 30 fps and uses ordinary color stickers to facilitate tracking. For our experiments, we tracked 29 face points covering the necessary MPEG-4 feature points (Figure 2).


Figure 2. Face points for tracking
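
To illustrate how a stereo-based system can recover the 3D position of a tracked sticker, the following is a minimal sketch. It assumes an idealized, rectified two-camera rig, which is not stated on this page; the function name and the parameters (focal length in pixels, baseline in meters, principal point) are hypothetical and do not describe the project's actual capture software.

```python
from typing import NamedTuple


class Point3D(NamedTuple):
    x: float
    y: float
    z: float


def triangulate_rectified(xl: float, yl: float, xr: float,
                          focal_px: float, baseline_m: float,
                          cx: float, cy: float) -> Point3D:
    """Recover a 3D point from one marker seen in a rectified stereo pair.

    Assumption (not from the source): both cameras share the same focal
    length and principal point, and image rows are aligned, so the marker
    differs only in its horizontal pixel coordinate (xl vs. xr).
    """
    disparity = xl - xr
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    # Depth is inversely proportional to disparity.
    z = focal_px * baseline_m / disparity
    # Back-project the left-image pixel to metric coordinates at depth z.
    x = (xl - cx) * z / focal_px
    y = (yl - cy) * z / focal_px
    return Point3D(x, y, z)
```

Calling such a function once per sticker and per frame would yield one 3D trajectory for each of the tracked face points. A real system would also need camera calibration and marker correspondence between the two views, which this sketch leaves out.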

In training, we build a codebook that stores audio and visual features together with their phonetic context. Synthesis is then performed from this codebook. To demonstrate the synthesis results, we used an MPEG-4 animation engine obtained from DIST, University of Genova (Italy).
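
The codebook idea can be sketched as follows. This is only an illustration under stated assumptions: the page does not specify the feature types, the context definition, or the matching rule, so the triphone context, the scoring function, and all names here (CodebookEntry, lookup) are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class CodebookEntry:
    context: Tuple[str, str, str]   # assumed: (previous, current, next) phoneme
    audio: Tuple[float, ...]        # assumed: acoustic feature vector
    visual: Tuple[float, ...]       # assumed: face point or animation parameters


def context_score(a: Tuple[str, str, str], b: Tuple[str, str, str]) -> int:
    """0 if the central phonemes differ, else 1 plus one per matching neighbor."""
    if a[1] != b[1]:
        return 0
    return 1 + int(a[0] == b[0]) + int(a[2] == b[2])


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    return sum((p - q) ** 2 for p, q in zip(u, v))


def lookup(codebook: Sequence[CodebookEntry],
           context: Tuple[str, str, str],
           audio: Sequence[float]) -> Tuple[float, ...]:
    """Return the visual features of the best-matching entry.

    Hypothetical rule: keep the entries with the best phonetic context
    match, then pick the one whose audio features are closest.
    """
    scored = [(context_score(e.context, context), e) for e in codebook]
    best = max((s for s, _ in scored), default=0)
    if best == 0:
        raise LookupError("no codebook entry for phoneme %r" % (context[1],))
    candidates = [e for s, e in scored if s == best]
    nearest = min(candidates, key=lambda e: squared_distance(e.audio, audio))
    return nearest.visual
```

Applied to each frame of input speech, a lookup of this kind would produce a sequence of visual feature vectors that an animation engine could play back. Selecting by phonetic context first is one plausible way to carry coarticulation from the recorded data into the output.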
