Voice Conversion

Oytun Türk, Levent Arslan

 

Voice conversion is the automatic transformation of a source speaker’s voice so that it sounds like the voice of a target speaker. Figure 1 shows the general framework for voice conversion.

Figure 1. General framework for voice conversion.

State-of-the-art voice conversion algorithms share two stages: training and transformation. In the training stage, the system gathers information from recordings of the source and target speakers and automatically formulates voice conversion rules. In the transformation stage, it applies those rules to modify the source voice so that it matches the characteristics of the target voice.
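As a rough illustration of these two stages, the sketch below trains and applies a very simple pitch conversion rule: it matches the mean and standard deviation of log-F0 between two speakers. This is a generic textbook-style technique chosen for brevity, not the method used in the publications listed on this page. The function names, the dictionary layout, and the F0 values are all hypothetical.

```python
import math
from statistics import mean, pstdev


def train_pitch_rule(source_f0, target_f0):
    """Training stage: estimate log-F0 statistics for both speakers.

    Inputs are F0 values in Hz taken from voiced frames only
    (hypothetical data, assumed to be strictly positive).
    """
    src_log = [math.log(f) for f in source_f0]
    tgt_log = [math.log(f) for f in target_f0]
    return {
        "src_mean": mean(src_log),
        "src_std": pstdev(src_log),
        "tgt_mean": mean(tgt_log),
        "tgt_std": pstdev(tgt_log),
    }


def transform_pitch(f0_contour, rule):
    """Transformation stage: move a source F0 contour toward the target.

    Frames with F0 <= 0 are treated as unvoiced and returned as 0.0.
    """
    # Guard against a constant source contour (zero variance).
    scale = rule["tgt_std"] / rule["src_std"] if rule["src_std"] > 0 else 1.0
    converted = []
    for f in f0_contour:
        if f <= 0:
            converted.append(0.0)
            continue
        log_f = rule["tgt_mean"] + scale * (math.log(f) - rule["src_mean"])
        converted.append(math.exp(log_f))
    return converted


if __name__ == "__main__":
    # Made-up F0 samples (Hz) standing in for training recordings.
    source_training = [110.0, 120.0, 130.0, 125.0, 115.0]
    target_training = [200.0, 220.0, 240.0, 230.0, 210.0]
    rule = train_pitch_rule(source_training, target_training)
    print(transform_pitch([0.0, 118.0, 126.0, 0.0], rule))
```

A full voice conversion system would also transform vocal tract (spectral) characteristics and other prosodic features; the sketch only shows how rules estimated during training are reused at transformation time.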

Applications of Voice Conversion:

The quality of Text-To-Speech (TTS) synthesis has improved through the use of large databases and unit-selection techniques (Hunt and Black, 1996; Dutoit, 1997). Because voice conversion requires far less training data (5-10 minutes of voice recordings), it is an attractive way to create new TTS voices from existing ones. TTS has therefore been regarded as the primary application field for voice conversion in the literature (Kain and Macon, 1998; Zhang et al., 2001).

High-quality voice conversion systems make many other applications possible, some of which we demonstrated in our previous work: dubbing movies with only a few dubbers, generating the voices of famous actresses/actors in a foreign language that they cannot speak, and generating the voices of actresses/actors who are no longer alive (Turk and Arslan, 2002, 2003). Other dubbing applications include regenerating the voices of actresses/actors who have lost their voice characteristics due to old age, and dubbing for radio broadcasts. Integrated with 3-D facial animation techniques, voice conversion can be used to create virtual characters or even a virtual copy of a speaker’s audio-visual identity. Voice conversion may also serve as a component of a speech-to-speech translation system that produces a speaker’s voice in a language that s/he cannot speak, using a cascaded architecture of speech recognition, automatic translation, text-to-speech, and voice conversion modules.

Research Topics on Voice Conversion:

At BUSIM, we focus on the following issues in voice conversion:

(*) VOX is developed for Voxonic Inc., New York. For further information, please refer to www.voxonic.com

Voice Conversion Demo:

Demonstrations in different languages, between different source-target speaker pairs, and using different strategies will be available soon.

Journal Publications:

  1. Turk, O. and Arslan, L. M., 2005, "Robust Processing Techniques for Voice Conversion", Computer Speech and Language, in press.
  2. Arslan, L. M., 1999, "Speaker Transformation Algorithm using Segmental Codebooks (STASC)", Speech Communication, vol. 28, pp. 211-226, June 1999. (pdf)

Conference Publications:

  1. Turk, O., and Arslan, L. M., 2005, "Donor Selection for Voice Conversion", 13th European Signal Processing Conference - EUSIPCO 2005, Antalya, Turkey. (pdf)
  2. Turk, O., Schröder, M., Bozkurt, B., and Arslan, L. M., 2005, "Voice Quality Interpolation for Emotional Text-To-Speech Synthesis", INTERSPEECH 2005, Lisbon, Portugal. (pdf)
  3. Turk, O., and Arslan, L. M., 2003, "New Methods for Vocal Tract and Pitch Contour Transformation", EUROSPEECH 2003, Geneva, Switzerland. (pdf)
  4. Turk, O., and Arslan, L. M., 2003, "Subjective Evaluations for Perception of Speaker Identity Through Acoustic Feature Transplantations", EUROSPEECH 2003, Geneva, Switzerland. (pdf)
  5. Turk, O., and Arslan, L. M., 2002, "Subband Based Voice Conversion", Proceedings of the ICSLP 2002, vol. 1, pp. 289-292, September 2002, Denver, Colorado, USA. (pdf)
  6. Ormancı, E., Nikbay, U. H., Turk, O., and Arslan, L. M., 2002, "Subjective Assessment of Frequency Bands for Perception of Speaker Identity", Proceedings of the ICSLP 2002, vol. 4, pp. 2581-2584, September 2002, Denver, Colorado, USA. (pdf)
  7. Arslan, L. M., and Talkin, D., 1998, "Speaker Transformation using Sentence HMM Based Alignments and Detailed Prosody Modification", IEEE Proc. ICASSP, Seattle, USA, May 1998. (pdf)
  8. Arslan, L. M., and Talkin, D., 1997, "Voice Conversion by Segmental Codebook Mapping of Line Spectral Frequencies and Excitation Spectrum", EUROSPEECH Proceedings, vol. 3, pp. 1347-1350, Rhodes, Greece, September 1997. (pdf)

Theses:
Turk, O., 2003, "New Methods for Voice Conversion", M.S. Thesis.
Turk, O., "Cross-Lingual Voice Conversion", PhD Thesis.

References:

  1. Abe, M., Nakamura, S., Shikano, K., Kuwabara, H., 1988. Voice conversion through vector quantization. Proc. of the IEEE ICASSP, 565-568.
  2. Acero, A., 1993. Acoustical and Environmental Robustness in Automatic Speech Recognition. Kluwer Academic Publishers, Dordrecht.
  3. Arslan, L.M., Talkin, D., 1997. Voice Conversion by Segmental Codebook Mapping of Line Spectral Frequencies and Excitation Spectrum. EUROSPEECH Proceedings, vol. 3, pp. 1347-1350, Rhodes, Greece.
    Arslan, L.M., 1999. Speaker transformation algorithm using segmental codebooks. Speech Communication 28, 211-226.
  4. Childers, D.G., 1995. Glottal source modeling for voice conversion. Speech Communication 16 (2), 127-138.
  5. Childers, D.G., Lee, C.-K., 1991. Vocal quality factors: Analysis, synthesis, and perception. Journal of the Acoustical Society of America 90, 2394-2410.
  6. Crosmer, J.R., 1985. Very low bit rate speech coding using the line spectrum pair transformation of the LPC coefficients. Ph.D. Thesis, Elec. Eng., Georgia Inst. Technology.
  7. Davis, S., and Mermelstein, P., 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust., Speech, Signal Processing 28, 357-366.
  8. Drioli, C., 1999. Radial basis function networks for conversion of sound spectra. Proc. of the 2nd COST G-6 Workshop on Digital Audio Effects (DAFx99), NTNU, Trondheim.
  9. Dutoit, T., 1997. High-quality text-to-speech synthesis: an overview. Journal of Electrical & Electronics Eng., Australia: Special Issue on Speech Recognition and Synthesis, vol. 17, no 1, 25-37.
  10. Fant, G., Liljencrants, J., Lin, Q. 1985. A four-parameter model of the glottal flow. Speech Transmission Laboratory Quarterly Progress and Status Reports, No. 4, Royal Institute of Technology, Stockholm, Sweden, 1-13.
  11. Flanagan, J.L., Golden, R.M., 1966. Phase vocoder. Bell System Tech. Journal 45, 1493-1500.
    Furui, S., 1986. Research on individuality features in speech waves and automatic speaker recognition techniques. Speech Communication 5 (2), 183-197.
  12. Holmes, W., Holmes, J., Judd, M., 1990. Extension of the bandwidth of the JSRU parallel-formant synthesizer for high quality synthesis of male and female speech. Proc. of the IEEE ICASSP 90 (1), 313-316.
  13. Hunt, A., Black, A.W., 1996. Unit selection in a concatenative speech synthesis system using a large speech database. Proc. of the IEEE ICASSP 96, 373-376.
  14. Garofolo, J.S., Lamel, L.F., Fisher, W.M., Fiscus, J.G., Pallett, D.S., Dahlgren, N.L., 1990. DARPA-TIMIT acoustic-phonetic continuous speech corpus [CDROM].
  15. Itakura, F., 1975a. Line spectrum representation of linear predictor coefficients of speech signals. Journal of Acoust. Soc. of America, vol. 57, S35 (A).
  16. Itakura, F., 1975b. Minimum prediction residual principle applied to speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-23, no. 1, 67-72.
  17. Itoh, K., Saito, S., 1982. Effects of acoustical feature parameters of speech on perceptual identification of speaker. IECE Transactions, J65-A, 101-108.
  18. Kain, A., Macon, M., 1998. Personalizing a speech synthesizer by voice adaptation. Proceedings of the Third ESCA/COCOSDA International Speech Synthesis Workshop, 225-230.
  19. Knohl, L., Rinscheid, A., 1993. Speaker normalization with self-organizing feature maps. Proc. IJCNN-93-Nagoya, Int. Joint Conf. on Neural Networks, 243-246.
  20. Kuwabara, H., Sagisaka, Y., 1995. Acoustic characteristics of speaker individuality: control and conversion. Speech Communication 16, 165-173.
  21. Laroche, J., Stylianou, Y., Moulines, E., 1993. HNS: Speech modification based on a harmonic + noise model. Proc. of the IEEE ICASSP-93, Minneapolis.
  22. Makhoul, J., 1975. Linear prediction: A tutorial review. Proc. of the IEEE, vol. 63, 561-580.
  23. Matsumoto, H., Hiki, S., Sone, T., Nimura, T., 1973. Multidimensional representation of personal quality of vowels and its acoustical correlates. IEEE Trans. AU, AU-21, 428-436.
  24. McAulay, R.J., Quatieri, T.F., 1995. Sinusoidal coding, in: Kleijn, W.B., Paliwal, K.K. (Eds.), Speech Coding and Synthesis, Elsevier Science B.V., Netherlands, pp. 121-173.
  25. Mizuno, H., Abe, M., 1995. Voice conversion algorithm based on piecewise linear conversion rules of formant frequency and spectrum tilt. Speech Communication 16, 153-164.
  26. Moulines, E., Charpentier, F., 1990. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication 9, 453-467.
  27. Moulines, E., Sagisaka, Y. (Eds.), 1995. Voice conversion: state of the art and perspectives (special issue of Speech Communication). Elsevier Science B.V., Netherlands, 16(2).
  28. Moulines, E., Verhelst, W., 1995. Time-domain and frequency-domain techniques for prosodic modification of speech, in: Kleijn, W.B., and Paliwal, K.K., (Eds.), Speech Coding and Synthesis, Elsevier Science B.V., Netherlands, 519-555.
  29. Narendranath, M., Murthy, H.M., Rajendran, S., Yegnanarayana, B., 1995. Transformation of formants for voice conversion using artificial neural networks. Speech Communication 16, 207-216.
  30. Necioglu, B.F., Clements, M.A., Barnwell III, T.P., Schmidt-Nielsen, A., 1998. Perceptual relevance of objectively measured descriptors for speaker characterization. Proc. of the IEEE ICASSP, 869-872.
  31. Quatieri, T.F., McAulay, R.J., 1992. Shape invariant time-scale and pitch modification of speech. IEEE Transactions on Signal Processing, vol. 40, no. 3, 497-510.
  32. Rabiner, L.R., Schafer, R.W., 1978. Digital Processing of Speech Signals, Prentice-Hall Inc., Englewood Cliffs, New Jersey.
  33. Rabiner, L.R., 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. of the IEEE, vol. 77, no. 2, 257-286.
  34. Rothweiler, J., 1999. A root-finding algorithm for line spectral frequencies. Proc. of the IEEE ICASSP 99, 661-664.
  35. Stylianou, Y., Cappe, O., Moulines, E., 1998. Continuous probabilistic transform for voice conversion. IEEE Trans. on Speech and Audio Proc., vol. 6, no. 2, 131-142.
  36. Talkin, D., 1995. A robust algorithm for pitch tracking (RAPT), in: Kleijn, W.B., and Paliwal, K.K. (Eds.), Speech Coding and Synthesis, Elsevier Science B.V., Netherlands, pp. 495-518.
  37. Turk, O., Arslan, L.M., 2002. Subband based voice conversion. Proc. of the ICSLP 2002, vol. 1, 289-292.
    Turk, O., Arslan, L.M., 2003. Voice conversion methods for vocal tract and pitch contour modification. Proc. of the Eurospeech (Interspeech), 2845-2848.
  38. Zhang, W., Shen, L.Q., Tang, D., 2001. Voice conversion based on acoustic feature transformation. Proc. of the 6th National Conference on Man-Machine Speech Communications.