Speaker Recognition

Erinç Dikici, Murat Saraçlar

 

Introduction

Speaker recognition is the process of automatically recognizing who is speaking in a recording, using the speaker-specific vocal information contained in speech. Voice is considered a valuable biometric feature because it reflects both a person's physical attributes and speaking style, and because speech data is easy and unobtrusive to collect and process.

Systems that use speech as a biometric measure cover a wide spectrum, from authentication applications to law enforcement. In telephone banking and e-commerce, voice can be used as a verifier to access customer accounts, or as an indication that a transaction was made consciously. Speaker recognition can be used for indexing broadcast news programs or annotating recorded meetings, making it possible to locate the time intervals in which a particular speaker is talking. Speech biometrics also has an important role in forensic science, providing a quantitative measure to identify or verify the individual under investigation. Last but not least, speaker verification offers an easy and unobtrusive means of access control to facilities and objects, and can often be combined with other biometric features to provide a higher level of security.

Classification of Speaker Recognition Systems

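There are two subproblems in speaker recognition: speaker identification, which determines which of the known speakers produced a given utterance, and speaker verification, which decides whether an utterance belongs to the speaker it is claimed to come from.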

Figure 1. Speaker recognition problem types (Figures courtesy of Douglas Reynolds, MIT Lincoln Labs)

Depending on whether test utterances may come from an unknown identity, speaker identification can be divided into closed-set and open-set problems. In closed-set identification, the system knows that the test utterance belongs to one of the speakers it was trained on, and therefore always decides on one of those identities. In open-set identification, the system may decline to assign the utterance to any of the known speakers.
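The difference between the two identification modes can be sketched in a few lines of Python. This is only an illustration: the speaker names, the score values and the threshold are made up, and the scores are assumed to be similarity values (higher means a better match) already produced by some speaker models.

```python
from typing import Dict, Optional


def identify_closed_set(scores: Dict[str, float]) -> str:
    """Closed set: always return the best-scoring enrolled speaker."""
    return max(scores, key=scores.get)


def identify_open_set(scores: Dict[str, float], threshold: float) -> Optional[str]:
    """Open set: return the best speaker only if its score reaches the threshold."""
    best = identify_closed_set(scores)
    return best if scores[best] >= threshold else None


# Hypothetical similarity scores of one test utterance against three enrolled speakers.
scores = {"alice": -41.2, "bob": -38.7, "carol": -45.0}

closed_result = identify_closed_set(scores)               # always names a speaker
open_result = identify_open_set(scores, threshold=-35.0)  # may return None (rejection)
```

With these illustrative numbers the closed-set decision names the best-scoring speaker, while the open-set decision rejects the utterance because even the best score stays below the threshold.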

Based on how the spoken text is used, speaker recognition can also be classified into three types. In text-dependent recognition, the system knows beforehand what the speaker is going to say (i.e., the password). Although knowing the utterance beforehand allows higher performance, such systems are more prone to cheating. To overcome this problem, text-prompted speaker recognition systems have been developed, which generate a random word or sequence of numbers on the spot and ask the user to repeat it within a short amount of time. Finally, in text-independent recognition, what is being said is not known, which makes the system more flexible but the problem more challenging.
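The text-prompted idea can be illustrated with a small Python sketch. It covers only the challenge side: generating a random digit sequence and checking that the reply matches and arrives in time. The function names, the six-digit length and the five-second limit are assumptions made for the example, and the recognized text is assumed to come from a separate speech recognizer that is not shown.

```python
import secrets


def make_prompt(num_digits: int = 6) -> str:
    """Generate a fresh random digit sequence for the user to repeat."""
    return " ".join(str(secrets.randbelow(10)) for _ in range(num_digits))


def response_is_valid(prompt: str, recognized_text: str,
                      elapsed_seconds: float, max_seconds: float = 5.0) -> bool:
    """Accept the reply only if the words match the prompt and it arrived in time."""
    on_time = elapsed_seconds <= max_seconds
    same_words = prompt.split() == recognized_text.split()
    return on_time and same_words


prompt = make_prompt()
# A correct and timely reply passes; a late one does not.
ok = response_is_valid(prompt, prompt, elapsed_seconds=2.0)
late = response_is_valid(prompt, prompt, elapsed_seconds=9.0)
```

Because the prompt changes on every attempt, a previously recorded password cannot simply be replayed; the speaker check itself would still run on the audio of the reply.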

Speaker Recognition Experiments at BUSIM SPG

Figure 2. General block diagram of a speaker verification system

Figure 2 shows the block diagram of a typical speaker verification system. Basically, the system extracts parameters from the input speech signal to represent vocal characteristics and uses this information to build representative speaker models. When a test utterance arrives with an identity claim, the system evaluates the utterance's parameters under the claimed speaker's model and calculates a similarity score. If the similarity is above a threshold, the system accepts the speaker; if not, it assumes an illegitimate access attempt and rejects the claim.
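The train, score and threshold steps described above can be sketched as follows. This is a toy illustration under strong assumptions, not the system used in the experiments: feature extraction is skipped (the feature vectors are made-up two-dimensional numbers), the speaker model is a single diagonal-covariance Gaussian, the similarity score is the average per-frame log-likelihood, and the threshold value is arbitrary.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Model = Tuple[List[float], List[float]]  # per-dimension means and variances


def train_model(features: Sequence[Vector], var_floor: float = 1e-3) -> Model:
    """Fit a diagonal Gaussian to the enrollment feature vectors."""
    n, dim = len(features), len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dim)]
    variances = [
        max(sum((f[d] - means[d]) ** 2 for f in features) / n, var_floor)
        for d in range(dim)
    ]
    return means, variances


def score(features: Sequence[Vector], model: Model) -> float:
    """Average per-frame log-likelihood of the features under the model."""
    means, variances = model
    total = 0.0
    for frame in features:
        for x, m, v in zip(frame, means, variances):
            total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total / len(features)


def verify(features: Sequence[Vector], model: Model, threshold: float) -> bool:
    """Accept the identity claim if the similarity score reaches the threshold."""
    return score(features, model) >= threshold


# Hypothetical enrollment and test data (stand-ins for real acoustic features).
enrollment = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (1.1, 2.1)]
claimed_model = train_model(enrollment)

genuine_test = [(1.0, 2.0), (1.1, 1.9)]
impostor_test = [(4.0, -1.0), (3.8, -0.7)]
threshold = -5.0  # arbitrary value chosen for this example

accept_genuine = verify(genuine_test, claimed_model, threshold)
accept_impostor = verify(impostor_test, claimed_model, threshold)
```

In this toy setup the genuine test frames lie close to the enrollment data and score above the threshold, while the impostor frames lie far away and are rejected. In practice the threshold trades off false acceptances against false rejections and has to be tuned on held-out data.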

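The performance of speaker verification methods depends greatly on three main factors: the duration of the available speech data, the complexity of the speaker models, and the acoustic conditions under which the speech is recorded.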

At BUSIM SPG, we investigate the performance of text-independent speaker verification methods under changing data durations, model complexities and acoustic conditions.

 

Conference Publications:

  1. Erinç Dikici, Murat Saraçlar, “Investigating the Effect of Training Data Partitioning for GMM Supervector Based Speaker Verification”, 24th International Symposium on Computer and Information Sciences, Northern Cyprus, 2009. (pdf)

Conference Publications (in Turkish):

  1. Erinç Dikici, Murat Saraçlar, "GKM ve DVM Tabanlı Konuşmacı Doğrulamada Veri Süresi ve Model Büyüklüğüne Dayalı Başarım Analizi", IEEE 17. Sinyal İşleme ve İletişim Uygulamaları Kurultayı (SİU'09), Antalya, 2009. (pdf)

Theses:

  1. Erinç Dikici, “Effects of Data Duration, Model Size and Session Variability on Speaker Verification Performance”, MS Thesis, 2009.
