Speaker Recognition
Erinç Dikici, Murat Saraçlar
Introduction
Speaker recognition is the process of automatically recognizing who is speaking in a recording, using the characteristic vocal information contained in the speech. Voice is considered a valuable biometric feature because it reflects both a person's physical attributes and speaking style, and because speech data is easy and unobtrusive to collect and process.
Systems that use speech as a biometric measure cover a wide spectrum, from authentication applications to law enforcement. In telephone banking and e-commerce, voice can be used as a verifier to access customer accounts, or as evidence that a transaction was made consciously. Speaker recognition can be used for indexing broadcast news programs or annotating recorded meetings, making it possible to locate the time intervals in which a particular speaker is talking. Speech biometrics also plays an important role in forensic science by providing a mathematical measure to identify or verify the individual under investigation. Last but not least, speaker verification offers an easy and non-intrusive means of access control to facilities and objects, and can often be combined with other biometric features to provide a higher level of security.
Classification of Speaker Recognition Systems
There are two subproblems in speaker recognition:
- Speaker identification: Determining the exact identity of a speaker
- Speaker verification: Deciding whether the speech belongs to a claimed identity

Figure 1. Speaker recognition problem types (Figures courtesy of Douglas Reynolds, MIT Lincoln Labs)
Based on whether test utterances may come from unknown identities, the speaker identification problem can be divided into closed-set and open-set problems. In closed-set identification, the system knows that the test utterance belongs to one of the speakers it was trained on, and is therefore forced to decide on an identity. In open-set identification, the system may decline to assign the utterance to any of the enrolled speakers.
Based on how the transcription of the speech is used, speaker recognition can also be classified into three types. In text-dependent recognition, the system knows beforehand what the speaker is going to say (i.e., the password). Although knowing the utterance beforehand allows higher performance, such systems are more vulnerable to cheating. To overcome this problem, text-prompted speaker recognition systems have been developed, which generate a random word or sequence of numbers on the spot and ask the user to repeat it within a short amount of time. Finally, in text-independent recognition, what is being said is not known. This makes the system more flexible, but also makes the problem more challenging.
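The difference between these decision rules can be illustrated with a minimal Python sketch. It assumes that some scoring stage has already produced one similarity score per enrolled speaker; the function names, the speaker labels and the threshold values are hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional


def identify_closed_set(scores: Dict[str, float]) -> str:
    """Closed set: always return the best-scoring enrolled speaker.

    Raises ValueError if no speakers are enrolled (empty dict).
    """
    return max(scores, key=scores.get)


def identify_open_set(scores: Dict[str, float], threshold: float) -> Optional[str]:
    """Open set: return the best-scoring speaker, or None when even the
    best score is below the threshold (utterance is rejected)."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


def verify_claim(scores: Dict[str, float], claimed: str, threshold: float) -> bool:
    """Verification: accept only if the claimed speaker's score
    reaches the threshold; other speakers' scores are ignored."""
    return scores[claimed] >= threshold


# Hypothetical similarity scores for one test utterance.
example_scores = {"alice": 2.1, "bob": -0.4, "carol": 0.7}
closed_decision = identify_closed_set(example_scores)
open_decision = identify_open_set(example_scores, threshold=3.0)
claim_accepted = verify_claim(example_scores, "bob", threshold=0.0)
```

With these example scores, the closed-set rule still names a speaker, while the open-set rule with the stricter threshold rejects the utterance.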
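The text-prompted idea can be sketched as follows. This is only an illustration under stated assumptions: the prompt is a random digit sequence, a separate speech recognizer (not shown) is assumed to return the recognized words as text, and the names `make_prompt`, `prompt_satisfied` and the 5-second limit are hypothetical. The speaker verification step itself would still have to be run on the same audio.

```python
import secrets


def make_prompt(num_digits: int = 6) -> str:
    """Generate an unpredictable digit sequence for the user to repeat."""
    return " ".join(str(secrets.randbelow(10)) for _ in range(num_digits))


def prompt_satisfied(prompt: str, recognized_text: str,
                     elapsed_s: float, max_delay_s: float = 5.0) -> bool:
    """Check that the user repeated the prompt, and did so quickly enough.

    recognized_text is assumed to come from a speech recognizer;
    elapsed_s is the time between showing the prompt and the response.
    """
    on_time = elapsed_s <= max_delay_s
    words_match = recognized_text.split() == prompt.split()
    return on_time and words_match


prompt = make_prompt(6)
```

Because the prompt changes on every attempt and must be answered within a short window, a previously recorded password is of little use to an impostor.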
Speaker Recognition Experiments at BUSIM SPG

Figure 2. General block diagram of a speaker verification system
Figure 2 shows the block diagram of a typical speaker verification system. Basically, the system extracts parameters from the input speech signal to represent vocal characteristics and uses this information to build representative speaker models. When a test utterance arrives with an identity claim, the system evaluates the utterance's parameters under the claimed speaker's model and calculates a similarity score. If the similarity is above a threshold, the system accepts the speaker; otherwise, it assumes an illegal access attempt and rejects the claim.
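The enroll, score and threshold steps described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the system used in our experiments: feature extraction is assumed to have been done already (each frame is a short list of numbers), the speaker model is a single diagonal-covariance Gaussian, and the similarity score is the average log-likelihood per frame. The function names, the toy data and the threshold value are hypothetical.

```python
import math
from typing import List, Tuple

Frame = List[float]
Model = Tuple[List[float], List[float]]  # per-dimension (means, variances)


def train_model(frames: List[Frame], var_floor: float = 1e-3) -> Model:
    """Enrollment: estimate per-dimension mean and variance."""
    n, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    variances = [
        max(sum((f[d] - means[d]) ** 2 for f in frames) / n, var_floor)
        for d in range(dim)
    ]
    return means, variances


def avg_log_likelihood(frames: List[Frame], model: Model) -> float:
    """Similarity score: mean log-likelihood of the frames under the model."""
    means, variances = model
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, means, variances):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total / len(frames)


def verify(test_frames: List[Frame], claimed_model: Model, threshold: float) -> bool:
    """Accept the claim if the score reaches the threshold."""
    return avg_log_likelihood(test_frames, claimed_model) >= threshold


# Toy two-dimensional "features" (purely illustrative values).
enroll_frames = [[0.0, 1.0], [0.2, 0.8], [-0.2, 1.2]]
genuine_frames = [[0.1, 0.9], [-0.1, 1.1]]
impostor_frames = [[3.0, -2.0], [3.2, -1.8]]

speaker_model = train_model(enroll_frames)
genuine_score = avg_log_likelihood(genuine_frames, speaker_model)
impostor_score = avg_log_likelihood(impostor_frames, speaker_model)
```

In this toy setup the genuine frames lie close to the enrollment data and score much higher than the impostor frames, so a threshold placed between the two scores separates them. In practice the threshold has to be tuned on held-out data.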
The performance of speaker verification methods greatly depends upon three main factors:
- Data duration: The amount of speech used to enroll speakers in the system. The longer the training data, the higher the verification performance. A similar rule applies to testing. Large amounts of speech data may be available in forensic and broadcast indexing applications. However, in a realistic security access scenario, the testing duration in particular should be kept as short as possible.
- Modeling techniques and complexity: Hyperparameters of the training methods must be optimized to ensure accurate verification. Generally, more complex models are needed to achieve similar performance with increasing amounts of data.
- Session variability: Acoustic mismatches between the recordings of the training and testing sessions pose a major challenge to achieving high verification performance.
At BUSIM SPG, we investigate the performance of text-independent speaker verification methods under changing data durations, model complexities and acoustic conditions.
Conference Publications:
- Erinç Dikici, Murat Saraçlar, “Investigating the Effect of Training Data Partitioning for GMM Supervector Based Speaker Verification”, 24th International Symposium on Computer and Information Sciences, Northern Cyprus, 2009. (pdf)
Conference Publications (in Turkish):
- Erinç Dikici, Murat Saraçlar, "GKM ve DVM Tabanlı Konuşmacı Doğrulamada Veri Süresi ve Model Büyüklüğüne Dayalı Başarım Analizi", IEEE 17. Sinyal İşleme ve İletişim Uygulamaları Kurultayı (SİU'09), Antalya, 2009. (pdf)
Theses:
- Erinç Dikici, “Effects of Data Duration, Model Size and Session Variability on Speaker Verification Performance”, MS Thesis, 2009.
References:
- F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-Garcia, D. Petrovska-Delacretaz, and D. A. Reynolds, "A tutorial on text-independent speaker verification", EURASIP Journal on Applied Signal Processing, Vol. 4, pp. 430-451, 2004.
- J. P. Campbell, "Speaker recognition: a tutorial", Proceedings of the IEEE, Vol. 85, No. 9, pp. 1437-1462, 1997.
- S. Kung, M. Mak, and S. Lin, Biometric authentication: a machine learning approach, Prentice Hall Press, Upper Saddle River, NJ, USA, 2004.
- D. Reynolds, "An overview of automatic speaker recognition technology", Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP '02), Vol. 4, pp. IV-4072-IV-4075, 2002.