Voice Recognition


The importance of human-machine communication is beyond dispute. From internet banking to voice-driven web browsing, the need for this form of communication is evident. As humans, we shape speech to convey messages and information. It is therefore not enough for a machine to react to a voice (sound) stimulus; it must also understand the message that the speech carries.

The analysis proposed in this report aims to identify the origin of a voice in order to operate a security system. The system should be able to recognize three characteristic voices from three different sources or speakers. Note that the analysis is performed on the attributes of the voice, and the system must identify the speaker. Once the system has correctly identified one of the three predetermined individuals as the speaker, it must acknowledge this and operate the security system. If the speaker is not recognised, the system must report an unknown speaker and deny access to the security system.

This system, whatever its purpose, will be fitted with at least a sensor and a control unit. The sensor acts as the input for the signal (voice), while the control unit processes the information. This part is simulated in MATLAB, using the default audio card as the data acquisition interface and a state-of-the-art microphone connected to the audio card as the sensor. The analysis and control algorithms are the result of this work.

See Fig 1 for a system description.

Fig 1: Security system model. The actuator can be physical, or the control unit might only give an authorizing/denying signal.

System Model:

The system is intended to recognize a speaker. It is therefore necessary to understand the characteristics of the voice and what differentiates one person's voice from another's. Voice is the sound produced by the human vocal apparatus. It is produced by two mechanisms:

1. Sounds are produced by the vibration of the vocal cords and by resonance in the mouth and nasal cavities.

2. The flow of air is interrupted by the lips, tongue, teeth or trachea.

As a result, speech is a combination of sounds and stops. Analysis can be done on these two aspects together or independently. This combination is what gives each person's voice its individuality and differentiates it from the rest.

As explained above, the output of this work is the control algorithm for the DSP, simulated in the MATLAB environment. Three general steps have to be followed: acquisition, processing and decision. In the processing step, the nature of recognition makes it clear that a comparison is the basis for identifying the speaker. The authorized speakers must therefore have recorded their voices in the system beforehand. The decision step is conditioned by this comparison. See Fig 2.

Fig 2: Comparison with recorded data is necessary to establish the identity of the speaker.

The characteristics of the human voice depend on many aspects, such as the mood of the speaker, the sex of the speaker, pronunciation (speech speed, accents), the channel (noise in the environment) and the position of the speaker with respect to the receiver. Analyzing these influences is outside the scope of this work. The work has therefore focused on gathering information from the frequencies contained in the sound associated with the speech.

Different texts give different ranges for the frequencies of the human voice; see Kim, Y (1999) and Dajer, M.E. et al (2005). Musicologists propose that fundamental frequencies go from 80 Hz to 1100 Hz, while speech analysis applications use ranges from 100 Hz to 10 kHz, and it is widely known that the telephone line goes up to 3.4 or 4 kHz. It was therefore necessary to analyze the frequency range and the location of the main harmonics for the authorized speakers.

The Cepstrum analysis.

This analysis is based on identifying the voice with sound. The sound emitted when speaking is a combination of different frequencies, and this combination varies from one person to another. With the Fast Fourier Transform (FFT), the sound wave is transformed to the frequency domain, so the characteristic frequencies (individual to each person and more or less independent of the message) can be identified. See Fig 3 for a characteristic FFT.

Fig 3: Sound wave in the time domain (top plot) and in the frequency domain (bottom plot). Note the main peaks around 100 and 250 Hz.
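To make the step from the time domain to the frequency domain concrete, here is a minimal sketch in Python (standard library only). It is not the author's MATLAB code: the function names `fft` and `dominant_frequencies`, the synthetic two-tone signal and the 1024 Hz sampling rate are assumptions chosen so that the peaks fall at 100 and 250 Hz, as in Fig 3.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 FFT. len(samples) must be a power of two."""
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_frequencies(samples, fs, count=2):
    """Return the `count` strongest spectral peaks, in Hz, in ascending order."""
    n = len(samples)
    # Only the first half of the spectrum is unique for a real signal.
    mags = [abs(c) for c in fft(samples)[: n // 2]]
    # Keep local maxima only, then rank them by magnitude.
    peaks = [k for k in range(1, n // 2 - 1)
             if mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]]
    peaks.sort(key=lambda k: mags[k], reverse=True)
    return sorted(k * fs / n for k in peaks[:count])


# Synthetic stand-in for a voice recording (assumption, not measured data).
fs = 1024
tone = [math.sin(2 * math.pi * 100 * t / fs)
        + 0.5 * math.sin(2 * math.pi * 250 * t / fs)
        for t in range(1024)]
print(dominant_frequencies(tone, fs))  # [100.0, 250.0]
```

With a real recording the frame would first have to be cut or zero-padded to a power-of-two length, and the peaks would be broader than in this clean synthetic case.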

The cepstrum is given by the inverse Fourier transform (IFT) of the logarithm of the modulus of the spectrum. For a signal x(n), the usual definition is:

c(n) = IFT{ log |FT{x(n)}| }

It is interesting to note that the FFT takes the signal out of the time domain, and the signal is then transferred from the frequency domain to the “quefrency” domain. Analysis in the time domain can be difficult because of its dependence on the mood of the speaker and the pronunciation. The characteristics obtained from the cepstrum analysis are also individual to each speaker. See Fig 4 for the cepstrum analysis of the previous wave. This technique has been widely used for speech and speaker recognition, as can be seen in Sanchez, F. L et al (2006) and San Martin, C et al (2004).
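A minimal sketch of the real cepstrum in Python (standard library only), assuming the usual definition: the inverse transform of the log-magnitude spectrum. The helper `_fft`, the `floor` parameter and the synthetic echo signal are illustrative assumptions, not part of the original MATLAB work. An impulse followed by a half-amplitude echo 32 samples later is used because its cepstrum is known to peak at quefrency 32.

```python
import cmath
import math


def _fft(samples, sign=-1):
    """Radix-2 transform; sign=-1 is the forward FFT, sign=+1 the unscaled inverse."""
    n = len(samples)
    if n == 1:
        return [complex(samples[0])]
    even = _fft(samples[0::2], sign)
    odd = _fft(samples[1::2], sign)
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def real_cepstrum(samples, floor=1e-12):
    """Inverse FFT of log|FFT(samples)|. len(samples) must be a power of two.

    `floor` keeps the logarithm finite when a spectral bin is exactly zero.
    """
    n = len(samples)
    log_mag = [math.log(max(abs(c), floor)) for c in _fft(samples)]
    # The inverse transform needs the 1/n scaling; the result is real.
    return [c.real / n for c in _fft(log_mag, sign=+1)]


# Impulse plus an echo delayed by 32 samples (synthetic, for illustration).
signal = [0.0] * 256
signal[0] = 1.0
signal[32] = 0.5
cepstrum = real_cepstrum(signal)
peak = max(range(1, 128), key=lambda q: cepstrum[q])
print(peak)  # 32
```

The same idea applies to voiced speech: periodic structure in the spectrum shows up as a peak at the corresponding quefrency, which is what the peaks in Fig 4 represent.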


Filtering has not been a main part of this work, but some considerations are important. First of all, as seen in Fig 3 and Fig 4, the signal is very noisy. There is a significant noise component around 0 Hz, along with other low-frequency noise, probably originating in the internal circuitry of the computer or the microphone itself. There are further noise components at high frequencies (see Fig 3, top plot). When the FFT is applied directly to a signal like this, it yields many frequency components that are not relevant to the analysis. To remove these two components, a band-pass filter has been proposed. Some experiments have been carried out with digital filters (see Fig 5), especially FIR filters, which are more suitable for audio applications than IIR filters because of their finite impulse response. This has been discussed by Capobianco, R. et al (2005) in their work on a matched FIR filter bank for audio coding.

Fig 4: Cepstrum analysis of the wave. Note the characteristic peaks around sample 80 and sample 250.

Fig 5: Application of a band-pass FIR filter, where the DC and high-frequency components have been cancelled. Note that the spectrum is cleaner now.
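The band-pass idea can be sketched with a Hamming-windowed sinc design in Python (standard library only). This is one common way to build an FIR band-pass filter and is not necessarily the design used for Fig 5. The cut-off frequencies (500 Hz and 4000 Hz), the tap count and the function names are placeholders, since the report does not state the values that were used; only the 22050 Hz sampling rate comes from the text.

```python
import math


def bandpass_fir(low_hz, high_hz, fs, num_taps=201):
    """Hamming-windowed sinc band-pass filter. num_taps must be odd."""
    if num_taps % 2 == 0:
        raise ValueError("num_taps must be odd")
    mid = (num_taps - 1) // 2
    f1, f2 = low_hz / fs, high_hz / fs  # normalized cut-offs (cycles/sample)
    taps = []
    for i in range(num_taps):
        m = i - mid
        if m == 0:
            ideal = 2 * (f2 - f1)
        else:
            # Difference of two ideal low-pass responses.
            ideal = (math.sin(2 * math.pi * f2 * m)
                     - math.sin(2 * math.pi * f1 * m)) / (math.pi * m)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (num_taps - 1))
        taps.append(ideal * window)
    return taps


def fir_filter(taps, samples):
    """Direct-form convolution with zero initial state, same length as input."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * samples[n - k]
        out.append(acc)
    return out


def gain_at(taps, freq_hz, fs):
    """Magnitude of the filter's frequency response at freq_hz."""
    w = 2 * math.pi * freq_hz / fs
    re = sum(h * math.cos(w * k) for k, h in enumerate(taps))
    im = sum(h * math.sin(w * k) for k, h in enumerate(taps))
    return math.hypot(re, im)


# Placeholder band edges; 22050 Hz is the sampling rate given in the text.
taps = bandpass_fir(500.0, 4000.0, 22050.0)
dc_gain = gain_at(taps, 0.0, 22050.0)          # should be close to 0
passband_gain = gain_at(taps, 2000.0, 22050.0)  # should be close to 1
stopband_gain = gain_at(taps, 9000.0, 22050.0)  # should be close to 0
```

A lower cut-off closer to the voice fundamentals (around 100 Hz) needs a narrower transition band and therefore noticeably more taps at this sampling rate, which is a trade-off to weigh against processing cost.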

Processing Unit

The processing unit is in charge of taking decisions based on the results of the analysis. For this particular application, where the signal processing and memory requirements are not very demanding, it could be a Digital Signal Processor (DSP), a plain general-purpose microcontroller (uC) or a combination of both. The choice between them depends on many aspects:

DSP: Filtering techniques are much easier to implement on a DSP, its power consumption is lower than that of a uC, and more memory is available. These devices are cheap, and their software is easy to update to enhance the application.

uC: Ideal for controlling I/O and driving actuators. The board for a uC is less bulky than that of a DSP, and the wiring is simpler. The availability of a watchdog is useful for security systems. On the other hand, uCs are poorly suited to signal processing applications.

The number of security systems to be manufactured, the available manufacturing process, the expected robustness and the cost will determine the choice between the two. A good option would be to use a DSP for the signal processing function and an FPGA (Field Programmable Gate Array) or another robust multipurpose chip to drive the whole system, including possible actuators (buttons, locks, tamper-proofing, etc.).

Model description

The security system is intended to work as a “finite state machine”, going through a series of states cyclically (see Fig 6). The cycle starts with the acquisition of the signal:

Fig 6: Description of the finite states in the system. The system is intended to go cyclically through these stages.

·Acquisition is carried out through a sensor (microphone) that captures the analogue signal. Because of its analogue nature, this signal is continuous in time and cannot be analyzed directly by the computer. The data acquisition card samples the sound wave at equally spaced intervals, converting it to the discrete domain. This array of samples can be analyzed digitally. The sampling rate depends on the maximum frequency to be analyzed. The Nyquist theorem states that, to avoid aliasing, “the sampling frequency must be at least double the maximum frequency that it is intended to obtain”.

The sampling frequency has been set to 22050 Hz, with the aim of capturing information up to 10 kHz (the Nyquist limit at this rate is 11025 Hz). The acquisition time has been set to 3 seconds, which gives 66150 samples and is long enough to say a word (password). This first stage is triggered by pressing a button. This saves energy, since the system does not have to scan the environment constantly for a sound signal, and it avoids the case where an authorized speaker is talking next to the security system without intending to access it.

·The analysis section transfers the signal from the discrete time domain to the frequency domain with the FFT, and then to the “quefrency” domain with the cepstrum. The relevant, personal features of the signal are extracted from this quefrency signal. This part of the model has to be adjusted for each authorized speaker. Once the features are extracted, they are saved in memory and are available for comparison with the signal of the unknown speaker. The signal from the unknown speaker follows the same steps and is compared in the “quefrency” domain. Of the many possible ways to carry out the comparison, it was decided to compare around significant peaks. The samples around these peaks are grouped and a statistical analysis is carried out, including the maximum value, absolute value, average value and mean value. See Table 1.

Table 1: Results of five tests over the values in samples 70 to 90 of the quefrency domain, where a representative peak lies.

·The decision stage is straightforward. Based on the results of the analysis, a decision is made on whether or not the speaker is recognized (granting or denying access to the security system). In this simulated model a message is displayed, but in a real system the action could be to communicate with an actuator that operates a lock when the speaker is recognized, or to alert security staff when an unauthorized speaker attempts access more than 3 consecutive times.
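The analysis and decision stages described in the list above can be sketched as follows in Python (standard library only). This is an illustration under stated assumptions, not the author's algorithm: the feature set (maximum and mean of samples 70 to 90), the 15% relative tolerance, the enrolled values and the names `peak_features`, `identify` and `AccessController` are all hypothetical. Only the sample window and the "more than 3 consecutive times" rule come from the text.

```python
import statistics


def peak_features(cepstrum, start=70, stop=90):
    """Summary statistics of the cepstrum samples start..stop (inclusive)."""
    window = cepstrum[start:stop + 1]
    return {"max": max(window), "mean": statistics.fmean(window)}


def identify(features, enrolled, tolerance=0.15):
    """Return the best-matching enrolled speaker, or None if nobody is close.

    A speaker matches when every feature lies within `tolerance`
    (relative error) of the enrolled value. The tolerance is an assumption.
    """
    best_name, best_err = None, tolerance
    for name, ref in enrolled.items():
        err = max(abs(features[k] - ref[k]) / (abs(ref[k]) or 1.0) for k in ref)
        if err <= best_err:
            best_name, best_err = name, err
    return best_name


class AccessController:
    """Decision stage: grant, deny, or alert after repeated failures."""

    def __init__(self, enrolled, max_failures=3):
        self.enrolled = enrolled
        self.max_failures = max_failures
        self.failures = 0  # consecutive unrecognized attempts

    def attempt(self, cepstrum):
        name = identify(peak_features(cepstrum), self.enrolled)
        if name is not None:
            self.failures = 0
            return "access granted: " + name
        self.failures += 1
        if self.failures > self.max_failures:
            return "access denied: alert security"
        return "access denied: unknown speaker"


# Hypothetical enrolled template and two synthetic cepstra for illustration.
enrolled = {"speaker_1": {"max": 1.0, "mean": 0.5}}
known = [0.0] * 70 + [0.475] * 10 + [1.0] + [0.475] * 10 + [0.0] * 9
unknown = [0.0] * 100

controller = AccessController(enrolled)
print(controller.attempt(known))    # access granted: speaker_1
print(controller.attempt(unknown))  # access denied: unknown speaker
```

In a real system the templates would come from the enrollment recordings of the three authorized speakers, and the tolerance would be tuned against test data such as that in Table 1.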


The model has been simulated in the MATLAB environment. It is worth remembering that this is only the model for the control unit of the security system; the whole system is intended to have more components, as discussed above. The chosen method of analysis (cepstrum) is very interesting and widely used in audio applications, but it is just as important to choose the correct filter to remove irrelevant peaks and so obtain a more reliable and accurate system.

The accuracy of this model is acceptable. It was tested successfully during the different testing sessions, but the speaker-to-microphone distance and ambient noise proved to be decisive factors. Their influence can be reduced by placing the speaker and sensor in a controlled environment, or by using algorithms that cancel the effect of these factors.

The MATLAB code can be found here.

