Open Access. Published by De Gruyter, February 10, 2014. Licensed under CC BY-NC-ND 3.0.

Speaker Verification Under Degraded Conditions Using Empirical Mode Decomposition Based Voice Activity Detection Algorithm

  • M. S. Rudramurthy, V. Kamakshi Prasad and R. Kumaraswamy

Abstract

The performance of most state-of-the-art speaker recognition (SR) systems deteriorates under degraded conditions, owing to mismatch between the training and testing sessions. This study focuses on the front end of the speaker verification (SV) system to reduce the mismatch between training and testing. An adaptive voice activity detection (VAD) algorithm using a zero-frequency filter assisted peaking resonator (ZFFPR) was integrated into the front end of the SV system. The performance of this proposed SV system was studied under degraded conditions with 50 selected speakers from the NIST 2003 database. The degraded condition was simulated by adding different types of noises to the original speech utterances. The noises were chosen from the NOISEX-92 database to simulate degraded conditions at signal-to-noise ratio (SNR) levels from 0 to 20 dB. In this study, the widely used 39-dimension Mel frequency cepstral coefficient (MFCC) features (i.e., 13-dimension MFCCs augmented with 13-dimension velocity and 13-dimension acceleration coefficients) were used, and the Gaussian mixture model–universal background model was used for speaker modeling. The proposed system's performance was studied against an energy-based VAD used as the front end of the SV system. The proposed SV system showed encouraging results when the empirical mode decomposition (EMD)-based VAD was used at its front end.

1 Introduction

Speaker recognition (SR) refers to the ability of a machine to recognize a person from his or her voice. SR tasks are broadly categorized into three fundamental tasks: (i) speaker identification (SI), (ii) speaker verification (SV), and (iii) speaker diarization (SD). The research community has continuously striven to design better SR systems over a history of more than five decades [15]. During this period, a variety of SR tasks were defined in response to the increasing needs of our technologically oriented way of life. There are several survey, tutorial, overview, and review articles on SR, which include Atal's survey on automatic SR (ASR) techniques [1], Gish's survey on text-independent SI [16], Campbell's tutorial on SR [6], Bimbot et al.'s tutorial [4], Reynolds's overview [46], Furui's overview [14], Faundez-Zanuy and Monte-Moreno's overview of the state of the art in SR [9], Kinnunen and Li's recent overview [27], Peacocke's introduction to speech and SR [41], Rosenberg's review on automatic SV (ASV) [50], and Tranter and Reynolds's overview of SD systems [56]. In SI, the task is to determine who is speaking, given a set of N known speakers' voices, without the use of identity claims. It operates in two modes: (i) the closed-set mode, in which the system presumes that the unknown voice comes from the given set of N speakers, and (ii) the open-set mode, in which the voice presented during testing may not come from the set of N speakers, and such an unknown speaker is known as an imposter. The SV task aims to determine whether a person is who he or she claims to be. This implies that the target speaker must provide a voice sample with an identity claim, and the SV system may accept or reject the speaker on the basis of a successful or unsuccessful verification. Furthermore, both SI and SV tasks are categorized as text dependent or text independent based on the speech modality. Most state-of-the-art ASR systems use text-independent SV tasks for forensic applications.

Most studies on state-of-the-art SR tasks have focused on regular databases, which consist of sets of high-quality speech utterances for each speaker, collected using high-quality microphones under acoustically controlled conditions. Most contemporary SR systems provide excellent recognition performance under matched conditions, i.e., when similar conditions are maintained during the training and testing phases [51]. Unfortunately, the performance of most commercial SR systems degrades dramatically owing to mismatched conditions (i.e., environmental differences) between the training and testing phases. A good example of this mismatch is when training is accomplished with clean speech and testing is performed on speech corrupted by channel or environmental noise. In realistic conditions, mismatch frequently occurs owing to various human and environmental factors (e.g., channel mismatch, noise effects) that greatly contribute to recognition errors. To overcome the problems encountered in current SR systems, the attention of researchers shifted toward robust front-end speech-processing techniques as a method of improvement [31]. The channel variability from training to testing, due to different signal-to-noise ratios (SNRs), kinds of microphone, evolution with time, etc., is the major cause of the mismatched condition. One approach to dealing with the mismatch between the training and testing phases is to reduce the degradation in both training and testing. In another approach, the speaker model parameters are biased toward the testing condition to alleviate the mismatch. The various techniques to handle mismatch in each of these approaches are described in Refs. [42, 43]. However, a mismatched condition is not a serious problem for humans, as they use different levels of information or higher-level cues that are least degraded by noise or channel mismatch [9].

The motivation for the present study draws on recent work on SV under degraded conditions and on observations reported from nature. Human listeners rely mostly on the perceived voice regions of the speech utterances produced by a speaker for SR tasks, while discarding the background, environmental, and channel noise. Among humans, the ability to categorize voiced and unvoiced speech under a variety of environmental settings in conversational speech mostly begins in early infancy [36]. It was observed that 30-day-old infants, under some conditions, are capable of distinguishing their mothers' voices from other female voices [34, 35, 37]. According to those studies, when the mother speaks in her usual fashion, her speech is not only intonated but also addressed directly to the infant. Most humans ignore the silence and background noise and perceive only the high-SNR voiced regions of speech in search of the excitation source information, which carries significant speaker-specific information for recognizing another individual. In conversational speech, voiced regions are high-SNR regions least degraded by noise. Recently, vowel and non-vowel-like region segmentation, and the significance of vowel-like regions for SV, have been studied in Refs. [29, 30]. The process of separating conversational speech and silence is called voice activity detection (VAD) [2], first explored in Ref. [5]. VAD is primarily the front-end subsystem of any SR system: it receives the speech utterances from the target speaker under a variety of degraded conditions and determines the quality of the speech data to be processed. Therefore, the performance of VAD has a profound impact on the performance of SR systems. The performance of most VADs deteriorates with decreasing SNR under degraded conditions. Over the last two decades, numerous researchers have studied different strategies for detecting speech in noise and the influence of the VAD decision on speech-processing systems [8, 12, 13, 24, 25, 33, 54]. Therefore, there is a need for a noise-robust VAD that could make SR systems more robust and provide improved performance under degraded conditions by reducing the mismatch between training and testing.

VAD techniques that use spectral shape as a feature tend to lose the formant structure under low SNRs. Traditionally, VAD methods use acoustic features such as energy, short-term energy, and zero-crossing rates [58]; periodicity measures [57]; linear predictive coefficients [44]; higher-order statistics [28]; long-term spectral divergence [45]; and cepstral features [18]. Unfortunately, the performance of most VAD methods employed in the front end of an SV system degrades under low SNR and when the noise is non-stationary. Therefore, VAD remains an unsolved problem in most speech-processing tasks today. Most existing VAD methods are neither adaptive nor data driven. Real-world data is generated from non-linear non-stationary and stochastic processes, and the production of a speech signal is known to be the result of a dynamic process that is both non-linear and non-stationary [19]. Any VAD method therefore has to be adaptive and data driven to analyze the incoming non-linear non-stationary data and to mitigate the mismatched conditions in the front end of SR systems.

An adaptive, a posteriori, data-driven empirical mode decomposition (EMD) algorithm [22] is well suited to the analysis of non-linear and non-stationary data such as speech. Its ability to reject stationary white noise and fractional Gaussian noise, owing to its dyadic filter bank property, is an added advantage. Recently, an adaptive VAD algorithm using a zero-frequency filter (ZFF) assisted peaking resonator (ZFFPR) and EMD was proposed in Refs. [52, 53]. The basic underlying principle of this method is that EMD decomposes the non-linear non-stationary speech data into a set of discrete modes of oscillation embedded in the data, called intrinsic mode functions (IMFs). The selection of an appropriate IMF, called the characteristic IMF (CIMF), which dominantly contains the source excitation information and provides an important cue about the quality of the voiced region in the speech data, is accomplished using the ZFFPR. The high-SNR voiced speech regions are extracted using this technique, and the extracted voiced regions are further used for feature extraction. The performance of this VAD algorithm was studied for speech recognition tasks with the TIDIGITS database under low SNRs in Ref. [39], where it was compared against a baseline hidden Markov model (HMM)-based speech recognition system with an energy-based VAD at the front end. Those studies showed that the method can perform better than the energy-based VAD under low SNRs. In this article, the adaptive data-driven VAD method using ZFFPR and EMD is integrated into the front end of a text-independent SV system, and the performance is studied under degraded conditions with a subset of the NIST 2003 database consisting of 50 speakers (25 males and 25 females) at SNRs from 0 to 20 dB. Furthermore, the performance is compared against a baseline system that uses an energy-based VAD at its front end.

This article is organized as follows: Section 2 describes EMD; Section 3 describes an adaptive VAD algorithm using ZFFPR and EMD; Section 4 describes a text-independent SV system with Gaussian mixture model (GMM) and universal background model (UBM); and Section 5 discusses the experiments and results.

2 Empirical Mode Decomposition (EMD)

In nature, most data are generated from non-linear non-stationary and stochastic processes. For example, the human speech production system is a dynamic non-linear system that produces non-linear non-stationary speech signals. Traditionally, non-linear processes can be viewed in two ways: the Fourier view and the Poincaré view. The differences between the Fourier view and the Poincaré view, and their limitations, are described in Ref. [23]. As an alternative to these views of non-linear mechanics, a new view, called the Hilbert view, has been proposed; it is based on EMD and Hilbert spectral analysis, as described in Refs. [21, 22]. The non-linear Duffing pendulum, described by a non-dissipative Duffing equation, is a good example of a non-linear system. The non-dissipative Duffing equation representing the non-linear system is given in Ref. [20] as

$$\frac{d^2 x}{dt^2} + x + \varepsilon x^3 = \gamma \cos(\omega t), \tag{1}$$

where the parameter ε is not necessarily small and γ is the amplitude of a periodic forcing function with frequency ω. If ε were zero, the system would be linear; otherwise, it is non-linear. If ε ≪ 1, the system is weakly non-linear and can be solved using perturbation methods. If ε is not small compared with unity, the system is highly non-linear, leading to other phenomena such as bifurcations and chaos, and perturbation methods are in no way applicable. In a classical article [22], Huang et al. noted the interesting fact that intrawave frequency modulation is the hallmark of any non-linear system. The concept of intrawave frequency modulation can be better understood by rewriting the above equation as

$$\frac{d^2 x}{dt^2} + x\left(1 + \varepsilon x^2\right) = \gamma \cos(\omega t). \tag{2}$$

Here, the term within the parentheses can be considered a variable spring constant, or interpreted as a variable pendulum length. As the frequency of oscillation of a pendulum depends on its length from the rigid point at which it hangs, and the length of the pendulum varies during its oscillation, the frequency of oscillation changes from location to location and from time to time. This is called intrawave frequency modulation and is proposed as an indicator of non-linearity; it means that the instantaneous frequency of the system changes within one oscillation cycle. Therefore, the physically meaningful way to describe the system is in terms of its instantaneous frequency, which uncovers the intrawave frequency modulation hidden in the behavior of such systems. When such a system is analyzed using linear Fourier analysis, as in the past, this intrawave frequency modulation cannot be depicted except by resorting to harmonics. For this reason, any non-linear distorted wave has been referred to as a harmonic distortion. In fact, harmonic distortions are mathematical artifacts that impose a linear structure on a non-linear system; they may have a mathematical meaning, but not a physical one [10, 17]. The task now shifts to computing the instantaneous frequency from a real-valued signal. The instantaneous frequency can be computed by representing the signal in analytic form using the Hilbert transform (HT): the HT of a real-valued signal x(t) yields the quadrature component y(t) of its analytic signal. Therefore, using the HT, the analytic signal is defined as

$$z(t) = x(t) + i\,y(t) = a(t)\,e^{i\theta(t)}. \tag{3}$$

From the above, the instantaneous amplitude and instantaneous frequency can be computed as follows: instantaneous amplitude,

$$a(t) = \left(x^2 + y^2\right)^{1/2}, \tag{4}$$

and instantaneous phase,

$$\theta(t) = \arctan\!\left(\frac{y}{x}\right); \tag{5}$$

then the first derivative of the instantaneous phase providing the instantaneous frequency ω(t) is given by

$$\omega(t) = \frac{d\theta(t)}{dt}. \tag{6}$$

The instantaneous frequency f(t) can then be defined as

$$f(t) = \frac{1}{2\pi}\,\frac{d\theta(t)}{dt}, \tag{7}$$

in terms of derivative of phase θ(t). The discrete time instantaneous frequency ω(n) is computed by a central difference scheme as

$$\omega(n) = \frac{1}{2\pi}\,\frac{\theta(n+1) - \theta(n-1)}{2T}, \tag{8}$$

where T is the time interval. Thus, a given time n corresponds to a frequency ω(n) and an amplitude a[n]. On the (n, ω)-plane, each point corresponds to an amplitude that is a function of both time n and frequency ω; but n and ω are not independent, rather they are related by the function ω(n). The triplet (n, ω(n), a(n)) determines a point in the three-dimensional space (n, ω, a). For a given n, one finds a point ω(n), hence a point on the (n, ω)-plane. One can find a[n] for all IMFs, and hence many amplitudes on the (n, ω)-plane. These amplitudes form the discrete Hilbert spectra, referred to as the Hilbert amplitude spectrum (HAS).

The differentiation of the phase θ(t) yields the instantaneous frequency f(t). The derivative must be well defined, as physically there can be only one instantaneous frequency value f(t) at a given time t. This is ensured by the narrow-band condition: the signal must contain essentially one frequency at a time. Furthermore, as detailed by Xu and Zhang [60], the HT produces a more physically meaningful result the closer its input signal is to being narrow band. EMD combined with the HAS provides a time–frequency–amplitude representation of the original data generated from non-linear non-stationary and stochastic processes, and the resulting spectrum is called the Hilbert–Huang spectrum.

Previously, practical applications of the HT were limited to narrow-band-passed signals, which have the same number of extrema and zero crossings. The many mathematical formalities demanded by the HT are described in Ref. [17]. Most real-world signals, however, are broadband. The need to decompose a broadband non-linear non-stationary signal into a set of narrow-band signals (monocomponents) is the key motivation behind the development of EMD. EMD enables us to identify system non-linearity from its output alone by expressing the system behavior in terms of instantaneous frequency; indeed, the novelty of the HT is revealed only after the development of EMD. EMD considers the signal generated from non-linear non-stationary stochastic processes at the local level to formalize the idea that fast oscillations are superimposed on slow oscillations, and it iterates on the slow oscillation component, which is considered a new signal. This one-dimensional decomposition technique extracts a finite number of oscillatory components or "well-behaved" AM–FM functions, called IMFs, directly from the data [23]. EMD extracts the set of IMFs from the data, without making any prior assumption that the data is stationary and linear, using a recursive algorithm called the sifting procedure [22]. The sifting procedure is based on two constraints:

  1. It assumes that the signal has at least two extrema.

  2. An IMF is a function that satisfies two conditions:

    • Over the whole data set, the number of extrema and the number of zero crossings must either be equal or differ at most by one.

    • At any point, the mean value of the envelope defined by the local maxima and the envelope defined by the local minima is zero.

EMD, as originally proposed, is implemented through a sifting process that is summarized as follows:

Step 1: Initialize r0(t) = x(t), i = 1.

Step 2: Procedure to extract the ith IMF:

  (a) Initialize h0(t) = ri–1(t), J = 1.

  (b) Extract all the local minima and maxima of hJ–1(t).

  (c) Interpolate the local maxima and the local minima by cubic splines to construct the upper and lower envelopes of hJ–1(t).

  (d) Calculate the mean mJ–1(t) of the upper and lower envelopes.

  (e) Set hJ(t) = hJ–1(t) – mJ–1(t).

  (f) If hJ(t) satisfies the IMF conditions (or the stopping criterion is met), set IMFi(t) = hJ(t); else go to (b) with J = J + 1.

Step 3: ri(t) = ri–1(t) – IMFi(t).

Step 4: If ri(t) still has at least two extrema, execute Step 2 with i = i + 1; otherwise, ri(t) is the final residue and the decomposition procedure ends.
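For concreteness, the steps above can be condensed into a short NumPy/SciPy sketch. This is a minimal illustration, not the authors' implementation: a fixed iteration cap stands in for the SD-based stopping criterion mentioned later in Section 4.3, spline end effects are ignored, and all names are illustrative.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def sift_imf(r, max_sift=10):
    """Step 2: peel one IMF off the current residue r by repeated sifting."""
    h = r.copy()
    t = np.arange(len(r))
    for _ in range(max_sift):                      # fixed cap instead of the SD criterion
        maxima = argrelextrema(h, np.greater)[0]
        minima = argrelextrema(h, np.less)[0]
        if len(maxima) < 2 or len(minima) < 2:
            break                                  # too few extrema to build envelopes
        upper = CubicSpline(maxima, h[maxima])(t)  # upper envelope, step (c)
        lower = CubicSpline(minima, h[minima])(t)  # lower envelope, step (c)
        h = h - (upper + lower) / 2.0              # h_J = h_{J-1} - m_{J-1}, steps (d)-(e)
    return h

def emd(x, max_imfs=10):
    """Steps 1, 3, 4: extract IMFs until the residue is (nearly) monotonic."""
    r = np.asarray(x, dtype=float).copy()          # r_0(t) = x(t)
    imfs = []
    for _ in range(max_imfs):
        n_ext = (len(argrelextrema(r, np.greater)[0])
                 + len(argrelextrema(r, np.less)[0]))
        if n_ext < 2:
            break                                  # Step 4: residue has < 2 extrema
        imf = sift_imf(r)
        imfs.append(imf)
        r = r - imf                                # Step 3: r_i = r_{i-1} - IMF_i
    return np.array(imfs), r                       # x(t) = sum of IMFs + residue
```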

Finally, when the EMD procedure is completed after n iterations, the original signal can be reconstructed as

$$x(t) = \sum_{J=1}^{n} \mathrm{IMF}_J(t) + r_{\mathrm{final}}(t), \tag{9}$$

where n is the number of IMFs, IMF_J is the Jth IMF, and r_final is the final residue (a monotonic function). In practice, no data is free of noise, because noise is part of our existence. Our data is an amalgamation of a signal s(t) and noise n(t), and can be written as

$$x(t) = s(t) + n(t). \tag{10}$$

Noise may enter our scientific studies in many ways. It may come from sensors, recording systems, or transmission channels, or it could be part of the natural processes generated by local instabilities. The characteristics of white noise under EMD were studied by Wu and Huang [59]. Their study reveals that EMD effectively acts as a dyadic filter bank: a collection of band-pass filters that have a constant band-pass shape (e.g., a Gaussian distribution) but with neighboring filters covering half or double the frequency range of any single filter in the bank; the frequency ranges of the filters can overlap [11]. The study shows that the IMF components are all normally distributed, and the Fourier spectra of the IMF components are all identical and cover the same area on a semilogarithmic period scale. This helps assign a statistical significance to the information content of IMF components extracted from noisy data. EMD as a dyadic filter bank, resembling those involved in wavelet decompositions, effectively filters white noise and fractional Gaussian noise [11]. This property is attractive for denoising the speech signal. More details on the statistical significance of the IMFs obtained through EMD are described in Refs. [20, 59, 60].

Like any other method, EMD has one major limitation: mode mixing, first described in Ref. [22]. During the practical sifting process, EMD may place data of different time scales in the same IMF, which then lacks physical meaning. Mode mixing is defined as any IMF consisting of oscillations of dramatically disparate scales, often caused by the intermittency of the driving mechanisms. When mode mixing occurs, an IMF can cease to have physical meaning by itself, suggesting falsely that different physical processes are represented in a single mode. The research community has strived to eliminate the mode mixing phenomenon in EMD.

EMD can determine whether data contain one or two frequencies, provided the components differ substantially in frequency [49]. Let f1 and f2 be the frequencies of two harmonics, and let A1 and A2 be their amplitudes. EMD can separate the two harmonics in composition provided A2/A1 ≥ (f1/f2)². For example, let the original signal of interest have amplitude A1 = 1.0 and frequency f1 = 10 Hz, and let another, high-frequency signal representing noise have amplitude A2 = 0.5 and frequency f2 = 50 Hz. These two signals are added to synthesize a noisy signal, i.e., original signal + noise, as shown in Figure 1A. The two components are well separated when the noisy signal is decomposed using EMD, as shown in Figure 1B. In this case, the EMD of the noise-added signal results in two IMFs.
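The Figure 1 experiment is easy to reproduce. The sketch below synthesizes the two-tone signal and decomposes it with the emd() function sketched in the previous section (a mature package such as PyEMD could be substituted); the sampling rate and duration are illustrative choices, not taken from the paper.

```python
import numpy as np

fs = 1000                                   # sampling rate in Hz (assumed)
t = np.arange(0.0, 1.0, 1.0 / fs)
tone = 1.0 * np.sin(2 * np.pi * 10 * t)     # signal of interest: A1 = 1.0, f1 = 10 Hz
noise = 0.5 * np.sin(2 * np.pi * 50 * t)    # "noise" tone: A2 = 0.5, f2 = 50 Hz
x = tone + noise                            # noisy signal, as in Figure 1A

imfs, residue = emd(x)                      # emd() from the sketch above
# A2/A1 = 0.5 is far above (f1/f2)^2 = 0.04, so the separation condition holds:
# the first IMF should recover the 50-Hz tone and the next the 10-Hz tone.
```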

Figure 1: (A) Synthesis of Noisy Signal. (B) EMD of Noisy Signal.

In contrast, a second harmonic with a triple or lower frequency (f2 ≤ 3f1) and a small amplitude (A2 ≤ 0.11A1) cannot be separated by EMD. This means that EMD does not perform well for small amplitudes of the second harmonic and cannot distinguish frequencies that are close together. For example, if the frequencies lie within an octave of each other (f2 ≤ 2f1) and the amplitude of the second harmonic is less than one quarter of the first (A2 ≤ 0.25A1), EMD is unable to separate the two components [10]. Figure 1B shows that EMD can separate low-frequency components from high-frequency components. This is an attractive property of EMD, used in the proposed method to adaptively extract the low-frequency excitation source information from the high-frequency system resonances and noise present in the speech signal. The mode mixing phenomenon in EMD may not be a bottleneck in the proposed method, mainly because the system resonances lie at much higher frequencies than the excitation source component present in the speech signal.

3 Adaptive VAD Algorithm Using ZFFPR and EMD

3.1 Zero-Frequency Filter

The ZFF, basically an all-pole filter, was first introduced by Murty and Yegnanarayana [38]. The characteristic of an all-pole filter is that its frequency response function goes to infinity (poles) at specific frequencies, but there are no frequencies at which the response function is zero. Speech is the result of the convolution of the time-varying vocal tract transfer function with the impulse-like excitation signal generated by the periodic vibration of the vocal folds during voicing. The instant of significant excitation of the vocal tract system is referred to as the epoch. The instant when significant excitation is delivered to the vocal tract system is embedded in the coupling between the source and the filter. For epoch extraction from the speech signal, the zero-frequency resonator exploits the fact that the vocal tract system resonances exist at frequencies much higher than zero frequency, so the characteristics of the time-varying vocal tract system do not affect the characteristic information at the discontinuities due to impulse-like excitation in the resonator output when the speech signal is passed twice through it [38]. The ZFF is realized using the following three steps:

  1. Difference the speech signal s[n] (to remove any time-varying low-frequency bias in the signal). This corresponds to the pre-emphasis on speech signal as in Eq. (11) given below:

    $$x[n] = s[n] - s[n-1]. \tag{11}$$
  2. Pass the differenced speech signal x[n] twice through an ideal resonator at zero frequency. That is,

    $$y_1[n] = -\sum_{k=1}^{2} a_k\, y_1[n-k] + x[n], \qquad y_2[n] = -\sum_{k=1}^{2} a_k\, y_2[n-k] + y_1[n], \tag{12}$$

    where a1 = –2 and a2 = 1. As described in Ref. [38], passing the signal twice through this resonator is equivalent to four successive integrations.

  3. Remove the trend in y2 [n] by subtracting the average over 10 ms at each sample. The resulting signal is given by Eq. (13) below:

    $$y[n] = y_2[n] - \frac{1}{2N+1}\sum_{m=-N}^{N} y_2[n+m], \tag{13}$$

    where y[n] is called the ZFF signal and the term 2N+ 1 corresponds to the number of samples in the 10-ms interval.
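The three ZFF steps translate directly into a few lines of code. The following is a minimal sketch under assumed conventions (a single trend-removal pass is shown, though implementations often repeat it; function and variable names are illustrative):

```python
import numpy as np
from scipy.signal import lfilter

def zero_frequency_filter(s, fs, win_ms=10.0):
    """Minimal sketch of the three ZFF steps of Eqs. (11)-(13)."""
    x = np.diff(s, prepend=s[0])          # Step 1, Eq. (11): remove low-freq bias
    a = [1.0, -2.0, 1.0]                  # double pole at z = 1 (zero frequency)
    y1 = lfilter([1.0], a, x)             # Step 2, Eq. (12): first pass
    y2 = lfilter([1.0], a, y1)            # second pass (four integrations in total)
    N = int(round(0.5 * win_ms * 1e-3 * fs))
    kernel = np.ones(2 * N + 1) / (2 * N + 1)
    trend = np.convolve(y2, kernel, mode="same")
    return y2 - trend                     # Step 3, Eq. (13): the ZFF signal
```

The negative-to-positive zero crossings of the returned signal approximate the epoch locations, and the intervals between them give the fundamental period used in Section 3.3.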

3.2 Peaking Resonator (PR)

The peaking resonator (PR) filter can be designed in a fashion similar to that of the notch filter as described in Ref. [40]. The transfer function of the second-order analog resonator filter is given in Eq. (14) below:

$$H_a(s) = \frac{\alpha s}{s^2 + \alpha s + \Omega_0^2}, \tag{14}$$

which has the frequency and magnitude response

$$H_a(\Omega) = \frac{j\alpha\Omega}{\Omega_0^2 - \Omega^2 + j\alpha\Omega} \tag{15}$$

and

$$\lvert H_a(\Omega)\rvert^2 = \frac{\alpha^2\Omega^2}{\left(\Omega^2 - \Omega_0^2\right)^2 + \alpha^2\Omega^2}, \tag{16}$$

respectively. Note that Ha(Ω) is normalized to unity gain at the peak frequency Ω = Ω0. The bandwidth frequencies Ω1 and Ω2 satisfy the bandwidth condition:

$$\lvert H_a(\Omega)\rvert^2 = \frac{\alpha^2\Omega^2}{\left(\Omega^2 - \Omega_0^2\right)^2 + \alpha^2\Omega^2} = G_B^2, \tag{17}$$

in quartic form

$$\Omega^4 - \left(2\Omega_0^2 + \frac{1 - G_B^2}{G_B^2}\,\alpha^2\right)\Omega^2 + \Omega_0^4 = 0. \tag{18}$$

On simplifying the above equation using the bilinear transformation as described in Ref. [40], the discrete-domain transfer function is given by

$$H(z) = (1-b)\,\frac{1 - z^{-2}}{1 - 2b\cos\omega_0\, z^{-1} + (2b-1)\,z^{-2}}, \tag{19}$$

where H(z) is the transfer function of the digital PR filter. The frequency response of the ZFF and PR is shown in Figure 2A and B, respectively.

Figure 2: (A) Frequency Response of the Zero-Frequency Filter. (B) Frequency Response of the Peaking Resonator.
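A digital PR implementing Eq. (19) can be sketched as follows, assuming the common 3-dB bandwidth design (G_B = 1/√2, so that b = 1/(1 + tan(Δω/2)) under the bilinear transformation); the 30-Hz default anticipates the PR bandwidth used in Section 4.3, and all names are illustrative.

```python
import numpy as np
from scipy.signal import lfilter

def peaking_resonator(x, f0, fs, bw_hz=30.0):
    """Minimal sketch of the digital peaking resonator of Eq. (19).
    f0: resonant frequency (e.g., derived from the ZFF); bw_hz: 3-dB bandwidth."""
    w0 = 2.0 * np.pi * f0 / fs                  # resonant frequency, rad/sample
    dw = 2.0 * np.pi * bw_hz / fs               # bandwidth, rad/sample
    beta = np.tan(dw / 2.0)                     # bilinear bandwidth mapping
    bpar = 1.0 / (1.0 + beta)                   # the parameter b of Eq. (19)
    num = (1.0 - bpar) * np.array([1.0, 0.0, -1.0])
    den = np.array([1.0, -2.0 * bpar * np.cos(w0), 2.0 * bpar - 1.0])
    return lfilter(num, den, x)                 # filtered output
```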

3.3 Zero-Frequency Filter Assisted Peaking Resonator (ZFFPR)

The speech signal is decomposed using adaptive data-driven EMD, which provides a set of IMFs produced in the order of increasing scale. The set of IMFs produced during EMD of noisy real data can be categorized into three types (the IMF indices below refer to the representative example analyzed in Ref. [26]):

  1. Noisy: IMFs 1–4 are wide band as they clearly contain noise.

  2. Transition: IMFs 5–7 contain both signal and noise. These IMFs capture the transition from the noise captured in IMFs 1–4 to the monochromatic components extracted as IMFs 8–11.

  3. Monochromatic: IMFs 8–11 are nearly monochromatic and yield meaningful instantaneous frequency (IF) contributions.

The selection of the physically meaningful IMF, i.e., the one among the set of IMFs that represents the excitation source information, is a challenging task. One simple approach is to select the IMF with the highest energy; however, the energy of an IMF alone should not be the criterion. The physically meaningful IMF should contain information about the instants of significant excitation of the vocal tract, i.e., epoch information, in addition to significant energy. The regions that immediately follow the epochs are more robust to external degradation and play an important role in human perception; because of these epochs in speech, humans are mostly able to perceive speech at a considerable distance from the source even when the spectral components of the signal experience attenuation of around 60 dB, as described in Ref. [38].

To solve the problem of selecting the IMF that represents the excitation source information from the set of IMFs produced during EMD of the speech signal, the ZFFPR is used. The ZFF is combined with the PR in such a manner that the PR resonates at a frequency determined by the ZFF; this combination is known as the ZFFPR. The speech signal is passed through the ZFF, and the fundamental frequency of the speech signal, i.e., the reciprocal of the period between the glottal closure instants (epochs), is determined. The PR is then set to resonate at this fundamental frequency. EMD of the speech signal provides a set of IMFs. Among them, the particular IMF that transfers the maximum energy through the PR, when the PR resonates at the predetermined fundamental frequency, is referred to as the CIMF. When the ZFFPR is used in the EMD space, it becomes an efficient a posteriori adaptive data-driven strategy for VAD and for voiced and unvoiced speech classification.

3.4 Design Framework for the VAD Algorithm Using ZFFPR and EMD

In this article, an adaptive data-driven framework for voiced speech detection using EMD combined with ZFFPR, as proposed in Refs. [52, 53], is shown in Figure 3. In this framework, speech is decomposed into a set of IMFs using EMD, and the ZFF (an infinite impulse response filter) is combined with the PR to select the physically meaningful IMF.

The PR can be made to resonate at a specified frequency, called the resonant frequency (fr), derived from the ZFF. If the incoming signal contains a dominant component at fr, then the maximum energy is transferred through the PR; that is, the PR operates in its pass-band region, where its frequency response function is maximal. This can be verified by computing the energy of the PR output signal. If the incoming signal contains no component at fr, then little or nothing passes through the PR; the energy transfer is zero or minimal, i.e., the PR operates in its attenuation band. In this way, the PR can be used like a lens to inspect whether the incoming signal contains a component at the specified resonant frequency fr.

It is well established that speech is the result of the convolution of the impulse-like excitation produced by the vibration of the vocal folds during voicing and the impulse response of the vocal tract transfer function. Vocal tract resonances exist at frequencies much higher than the excitation signal frequency. The frequency of the impulse-like excitation to the vocal tract system is called the fundamental frequency f0, and its reciprocal is called the pitch period, a characteristic feature of the speaker. If the PR is made to resonate at a rough estimate of f0, determined by any of the existing methods for the speech data of a particular speaker, then, when the IMFs are passed through it, the PR is capable of identifying the characteristic IMF, the one that dominantly contains speaker-specific information, from which the glottal and non-glottal activity regions in the speech data can be distinguished. Furthermore, it is established that the ZFF is noise robust and can effectively estimate f0 from the speech data of a particular speaker even under degraded conditions. This motivated us to combine the PR and the ZFF such that the PR resonates at the f0 derived from the ZFF. The speech signal is decomposed using EMD and simultaneously passed through the ZFF; the PR, resonating at the f0 determined from the ZFF, then identifies the characteristic IMF among the set of IMFs obtained from the EMD of the speech data, from which voiced and unvoiced regions can be classified. As the ZFF itself is capable of voiced/unvoiced classification, such a combination of PR and ZFF would be of little importance on its own; however, it can be used effectively as a new method for selecting an IMF from the set of IMFs produced through EMD of speech data. The voiced and unvoiced speech detection algorithm using ZFFPR in the EMD space is given below:

  1. When speech is decomposed using EMD, it produces a set of IMFs and residue, i.e., trend.

  2. Identification and selection of an IMF from the set of IMFs that dominantly consist of source excitation information is accomplished by combining the PR with ZFF.

  3. Speech data is simultaneously passed through the ZFF and decomposed using the EMD that yields the set of IMFs and residue. Using the ZFF, the f0 of speech is determined and the PR is set to resonate at this f0.

  4. The energy of each IMF is computed before passing each IMF through the PR.

  5. Each of the IMFs is passed through the PR and the energy of the IMF at the output of the PR is computed. In this process, the filtered IMF with the maximum energy transfer through the PR is called the CIMF. This is the IMF that contains glottal information dominantly at the specified fundamental frequency, which was derived from the ZFF. This step is different from that used in our earlier proposed method in Refs. [52, 53].

  6. The CIMF preserves well all the characteristics of the original speech data. This CIMF is expected to provide an important cue to make voiced and unvoiced decisions.
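Steps 4 and 5 can be sketched as below, reusing the peaking_resonator() sketch from Section 3.2. Whether the selection criterion is the raw output energy or the input-to-output energy ratio is an implementation detail not fixed by the text; the ratio is used here as an assumption.

```python
import numpy as np

def select_cimf(imfs, f0, fs, bw_hz=30.0):
    """Minimal sketch of CIMF selection: pass each IMF through the PR tuned
    to the ZFF-derived f0 and keep the one with the largest energy transfer."""
    ratios = []
    for imf in imfs:
        e_in = np.sum(imf ** 2) + 1e-12            # IMF energy before the PR (step 4)
        filtered = peaking_resonator(imf, f0, fs, bw_hz)
        e_out = np.sum(filtered ** 2)              # energy at the PR output (step 5)
        ratios.append(e_out / e_in)                # fraction of energy transferred
    return imfs[int(np.argmax(ratios))]            # the characteristic IMF (CIMF)
```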

Figure 3: Adaptive Data-Driven Framework for VAD Using ZFFPR and EMD.

The method of identification and selection of the IMF that contains the excitation source information is the first of its kind among the various criteria for the selection of IMFs. This approach is used to develop an adaptive data-driven framework for VAD in the EMD space. As system resonances exist at a frequency much higher than the frequency of vibration of the vocal folds, the mode mixing phenomenon, which is the major limitation in EMD, is not a bottleneck in this framework. In case mode mixing exists, there are several techniques to alleviate this mode mixing phenomenon; however, the discussion of techniques to eliminate mode mixing is beyond the scope of this article. Some of the salient properties of this proposed method are as follows:

  1. The method is adaptive and data driven (inherent property of EMD).

  2. The method does not rely on any a priori statistics of the noise present in the speech data.

  3. The ZFFPR determines the physically meaningful IMF that consists dominantly of the source information from the set of IMFs produced from EMD of the speech data irrespective of the environment from which the speech data is collected.

  4. The method is suitable for all data that are non-linear non-stationary and linear stationary (inherent property of EMD).

4 SV Using GMM–UBM

4.1 Database

The performance of the SV system is evaluated on a subset of the NIST 2003 SR database consisting of 50 speakers for testing and 50 speakers for training; among these 50 speakers, 25 are male and 25 are female. This study includes the development of a baseline SV system with the energy VAD integrated at the front end, and of the proposed SV system with the adaptive VAD using ZFFPR and EMD integrated at the front end. Six types of noises (white, hfchannel, pink, volvo, buccaneer1, and destroyer engine) from the NOISEX-92 database are used to create noise-degraded speech at SNR levels of 0, 5, 10, 15, and 20 dB. The performance of the SV system is then evaluated on this NIST 2003 subset, with original training speech and noise-degraded test speech, to compare the baseline SV system with the energy VAD at its front end against the proposed SV system with the adaptive ZFFPR-EMD VAD at its front end.

In this study, the database is limited to 50 speakers owing to the computational complexity of the EMD algorithm, which takes a long time to decompose the speech utterances of each speaker. Furthermore, integrating EMD with the conventional signal-processing pipeline of a traditional SR system is the first attempt of its kind and is yet to mature. This work may therefore be considered a feasibility study of how a traditional SR system performs in the EMD space under degraded conditions.

4.2 Block Diagram of the SV System

The block diagram showing the modular representation of an SV system is given in Figure 4. The SV process is divided into two phases of operation: the training phase and the testing phase. In the training phase, a reference model (target model) is built for each speaker using the training speech. In the testing phase, the similarity between the test speech and the claimed model is compared against a verification threshold to make the verification decision. The speech data used for the development of speaker models and for testing contains redundant and unwanted information; it is therefore processed through several intermediate stages to suppress the redundant information present in the acoustic speech signal, as follows.

Figure 4: Modular Representation of an SV System. (A) Enrollment Phase. (B) Testing Phase.

4.3 Voice Activity Region Extraction Using Energy-Based VAD and EMD-Based VAD

Most SV systems employ VAD in their front end. During the training and testing phases of an SV system, the speech data is first processed through a voice activity detector to separate the speech regions from the non-speech regions. Although VAD is a simple binary classification task, it is very difficult to implement a VAD that works consistently for different types of speech data. Owing to its simplicity of implementation and low computational complexity, signal energy is used for VAD in most SV studies. The short-term energy is computed for each analysis frame of the speech signal normalized to [–1, 1], and the frames with energy above a certain threshold are considered speech frames; only these frames are processed further for SV. In this study, the baseline SV system uses the energy-based VAD in its front end, and the proposed SV system uses the adaptive VAD algorithm based on ZFFPR and EMD [52]. The most challenging task is to control the number of sifting iterations so that the IMFs obtained during EMD of the data remain meaningful. As described in Refs. [20, 22, 59], the number of sifting iterations is usually chosen to be between 10 and 15. In this study, the number of sifting iterations is fixed at 10, the number of IMFs is fixed at 10, and the standard deviation (SD) stopping threshold is set to 0.5. The selection of the CIMF from the set of IMFs using ZFFPR is accomplished with a fixed PR bandwidth of 30 Hz, with the PR made to resonate at the fundamental frequency determined by the ZFF. The selected CIMF is subjected to block processing with a frame size of 20 ms and a frame shift of 10 ms, and the frame energy is computed. Traditional signal-processing steps are then followed to extract the voice activity regions in speech, as described in Refs. [52, 53].
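For reference, the baseline energy VAD amounts to the following minimal sketch (the threshold value is illustrative, not taken from the paper):

```python
import numpy as np

def energy_vad(speech, fs, frame_ms=20, shift_ms=10, threshold=0.01):
    """Minimal sketch of the baseline energy VAD: frame the normalized
    signal and keep frames whose short-term energy exceeds a threshold."""
    speech = speech / (np.max(np.abs(speech)) + 1e-12)   # normalize to [-1, 1]
    frame = int(frame_ms * 1e-3 * fs)
    shift = int(shift_ms * 1e-3 * fs)
    starts = range(0, len(speech) - frame + 1, shift)
    energies = np.array([np.mean(speech[s:s + frame] ** 2) for s in starts])
    return energies > threshold                          # True = speech frame
```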

4.4 Feature Extraction

The voice activity regions extracted from speech using the energy VAD and the EMD-based VAD are transformed into a parametric representation (called features) that encapsulates speaker identity. This process, referred to as feature extraction, is one of the most critical stages in any pattern recognition task. Among the several paradigms for feature extraction, the Mel frequency cepstral coefficient (MFCC) paradigm, first introduced in Ref. [7], has preserved its predominance in the area of speech and SR, and it is employed in this work. In both training and testing, the speech signal is processed in frames of 20-ms duration with a 10-ms frame shift. For each 20-ms Hamming-windowed frame, MFCCs are calculated using 22 logarithmically spaced filter banks designed to match the frequency sensitivity of the human ear [7]. The first 13 coefficients, excluding the zeroth coefficient, are used as the base feature vector. The delta and delta–delta parameters of the MFCCs are computed using the two preceding and two succeeding feature vectors around the current feature vector [55]. The resulting feature vector thus has 39 dimensions: 13 MFCCs, 13 ΔMFCCs, and 13 ΔΔMFCCs.
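A sketch of this 39-dimension front end, using librosa's MFCC and delta routines as stand-ins for the paper's implementation (the 22 mel filters here approximate the 22-filter bank described above), could look as follows:

```python
import numpy as np
import librosa

def extract_features(voiced, fs):
    """Sketch of the 39-dim MFCC front end: 13 MFCCs + 13 deltas + 13 delta-deltas."""
    mfcc = librosa.feature.mfcc(
        y=voiced, sr=fs, n_mfcc=14,            # c0..c13; c0 dropped below
        n_fft=int(0.020 * fs),                 # 20-ms frames
        hop_length=int(0.010 * fs),            # 10-ms frame shift
        window="hamming", n_mels=22)           # Hamming window, 22 mel filters
    mfcc = mfcc[1:, :]                         # exclude the zeroth coefficient
    delta = librosa.feature.delta(mfcc, width=5)             # +/- 2 frames
    delta2 = librosa.feature.delta(mfcc, order=2, width=5)   # acceleration
    return np.vstack([mfcc, delta, delta2]).T  # shape: (frames, 39)
```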

4.5 Speaker Modeling

The GMM–UBM is a widely accepted modeling technique in the SR community. The GMM is a stochastic model widely used in text-independent SR tasks [47]. An important characteristic of the GMM is that it represents the distribution of the feature vectors in a multidimensional space through component means and variances (the scattering around the means), assuming each component's distribution to be Gaussian; it adopts the multivariate Gaussian probability density for parameterization. Pattern matching is then simply formulated as measuring the probability density (or the log-likelihood) of an observation vector given the speaker model. The likelihood of an input feature vector given a specific GMM is the weighted sum of the likelihoods of the M unimodal Gaussian densities:

$$p(x \mid \lambda) = \sum_{i=1}^{M} w_i\, b_i(x), \tag{20}$$

where b_i(x) is the likelihood of x given the ith Gaussian component of the model λ:

$$b_i(x) = \frac{1}{(2\pi)^{D/2}\,\lvert\Sigma_i\rvert^{1/2}}\, e^{-\frac{1}{2}(x - \mu_i)^{T}\Sigma_i^{-1}(x - \mu_i)}, \tag{21}$$

where D is the dimension of the feature vector, and μi and Σi are the mean vector and covariance matrix of the ith component, estimated from the training vectors. The mixture weights wi are constrained to be positive and to sum to 1. The GMM parameters λ = {wi, μi, Σi} are estimated from the training feature vectors using the maximum likelihood criterion through an expectation–maximization algorithm [3]. An extension of this modeling approach is the GMM–UBM approach, which has become the most popular in text-independent SV tasks. The GMM–UBM approach adapts the speaker model from a UBM instead of training the speaker GMM directly on the training speech. A UBM is a model used in text-independent SV tasks to represent general, person-independent feature characteristics, to be compared against a model of person-specific feature characteristics when making an "accept or reject" decision [48]. This approach has several advantages: it can be used for score normalization in classification, it requires less training data than training the speaker model directly, and it is very useful in real-time implementations where the amount of training data from the speakers is usually restricted. Hence, the extensively used GMM–UBM-based speaker modeling is employed in this study. The UBM is represented by a weighted sum of C component densities, c = 1, 2, …, C, where ωc, μc, and Σc are the weight, mean vector, and covariance matrix associated with each mixture c, respectively. The speaker-dependent models are built by adapting the components of the UBM with the speaker's training speech using the maximum a posteriori (MAP) algorithm [48]. During the testing stage, the log-likelihood scores are calculated from both the claimed model and the UBM.
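The GMM–UBM pipeline can be sketched with scikit-learn as a stand-in for the authors' implementation. The mean-only relevance-MAP adaptation and the relevance factor r = 16 are common choices from the GMM–UBM literature [48], assumed here rather than taken from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(pooled_features, n_mix=1024):
    """Fit the UBM by EM over frames pooled from all background speakers."""
    ubm = GaussianMixture(n_components=n_mix, covariance_type="diag",
                          max_iter=100, reg_covar=1e-6)
    ubm.fit(pooled_features)
    return ubm

def map_adapt_means(ubm, speaker_feats, r=16.0):
    """Adapt only the UBM means toward the speaker data (Reynolds-style MAP)."""
    post = ubm.predict_proba(speaker_feats)          # (T, M) responsibilities
    n_c = post.sum(axis=0)                           # soft frame counts per mixture
    ex = (post.T @ speaker_feats) / (n_c[:, None] + 1e-12)   # first-order stats
    alpha = n_c / (n_c + r)                          # data-dependent coefficient
    return alpha[:, None] * ex + (1 - alpha)[:, None] * ubm.means_

def verification_score(test_feats, speaker_means, ubm):
    """Average log-likelihood ratio: claimed (adapted) model vs. the UBM."""
    spk = GaussianMixture(n_components=ubm.n_components, covariance_type="diag")
    spk.weights_, spk.means_ = ubm.weights_, speaker_means
    spk.covariances_ = ubm.covariances_
    spk.precisions_cholesky_ = ubm.precisions_cholesky_   # reuse UBM covariances
    return spk.score(test_feats) - ubm.score(test_feats)
```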

4.6 Performance Comparison

In SV, the goal is to determine whether a person is who he or she claims to be. This demands that the user provide an identity claim, and the SV system simply accepts or rejects the user according to a successful or unsuccessful verification. The performance of an SV system can be characterized by the false acceptance rate (FAR; i.e., situations where an imposter is accepted) and the false rejection rate (FRR; i.e., situations where the true speaker is incorrectly rejected), which in detection theory are referred to as False Alarm and Miss, respectively [9]. In most biometric system implementations, there exists a threshold that can be adjusted to trade off the two error rates. Usually, these error rates are expressed as percentages. The performance of most biometric systems, including SV, is expressed in terms of the equal error rate (EER), the operating point at which the false rejection rate (FRR) equals the false acceptance rate (FAR). The EER point is usually marked on the detection error trade-off (DET) curve, which plots the Miss probability versus the False Alarm probability, both in percentage [32]. The relative improvement in the performance of the proposed SV system over the baseline system is computed using the EER as follows:

$$\mathrm{EER}_R = \frac{\mathrm{EER}_B - \mathrm{EER}_P}{\mathrm{EER}_B} \times 100, \tag{22}$$

where EERR, EERB, and EERP are the relative improvement in EER, the EER of the baseline SV system, and the EER of the proposed SV system using EMD-based VAD, respectively.
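As a quick check, Eq. (22) applied to the white-noise, 10-dB entries of Table 1 reproduces the corresponding entry of Table 2:

```python
def relative_eer_improvement(eer_baseline, eer_proposed):
    """Eq. (22): relative improvement in EER, in percent."""
    return (eer_baseline - eer_proposed) / eer_baseline * 100.0

# White noise at 10 dB (Table 1): baseline EER 38%, proposed EER 12%
print(round(relative_eer_improvement(38.0, 12.0), 2))  # 68.42, as in Table 2
```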

5 Experiment, Results, and Discussion

In this experiment, the performance of the VAD algorithm using ZFFPR and EMD, referred to as the EMD VAD, is studied for speech degraded by a wide range of noises at SNR levels from 0 to 20 dB. Also, for the UBM speech (251 male and 251 female speakers), the proposed EMD VAD is studied under degraded conditions with the six types of noises chosen from the NOISEX-92 database at SNR levels from 0 to 20 dB. Figure 5A and B show the typical performance of the EMD VAD for 180 s of male speech degraded with white noise and hfchannel noise, respectively, at an SNR level of 0 dB. Similarly, the performance of the EMD VAD for the 251 male and 251 female UBM speaker utterances after degradation with the six types of noises at SNR levels varying from 0 to 20 dB is studied extensively across speakers.

Figure 5: Performance of EMD-Based VAD. (A) VAD for 180-s UBM Male Speech with White Noise. (B) VAD for 180-s UBM Male Speech with Hfchannel Noise.

After the performance of the EMD VAD is ensured, the voice activity regions in the 180 s of UBM data (251 male and 251 female speakers) are extracted using the EMD VAD, and the beginning and end of each voiced region are determined. A similar procedure is employed for the NIST 2003 subset consisting of 50 speakers' speech data in the test set and the train set (25 male and 25 female speakers). Then, the test speech utterances are degraded separately with white, hfchannel, pink, volvo, buccaneer1, and destroyer engine noises from the NOISEX-92 database at SNR levels from 0 to 20 dB, the voice activity regions are extracted using the EMD VAD, and the corresponding reference markings are obtained in each case separately. The same procedures are applied with the energy-based VAD for the NIST 2003 test set, train set, and 180 s of UBM data (251 male and 251 female speakers). Then, the voice activity regions extracted using the energy-based VAD and the EMD VAD are segmented into frames of 20 ms, processed with a frame shift (frame overlap) of 10 ms, and 39-dimension MFCC feature vectors are obtained as described in Section 4.4 above, for the NIST 2003 test set, train set, and UBM data (251 male and 251 female speakers).

Then, using the features extracted from the voiced regions for the energy VAD and the EMD VAD, two gender-dependent 512-mixture GMMs are built, one for male speech and the other for female speech, for both the baseline system and the proposed system. Finally, a 1024-mixture gender-independent UBM is built by pooling the two models (one each for male and female speakers) and normalizing the weights, for both the baseline system and the proposed system. During model adaptation and testing, the respective UBM is used. In both systems, i.e., the baseline system with the energy VAD and the proposed system with the EMD VAD, training is accomplished with the original NIST 2003 speech corresponding to 50 speakers (25 males and 25 females) from the train set, and testing is carried out using the original NIST 2003 speech from the test set as well as test speech degraded separately with the six types of noises (white, hfchannel, volvo, pink, buccaneer1, and destroyer engine) chosen from the NOISEX-92 database. The performance of the baseline SV system and the proposed SV system is evaluated using the EER in percentage, as described in Section 4.6 above, and the DET curve, which plots the Miss probability versus the False Alarm probability, both in percentage, for each degraded condition, as shown in Figure 6.

Figure 6: Performance of SV in EER(%) for (A) Original NIST Speech, and NIST Speech Degraded with White Noise at (B) 0 dB, (C) 5 dB, (D) 10 dB, (E) 15 dB, and (F) 20 dB.

The performance in terms of EER(%) for the baseline SV system with the front-end energy VAD and the proposed SV system with the front-end EMD VAD under degraded conditions is given in Table 1.

Table 1

Performance of SV in Terms of EER(%) with the Energy VAD and the Proposed EMD VAD as the Front-End Signal-Processing System.

Noise type             Type of VAD   0 dB   5 dB   10 dB   15 dB   20 dB
White                  Energy VAD    40     42     38      34      32
                       EMD VAD       38     18     12      10      8
Hfchannel              Energy VAD    42     40     40      36      34
                       EMD VAD       36     20     18      14      12
Pink                   Energy VAD    46     40     36      34      32
                       EMD VAD       38     28     12      10      6
Volvo                  Energy VAD    34     34     34      30      30
                       EMD VAD       20     12     8       6       6
Buccaneer1             Energy VAD    42     38     36      34      34
                       EMD VAD       41     32     20      10      10
Destroyer engine       Energy VAD    40     40     36      30      30
                       EMD VAD       38     18     12      8       8
Original NIST speech   Energy VAD    8
                       EMD VAD       8

The relative improvement in verification performance of the proposed system over the baseline system is determined as described in Section 4.6 under degraded conditions at SNR levels from 0 to 20 dB for different types of noises, as shown in Table 2.

Table 2

Performance of SV in Terms of EERR(%) with Energy VAD and Proposed EMD VAD as the Front-End Signal Processing System.

Noise type             Relative performance improvement, EERR(%), at different SNR levels
                       0 dB    5 dB    10 dB   15 dB   20 dB
White                  5.00    57.14   68.42   70.59   75.00
Hfchannel              14.29   50.00   55.00   44.44   64.71
Pink                   17.39   30.00   66.66   70.59   81.25
Volvo                  41.18   64.71   76.47   80.00   80.00
Buccaneer1             2.38    15.79   44.44   70.59   70.59
Destroyer engine       5.00    55.00   66.66   73.33   73.33
Original NIST speech   0

6 Conclusion and Future Works

In this study, an adaptive VAD algorithm using ZFFPR and EMD is integrated into the front end of an SV system. The proposed SV system uses a 39-dimension augmented MFCC feature vector (13-dimension MFCCs augmented with 13-dimension velocity and 13-dimension acceleration coefficients) and the GMM–UBM technique for speaker modeling. The proposed SV system showed encouraging results under simulated degraded conditions when compared with the baseline SV system that uses the energy-based VAD at its front end. The relative performance improvement shows that the adaptive EMD-based VAD is better suited than the energy-based VAD for SV tasks. This improvement is mostly attributed to the adaptive data-driven philosophy and noise-filtering abilities of EMD, and to the noise robustness of the ZFF, which accurately determines the fundamental frequency under degraded conditions. The strategy of detecting and selecting physically meaningful IMFs, i.e., CIMFs that preserve the source excitation information, using ZFFPR is the first of its kind.

An adaptive data-driven VAD algorithm using ZFFPR and EMD is a relatively new technique. Quantitative analysis and optimization of the adaptive VAD using ZFFPR in the EMD space may further improve the performance of an SV system under degraded conditions. Furthermore, it should be noted that the voiced regions detected under degraded conditions using the adaptive EMD-based VAD are not free from degradation. Applying an appropriate noise reduction technique to the degraded voiced regions before feature extraction may further enhance the quality of the extracted voiced regions and may improve the performance of the proposed SV system. Future work should focus on high-SNR and low-SNR region segmentation in speech utterances using EMD for further improvement in SV performance.


Corresponding author: M. S. Rudramurthy, Department of Information Science and Engineering, S.I.T., Tumkur 572 103, Karnataka State, India, e-mail:

Bibliography

[1] B. S. Atal, Automatic recognition of speakers from their voices, Proc. IEEE. 64 (1976), 460–475.10.1109/PROC.1976.10155Search in Google Scholar

[2] S. G. Aukhun Tanyer and H. Auzer, Voice activity detection in non-stationary noise, IEEE Trans. Speech Audio Process.8 (2000), 478–482.10.1109/89.848229Search in Google Scholar

[3] L. E. Baum and T. Petrie, Statistical inference for probabilistic functions of finite state Markov chains, Ann. Math. Stat.37 (1966), 1554–1563.10.1214/aoms/1177699147Search in Google Scholar

[4] F. Bimbot, J. F. Bonastre, C. Frredouille, G. Gravier, I. Magrin Chagnollea, S. Meigner, T. Merlin, J. Ortega-Garcia, D. Petrovska-Delacretaz and D. A. Reynolds, A tutorial on text independent speaker verification, EURASIP J. Appl. Signal Process.2004 (2004), 430–451.Search in Google Scholar

[5] K. Bullington and J. M. Fraser, Engineering aspects of TASI, Bell Syst. Tech. J.38 (1959), 353–364.10.1002/j.1538-7305.1959.tb03892.xSearch in Google Scholar

[6] J. P. Campbell, Jr., Speaker recognition: a tutorial, Proc. IEEE85 (1976), 1437–1462.10.1109/5.628714Search in Google Scholar

[7] S. B. Davis and P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Trans. Acoust. Speech Signal Process.ASSP-28 (1980), 357–366.10.1109/TASSP.1980.1163420Search in Google Scholar

[8] ETSI EN 301708 recommendation, Voice activity detector (VAD) for adaptive multi-rate (AMR) speech traffic channels, 1999.Search in Google Scholar

[9] M. Faundez-Zanuy and E. Monte Moreno, State-of-the-art in speaker recognition, IEEE A&E Syst. Mag.20 (2005), 7–12.10.1109/MAES.2005.1432568Search in Google Scholar

[10] M. Feldman, Hilbert transform applications in mechanical vibration, John Wiley & Sons Ltd, UK, 2011.10.1002/9781119991656Search in Google Scholar

[11] P. Flandrin, G. Rilling and P. Goncalves, Empirical mode decomposition as a filter bank, IEEE Signal Process. Lett.11 (2004), 112–114.10.1109/LSP.2003.821662Search in Google Scholar

[12] D. K. Freeman, G. Cosier, C. B. Southcott and I. Boyd, The voice activity detector for the PAN-European digital cellular mobile telephone service, Int. Conf. Acoust. Speech Signal Process.1 (1989), 369–372.Search in Google Scholar

[13] D. K. Freeman, G. Cosier, C. B. Southcott and I. Boyd, A statistical model-based voice activity detection, IEEE Signal Process. Lett.6 (1999), 1–3.10.1109/97.736233Search in Google Scholar

[14] S. Furui, Recent advances in speaker recognition, Pattern Recognit. Lett.18 (1997), 859–872.10.1016/S0167-8655(97)00073-1Search in Google Scholar

[15] S. Furui, 50 Years of progress in speech and speaker recognition research, ECTI Trans. Comput. Inf. Technol.1 (2005), 64–74.10.37936/ecti-cit.200512.51834Search in Google Scholar

[16] H. Gish and M. Schmidt, Text independent speaker identification, IEEE Signal Process. Mag.11 (1994), 18–32.10.1109/79.317924Search in Google Scholar

[17] S. L. Hahn, Hilbert transform in signal processing, Artech House, Boston, p. 442, 1996.Search in Google Scholar

[18] J. A. Haigh and J. S. Mason, Robust voice activity detection using cepstral features, in: Proceedings of the IEEE Region 10 Conference TENCON, pp. 321–324, 1993.Search in Google Scholar

[19] S. Haykin and L. Li, Nonlinear adaptive prediction of nonstationary signal, IEEE Trans. Signal Process.43 (1995), 526–535.10.1109/78.348134Search in Google Scholar

[20] N. E. Huang and S. S. P. Shen, The Hilbert–Huang transform and its applications, Interdisciplinary Mathematical Sciences, vol. 5, World Scientific Publishing Co. Pte. Ltd., Singapore, 2005. doi:10.1142/5862.

[21] N. E. Huang, Nonlinear evolution of water waves: Hilbert's view, in: Proc. 2nd Int. Symp. Experimental Chaos, W. Ditto et al., eds., pp. 327–341, World Scientific, Scotland, UK, 1995.

[22] N. E. Huang, Z. Shen, S. R. Long, M. C. Wu, H. H. Shih, Q. Zheng, N.-C. Yen, C. C. Tung and H. H. Liu, The empirical mode decomposition and Hilbert spectrum for nonlinear and nonstationary time series analysis, Proc. R. Soc. Lond. A 454 (1998), 903–995. doi:10.1098/rspa.1998.0193.

[23] N. E. Huang, Z. Shen and S. R. Long, A new view of nonlinear water waves: the Hilbert spectrum, Annu. Rev. Fluid Mech. 31 (1999), 417–457. doi:10.1146/annurev.fluid.31.1.417.

[24] ITU-T Recommendation G.729 Annex B, A silence compression scheme for G.729 optimized for terminals conforming to Recommendation V.70, 1996.

[25] L. Karray and A. Martin, Towards improving speech detection robustness for speech recognition in adverse environment, Speech Commun. 40 (2003), 261–276. doi:10.1016/S0167-6393(02)00066-3.

[26] D. N. Kaslovsky and F. G. Meyer, Noise corruption of empirical mode decomposition and its effect on instantaneous frequency, Adv. Adapt. Data Anal. 2 (2010), 373–396. doi:10.1142/S1793536910000537.

[27] T. Kinnunen and H. Li, An overview of text-independent speaker recognition: from features to supervectors, Speech Commun. 52 (2010), 12–42. doi:10.1016/j.specom.2009.08.009.

[28] K. Li, M. Swamy and M. O. Ahmad, An improved voice activity detection using higher order statistics, IEEE Trans. Speech Audio Process. 13 (2005), 965–974. doi:10.1109/TSA.2005.851955.

[29] S. R. Mahadeva Prasanna and G. Pradhan, Significance of vowel-like regions for speaker verification under degraded conditions, IEEE Trans. Audio Speech Lang. Process. 19 (2011), 2552–2565. doi:10.1109/TASL.2011.2155061.

[30] S. R. Mahadeva Prasanna and G. Pradhan, Speaker verification by vowel and non-vowel like segmentation, IEEE Trans. Audio Speech Lang. Process. 21 (2013), 854–867. doi:10.1109/TASL.2013.2238529.

[31] R. J. Mammone, X. Zhang and R. P. Ramachandran, Robust speaker recognition: a feature-based approach, IEEE Signal Process. Mag. 13 (1996), 58–71. doi:10.1109/79.536825.

[32] A. Martin, G. Doddington, T. Kamm, M. Ordowski and M. Przybocki, The DET curve in assessment of detection task performance, in: Proceedings of Eurospeech, pp. 1–8, 1997.

[33] M. Marzinzik and B. Kollmeier, Speech pause detection for noise spectrum estimation by tracking power envelope dynamics, IEEE Trans. Speech Audio Process. 10 (2002), 109–118. doi:10.1109/89.985548.

[34] I. G. Mattingly, A. M. Liberman, A. K. Syrdal and T. Halwes, Discrimination in speech and nonspeech modes, Cognit. Psychol. 2 (1971), 131–157. doi:10.1016/0010-0285(71)90006-5.

[35] J. Mehler, J. Bertoncini and M. Barriere, Infant recognition of mother's voice, Perception 7 (1978), 491–497. doi:10.1068/p070491.

[36] M. Mills and E. Melhuish, Recognition of mother's voice in early infancy, Nature 252 (1974), 123–124. doi:10.1038/252123a0.

[37] P. A. Morse, The discrimination of speech and non-speech stimuli in early infancy, J. Exp. Child Psychol. 14 (1972), 477–492. doi:10.1016/0022-0965(72)90066-5.

[38] K. S. R. Murty and B. Yegnanarayana, Epoch extraction from speech signals, IEEE Trans. Audio Speech Lang. Process. 16 (2008), 1602–1613. doi:10.1109/TASL.2008.2004526.

[39] M. Nalina, M. S. Rudramurthy and R. Kumaraswamy, EMD-based VAD as preprocessing for speech recognition in noisy environment, in: National Conference on Recent Advances in Electronics and Communication Engineering (NCRAECE-13), pp. 344–348, 2013.

[40] S. J. Orfanidis, Introduction to signal processing, Prentice Hall International Edition, Upper Saddle River, NJ, 1995.

[41] R. D. Peacocke and D. H. Graf, An introduction to speech and speaker recognition, IEEE Comput. 23 (1990), 26–33. doi:10.1109/2.56868.

[42] G. Pradhan and S. R. M. Prasanna, Speaker verification under degraded condition: a perceptual study, Int. J. Speech Technol. 14 (2011), 405–417. doi:10.1007/s10772-011-9120-6.

[43] G. Pradhan, B. C. Haris, S. R. M. Prasanna and R. Sinha, Speaker verification in sensor and acoustic environment mismatch conditions, Int. J. Speech Technol. 15 (2012), 381–392. doi:10.1007/s10772-012-9159-z.

[44] L. R. Rabiner and M. R. Sambur, Voiced–unvoiced–silence detection using the Itakura LPC distance measure, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 323–326, 1977.

[45] J. Ramirez, J. C. Segura, C. Benitez, A. de la Torre and A. Rubio, Efficient voice activity detection algorithms using long-term speech information, Speech Commun. 42 (2004), 271–287. doi:10.1016/j.specom.2003.10.002.

[46] D. A. Reynolds, An overview of automatic speaker recognition technology, in: Proceedings of ICASSP, Orlando, FL, pp. 4072–4075, 2002.

[47] D. A. Reynolds, Speaker identification and verification using Gaussian mixture speaker models, Speech Commun. 17 (1995), 91–108. doi:10.1016/0167-6393(95)00009-D.

[48] D. A. Reynolds, T. F. Quatieri and R. B. Dunn, Speaker verification using adapted Gaussian mixture models, Digital Signal Process. 10 (2000), 19–41. doi:10.1006/dspr.1999.0361.

[49] G. Rilling and P. Flandrin, One or two frequencies? The empirical mode decomposition answers, IEEE Trans. Signal Process. 56 (2008), 85–95. doi:10.1109/TSP.2007.906771.

[50] A. E. Rosenberg, Automatic speaker verification: a review, Proc. IEEE 64 (1976), 475–487. doi:10.1109/PROC.1976.10156.

[51] A. E. Rosenberg and F. K. Soong, Recent research in automatic speaker recognition, in: Advances in Speech Signal Processing, S. Furui and M. M. Sondhi, eds., Marcel Dekker, New York, pp. 701–738, 1991.

[52] M. S. Rudramurthy, V. Kamakshi Prasad and R. Kumaraswamy, Voice activity detection algorithm using zero frequency filter assisted peaking resonator and empirical mode decomposition, in: Proceedings of the International Conference on Communication, VLSI and Signal Processing (ICCVSP), pp. 38–43, 2013.

[53] M. S. Rudramurthy, V. Kamakshi Prasad and R. Kumaraswamy, Voice activity detection algorithm using zero frequency filter assisted peaking resonator and empirical mode decomposition, J. Intell. Syst. 22 (2013), 269–282. doi:10.1515/jisys-2013-0036.

[54] A. Sangwan, M. C. Chiranth, H. S. Jamadagni, R. Sah, R. V. Prasad and V. Gaurav, VAD techniques for real-time speech transmission on the Internet, in: IEEE International Conference on High-Speed Networks and Multimedia Communication, pp. 46–50, 2002.

[55] F. K. Soong and A. E. Rosenberg, On the use of instantaneous and transitional spectral information in speaker recognition, IEEE Trans. Acoust. Speech Signal Process. 36 (1988), 871–879. doi:10.1109/29.1598.

[56] S. E. Tranter and D. A. Reynolds, An overview of automatic speaker diarization systems, IEEE Trans. Audio Speech Lang. Process. 14 (2006), 1557–1565. doi:10.1109/TASL.2006.878256.

[57] R. Tucker, Voice activity detection using a periodicity measure, Proc. Inst. Elect. Eng. 139 (1992), 377–380. doi:10.1049/ip-i-2.1992.0052.

[58] E. Verteletskaya and K. Sakhnov, Voice activity detection for speech enhancement applications, Acta Polytech. 50 (2010), 100–105. doi:10.14311/1251.

[59] Z. Wu and N. E. Huang, A study of the characteristics of white noise using the empirical mode decomposition method, Proc. R. Soc. Lond. A 460 (2004), 1597–1611. doi:10.1098/rspa.2003.1221.

[60] Y. Xu and H. Z. Zhang, Recent mathematical developments on empirical mode decomposition, Adv. Adapt. Data Anal. 1 (2009), 681–702. doi:10.1142/S1793536909000242.

Received: 2013-10-30
Published Online: 2014-2-10
Published in Print: 2014-12-1

©2014 by De Gruyter

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
