BY-NC-ND 3.0 license Open Access Published by De Gruyter April 2, 2014

Speaker Identification Using Empirical Mode Decomposition-Based Voice Activity Detection Algorithm under Realistic Conditions

  • M.S. Rudramurthy, Nilabh Kumar Pathak, V. Kamakshi Prasad and R. Kumaraswamy

Abstract

Speaker recognition (SR) under mismatched conditions is a challenging task. The speech signal is nonlinear and nonstationary, and is therefore difficult to analyze under realistic conditions. Moreover, in real conditions, the nature of the noise present in speech data is not known a priori, and the performance of speaker identification (SI) or speaker verification (SV) degrades considerably. Any SR system uses a voice activity detector (VAD) as the front-end subsystem of the whole system, and the performance of most VADs deteriorates under degraded or realistic conditions where noise plays a major role. Recently, speech data analysis and processing using Norden E. Huang’s empirical mode decomposition (EMD) combined with the Hilbert transform, commonly referred to as the Hilbert–Huang transform (HHT), has become an emerging trend. EMD is an a posteriori, adaptive, time-domain data analysis tool that is widely accepted by the research community, and speech data analysis and processing for speech recognition and SR tasks using EMD have been increasing. EMD-based VAD has become an important adaptive subsystem of the SR system that largely mitigates the mismatch between the training and testing phases. Recently, we developed a VAD algorithm using a zero-frequency filter-assisted peaking resonator (ZFFPR) and EMD. In this article, the efficacy of this EMD-based VAD algorithm is studied at the front end of a text-independent, language-independent SI task on speaker data collected in three languages at five different places (home, street, laboratory, college campus, and restaurant) under realistic conditions using an Edirol R-09HR 24-bit WAV/MP3 recorder. The performance of the proposed SI task is compared against the traditional energy-based VAD in terms of percentage identification rate. In both cases, widely accepted Mel frequency cepstral coefficients are computed by frame processing (20-ms frame size and 10-ms frame shift) of the voiced speech regions extracted by the respective VAD techniques from the realistic speech utterances, and are used as feature vectors for speaker modeling using the popular Gaussian mixture models. The experimental results show that the proposed SI task with the ZFFPR- and EMD-based VAD algorithm at its front end performs better than the SI task with short-term energy-based VAD at its front end, and the results are encouraging.

1 Introduction

Recognition of a speaker using the intrinsic characteristics of his/her voice is an example of a biometric task. The key motivation behind the study of speaker recognition (SR) is to ensure more reliable personal identification based on the speaker’s voice. The art of identifying people based on their voice characteristics is of paramount importance owing to the growing need in information processing and telecommunications, particularly for security applications such as physical access control, computer data access control, forensics, and the military. The key advantage of using biometrics is that it is more reliable than conventional artifacts, perhaps even unique; moreover, biometric attributes cannot be lost or forgotten and thus need not be remembered. SR is a generic term that refers to any task that discriminates between people based on their voice characteristics [21]. The SR task is basically categorized into two specific tasks: (i) speaker identification (SI) and (ii) speaker verification (SV). In SI, the task is to classify an unlabeled voice token as belonging to one of a set of n reference speakers (i.e., a one-to-many matching task), whereas SV refers to the task of deciding whether an unlabeled voice token belongs to a specific reference speaker, with two possible outcomes: the token is either accepted or rejected [16, 21]. The SI task is further categorized into text-dependent and text-independent SI tasks. In a text-dependent SI task, the same text is used for both training and testing, whereas in a text-independent SI task, the text used for training and testing is not the same. In both cases, however, the speech utterances used for training and testing are generally in the same language. Furthermore, in SI, most of the computation originates from the distance or likelihood computations between the feature vectors of the unknown speaker and the models in the database, and the identification time depends on the number of feature vectors, their dimensionality, the complexity of the speaker models, and the number of speakers [38]. The study of speech in the context of speech recognition and SR has a history of >60 years [26]. There are several tutorial [16], survey [26, 50], and overview [4, 23, 27, 65] reports that describe state-of-the-art SI and the current challenges.

Most state-of-the-art SI systems provide high reliability and accuracy when trained and tested with speech utterances collected from speakers in an acoustically controlled environment. However, such reliability and accuracy are difficult to achieve in SR tasks in practical applications under unconstrained conditions. Real-world SI applications differ from ideal or laboratory conditions: perturbations lead to a mismatch between the training and testing environments and degrade the performance drastically [62]. Noise-robust SR has therefore become an important research topic.

The major hurdles in achieving reliability and high accuracy in practical SI tasks are factors such as handset/channel mismatch and environmental noise. It is therefore desirable that any SI task be robust against noise and intra-speaker variability, and independent of the text and language used by the speakers during training and testing. Among these factors, environmental noise and its impact on the reliability and accuracy of the SI task need particular attention. Owing to the mobile nature of practical SI tasks/systems, the noise sources can be highly time-varying and potentially unknown, which raises the requirement for noise robustness in the absence of a priori information about the noise [45]. Generally, SI tasks rely on a similarity measure across a set of voice recordings. Currently, it is not possible to completely determine whether the similarity between two recordings is due to the speaker or to other factors, especially when (i) the speaker does not cooperate, (ii) there is no control over the recording equipment, (iii) the recording conditions are not known, (iv) one does not know whether the voice was disguised, and, to a lesser extent, (v) the linguistic content of the message is not controlled. Caution and judgment must be exercised when applying SR techniques, whether human or automatic, to account for these uncontrolled factors [13]. To accomplish noise robustness, i.e., to overcome mismatched conditions between training and testing sessions in SR tasks, there exist many speech enhancement methods, such as spectral subtraction [12], Wiener filtering [10], and Kalman filtering [52, 64], used as preprocessing at the front end to mitigate the effects of stationary noises, and many feature postprocessing methods, such as histogram equalization [20], cepstral mean subtraction [25], and cepstral variance normalization [66], which mainly aim to convert the extracted raw speech features into a form less vulnerable to noise corruption in adverse environments [71]. Most filtering techniques assume stationary noise and require a priori knowledge of the noise spectrum, which may not be adequate under realistic conditions owing to the nonstationary nature of noise and speech. However, reality never bends before our assumptions.

In reality, most data, like speech, are nonlinear, nonstationary, and multicomponent in nature. Under realistic conditions, noise is also mostly nonstationary and degrades the performance of SR tasks. Any nonlinear nonstationary data analysis in real time has to be adaptive and data driven, without any a priori assumptions. Recently, empirical mode decomposition (EMD), an adaptive, a posteriori, data-driven decomposition technique for analyzing nonlinear nonstationary data in the time domain [37], has become available. Speech data analysis and processing using EMD [1, 28–31, 70] is an emerging field. Both SR tasks, SI and SV, use voice activity detectors (VADs) as a front-end component. The performance of a VAD is strongly affected by the presence of noise in speech data, which in turn degrades the performance of SR tasks. Recently, an adaptive VAD algorithm using a zero-frequency filter-assisted peaking resonator (ZFFPR) and EMD was investigated in Ref. [61]. The efficacy of an EMD-based VAD as preprocessing for speech recognition in a noisy environment with hidden Markov modeling is studied in Ref. [49], with focus on the effects of white noise and high-frequency channel noise at signal-to-noise ratio (SNR) levels ranging from 0 to 30 dB on the TIDIGIT database. The performance of this VAD was compared against the traditional energy-based VAD, and the method provided encouraging results and improvements in word accuracy. For example, for white noise and high-frequency channel noise at an SNR of 10 dB, this method provided word accuracies of 70.26% and 61.06%, respectively, against 32.93% and 24.11%, respectively, for traditional energy-based VAD.

Motivated by the above results, we studied the efficacy of the adaptive VAD algorithm using ZFFPR and EMD in a text-independent, language-independent SI scenario under realistic conditions. The novelty of this work is the attempt to develop a text-independent and language-independent SI system with EMD-based VAD as its front end, using a realistic database consisting of speech utterances from speakers in three different languages, collected at five different places in a realistic environment. For this initial study, only 30 speakers (14 men and 16 women) are considered.

This article is organized as follows: Section 2 describes EMD algorithmic issues; Section 3 describes EMD-based VAD; Section 4 describes the components of the SI system; and Section 5 provides the description of experimental setup, results, and discussion.

2 Empirical Mode Decomposition

The core activity in scientific research is data analysis. Data analysis and data processing are two different activities. Data analysis was often neglected in the past owing to the lack of an appropriate analysis method, with the result that data processing methods have taken over this task thus far. Earlier data processing methods were developed with strict mathematical theories and rules. The limitations of traditional data processing methods and their strict adherence to mathematical rigor are described in Ref. [22, p. 28]. Most real-world systems, such as geophysical systems, oceanic systems, and biological systems like the speech production system, are neither linear nor stationary. The behavior of nonlinear systems is difficult to understand and analyze. Furthermore, the excitation of these systems cannot be clearly ascertained, and accurate mathematical representation and modeling of complex nonlinear systems is also difficult. In such circumstances, data are the only link that can reveal the behavior of such complex nonlinear systems in the real world. Therefore, data analysis is crucial, rather than data processing. The key goal of any data analysis method is to understand the underlying physical or physiological mechanisms that generate the data. Therefore, any data analysis method has to be adaptive in the time domain, should not rely heavily on mathematical rigor, and should not make a priori assumptions that the data are linear and stationary. Fortunately, the EMD algorithm, investigated by Norden E. Huang in Ref. [37], is of this kind, and today it is considered a potential tool for the analysis of nonlinear nonstationary data such as speech. Since its introduction, it has found applications in almost all areas of science and engineering, such as ocean studies [7], atmospheric studies [73, 74], geophysical studies [33], fluid studies [35, 36], financial and economic studies [18, 67, 72], biomedical engineering [5, 6, 44, 53–55, 75, 77], earthquake engineering [56, 69], and structural health monitoring [17, 32, 34].

The speech production system is a complex nonlinear system that produces nonstationary multicomponent speech that is difficult to understand and model accurately. Speech is the only link we have with this system, and all the information about the underlying mechanism is embedded in it. Speech data analysis and subsequent processing using EMD are rapidly increasing in speech processing applications such as speech enhancement [14], voiced and unvoiced speech classification [46, 47], speech denoising [39], pitch determination [31], speech recognition [71], and SR [28, 41, 70]. EMD has been used in speech recognition with considerable benefit since its introduction. However, its merit has yet to be proved in the area of SR, as its use is new to the field [9]; hence the motivation for this study. EMD combined with Hilbert spectral analysis became a novel tool for the analysis of all types of data, to which the National Aeronautics and Space Administration gave the name Hilbert–Huang transform (HHT) technology.

The EMD iteratively decomposes any nonlinear nonstationary data like speech into a set of discrete modes of oscillations. Each of the discrete modes of oscillations is a zero mean component (narrowband) referred to as an intrinsic mode function (IMF) that satisfies the following two criteria:

  1. The number of extrema and the number of zero crossings are either equal to each other or differ by at most one.

  2. At any point, the mean value of the envelope defined by the local maxima and the envelope defined by the local minima is zero.

The EMD of a signal x(t) is based on the following observations [76]:

  1. The signal has at least two extrema, i.e., one maximum and one minimum.

  2. The characteristic time scale is clearly defined by the time lapse between successive alternations of local maxima and minima of the signal.

  3. If the signal has no extrema but contains inflection points, then it can be differentiated one or more times to reveal the extrema.

To extract the set of IMFs from the original signal, x(t), the following procedure, called the sifting procedure, the core of the EMD algorithm, is carried out:

  1. Detect and extract all maxima and minima points of the signal and interpolate between them to determine the upper and lower envelopes, Eupper and Elower, respectively.

  2. Using these envelopes, calculate the local mean, m(t) as

    m(t) = \frac{E_{\mathrm{upper}}(t) + E_{\mathrm{lower}}(t)}{2},   (1)

    where Eupper and Elower represent the upper and lower envelope, respectively.

  3. Subtract this mean m(t) from the original and use the result as the new signal

    h(t) = x(t) - m(t).   (2)

    If h(t) does not satisfy the criteria of an IMF, the procedure is iterated from step 1 (a further sifting operation) with h(t) as the new input, and steps 4 and 5 are skipped.

  4. If h(t) matches the criteria of an IMF, it is stored as an IMF, ci(t)

    c_i(t) = h(t),   (3)

    and subtracted from the original signal to get the residual

    r(t) = x(t) - c_i(t),   (4)

    where i refers to the ith IMF.

  5. Repeat from step 1 with the residual r(t) as the new signal to extract the next IMF.

Finally, the procedure will terminate when the residual r(t) becomes a monotonic function, called the signal’s trend, from which no further IMFs can be extracted. It is a continuously increasing or decreasing function with the number of extrema being less than two.

Now, the original signal x(t) can be reconstructed using components obtained from EMD of x(t) as follows:

x(t) = \sum_{i=1}^{n} c_i(t) + r_n(t),   (5)

where c_i(t) is the ith IMF and r_n(t) is the monotonic residual. The strict definition of an IMF may require an excessive number of sifting iterations to extract the IMF, which may then lack physical sense. To guarantee that the IMF components retain enough physical sense in both amplitude and frequency modulations, a criterion is needed to stop the sifting process. This can be accomplished by limiting the size of the standard deviation (SD), computed from two consecutive sifting results [37], as

\mathrm{SD} = \sum_{t=0}^{T} \frac{\left| h_{1(k-1)}(t) - h_{1k}(t) \right|^{2}}{h_{1(k-1)}^{2}(t)}.   (6)

For most practical purposes, SD is chosen to be 0.2–0.3, and the number of iterations required for the sifting procedure to yield physically meaningful IMFs is chosen to be 10–15 [32, 37, 68].
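To make the procedure concrete, the following is a minimal NumPy/SciPy sketch of the sifting loop of steps 1–5 together with the SD stopping criterion of eq. (6). The function names (local_mean, sift, emd), the cubic-spline envelope interpolation, and the small regularizing constant in the SD computation are illustrative assumptions, not the implementation of Refs. [37, 61].

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def local_mean(x):
    """Steps 1-2: spline-interpolated extrema envelopes and their mean, eq. (1)."""
    t = np.arange(len(x))
    mx = argrelextrema(x, np.greater)[0]
    mn = argrelextrema(x, np.less)[0]
    if len(mx) < 2 or len(mn) < 2:        # too few extrema: residual is a trend
        return None
    e_upper = CubicSpline(mx, x[mx])(t)   # upper envelope E_upper(t)
    e_lower = CubicSpline(mn, x[mn])(t)   # lower envelope E_lower(t)
    return (e_upper + e_lower) / 2.0      # eq. (1)

def sift(x, sd_limit=0.3, max_iter=15):
    """Steps 3-4: iterate eq. (2) until the SD of eq. (6) drops below sd_limit."""
    h = x.copy()
    for _ in range(max_iter):
        m = local_mean(h)
        if m is None:
            return None
        h_new = h - m                                      # eq. (2)
        sd = np.sum((h - h_new) ** 2 / (h ** 2 + 1e-12))   # eq. (6)
        h = h_new
        if sd < sd_limit:
            break
    return h

def emd(x, max_imfs=10):
    """Step 5 and eq. (5): peel off IMFs until the residual is monotonic."""
    imfs, r = [], np.asarray(x, dtype=float)
    for _ in range(max_imfs):
        c = sift(r)
        if c is None:                     # monotonic residual: the signal's trend
            break
        imfs.append(c)                    # store c_i(t), eq. (3)
        r = r - c                         # residual, eq. (4)
    return imfs, r
```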

Each IMF is a narrowband monocomponent that preserves the intrinsic characteristics of the original data (i.e., its nonlinear nonstationary nature). The Hilbert transform is then applied to calculate the instantaneous frequencies of the original signal. The task thus shifts to computing the instantaneous frequency from the real-valued signal. This can be done by forming the analytic signal using the Hilbert transform:

z(t) = c_i(t) + j\,H[c_i(t)],   (7)

where H[·] is the Hilbert transform and j is the imaginary unit. From the above, the instantaneous amplitude a(t), instantaneous phase θ(t), and instantaneous frequency f(t) can be computed as follows:

Instantaneous amplitude a(t)

a(t) = \left( c_i^2(t) + H^2[c_i(t)] \right)^{1/2},   (8)

and instantaneous phase θ(t)

\theta(t) = \arctan\!\left( \frac{H[c_i(t)]}{c_i(t)} \right),   (9)

then the derivative of the instantaneous phase provides the instantaneous frequency ω(t) given by

\omega(t) = \frac{d\theta(t)}{dt}.   (10)

The instantaneous frequency f(t) can be defined as

f(t) = \frac{1}{2\pi} \frac{d\theta(t)}{dt},   (11)

in terms of the derivative of the phase θ(t). The discrete-time instantaneous frequency ω(n) is computed by a central difference scheme as

\omega(n) = \frac{1}{2\pi} \cdot \frac{\theta(n+1) - \theta(n-1)}{2T},   (12)

where T is the sampling interval. Thus, a given time n corresponds to a frequency ω(n) and an amplitude a(n). On the (n, ω) plane, each point corresponds to an amplitude that is a function of both time n and frequency ω; however, time n and frequency ω are not independent but are related by the function ω(n). The triplet (n, ω(n), a(n)) determines a point in the three-dimensional (n, ω, a) space. One can find a(n) for all IMFs, and hence many amplitudes on the (n, ω) plane. These amplitudes form the discrete Hilbert spectra, referred to as the Hilbert amplitude spectrum.
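As a sketch of eqs. (7)–(12), the instantaneous amplitude and frequency of a single IMF can be computed with scipy.signal.hilbert, which returns the analytic signal of eq. (7); np.gradient applies a central-difference scheme analogous to eq. (12). The function name and the phase-unwrapping step are illustrative choices, not prescribed by the source.

```python
import numpy as np
from scipy.signal import hilbert

def instantaneous_attributes(imf, fs):
    """Instantaneous amplitude and frequency (Hz) of one IMF, eqs. (7)-(12)."""
    z = hilbert(imf)                   # analytic signal z(t) = c(t) + jH[c(t)], eq. (7)
    a = np.abs(z)                      # instantaneous amplitude, eq. (8)
    theta = np.unwrap(np.angle(z))     # instantaneous phase, eq. (9)
    f = np.gradient(theta) * fs / (2 * np.pi)  # central-difference phase derivative,
    return a, f                        # scaled to Hz as in eqs. (11)-(12)
```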

The underlying HHT of the signal is mathematically defined [37] as

\mathrm{HHT}(t, \omega) = \sum_{i=1}^{n} \mathrm{HHT}_i(t, \omega) \equiv \sum_{i=1}^{n} a_i(t, \omega_i),   (13)

where HHT_i(t, ω) represents the time–frequency distribution obtained from the ith IMF of the signal. The symbol ≡ denotes “by definition,” and a_i(t, ω_i) combines the amplitude a_i(t) and instantaneous frequency ω_i(t) of the signal.

The EMD of a typical real-time speech utterance “car” recorded for a male speaker is shown in Figure 1.

Figure 1 Illustration of the EMD of the Real-World Speech Utterance “Car.”

EMD provides a true, physically meaningful representation of speech data, because it makes it possible to visualize the different discrete modes of oscillation embedded in the original speech data. EMD can determine whether the data contain one or two frequencies, provided the components differ substantially in frequency [60]. This ability is significantly important in determining the source and system components from speech data [61], as described in Section 3. EMD effectively acts as a dyadic filter bank: a collection of band-pass filters that have a constant band-pass shape (e.g., a Gaussian distribution) but with neighboring filters covering half or double the frequency range of any single filter in the bank; the frequency ranges of the filters can overlap [68]. Owing to this dyadic filter property, EMD is capable of reducing white noise and fractional Gaussian noise [24]. This ability is significantly important at the front end of speech recognition and SR systems for alleviating the mismatch between training and testing sessions.

Furthermore, the resolution of the HHT time–frequency spectrum (i.e., time and frequency), as given in eq. (13), is excellent compared with the short-time Fourier transform (STFT) spectrum. For example, consider the utterance made by a male speaker, “She had your dark suit,” from the TIMIT database. The STFT and HHT spectra for this speech utterance are shown in Figure 2.

Figure 2 (A) STFT Spectrum and (B) HHT Spectrum for the Male Speech Utterance “She Had Your Dark Suit” Chosen from the TIMIT Database.

Figure 2B clearly illustrates the superior resolution of the HHT spectrum compared with the STFT spectrum. This may be significantly important for adaptive source and filter component separation using HHT, and it is one of the motivations for the development of the EMD-based voice activity detection algorithm using ZFFPR and EMD in Ref. [61].

3 EMD-Based Voice Activity Detection Algorithm

The ability of EMD to decompose nonlinear nonstationary multicomponent speech into a set of discrete modes of oscillation, which are narrowband zero-mean components, is of great importance from a signal detection point of view. Detection of a specific component that truly represents the source excitation characteristics of the speech production system from speech data is a challenging task. Source excitation characteristics are most abundant in voiced regions compared with other parts of a speech utterance, such as silence, unvoiced speech, or noisy regions. Voiced regions are comparatively high-SNR regions and are less degraded than other parts of speech. Therefore, extraction of high-SNR voice activity regions under degraded conditions remains a challenging, unsolved problem. Although many different types of VAD techniques are in practice, the performance of VAD at the front end of speech recognition and SR systems degrades under degraded conditions or uncontrolled situations, especially when the interfering noise is nonstationary, which in turn deteriorates the recognition performance of speech recognition and SR tasks.

It is well known that the frequency of vibration of vocal folds, called fundamental frequency or pitch, is much lower than the resonant frequencies of the filter components in speech production mechanisms. On the basis of this fact, the ability of EMD to decompose nonlinear nonstationary data adaptively into a set of IMFs is exploited in the development of a VAD algorithm using ZFFPR and EMD [61]. The block diagram of the VAD algorithm using ZFFPR and EMD is shown in Figure 3.

Figure 3 Block Diagram of a VAD Algorithm Using ZFFPR and EMD.

Speech data are decomposed into a set of IMFs and simultaneously zero-frequency filtered. The significant excitation of the vocal tract system in the speech production process occurs at glottal closure instants [2, 3], called epochs. Epoch extraction, and determination of the fundamental frequency (pitch) from the knowledge of epochs using noise-robust zero-frequency filters (ZFF) [48], even under adverse conditions, is well accepted in SR research. Among the set of IMFs obtained through EMD of speech, detecting the IMFs that dominantly contain significant source excitation information is a challenging task. An adaptive framework that combines the ZFF with the peaking resonator (PR) described in Ref. [51] in EMD space, capable of detecting the specific IMF among the set of IMFs obtained through EMD of speech, is a novel approach in signal detection practice. Each of the IMFs is passed through the PR, which is resonated at the fundamental frequency determined by the ZFF, and the energy of the PR-filtered IMF is computed each time. The IMF that transfers maximum energy through the filter is called the characteristic IMF (CIMF) and is assumed to contain the significant source excitation information. The CIMF is then chosen for signal processing, i.e., block processing, and the evidence obtained from it is used for developing the VAD. To gain insight into its efficacy in improving recognition performance, this EMD-based VAD was integrated at the front end of a speech recognition scenario with the TIDIGIT database in Ref. [49] and compared against a baseline system in which the EMD-based VAD was replaced with traditional energy-based VAD. Both systems were studied under degraded conditions with white noise and high-frequency channel noise at SNR levels of 0, 5, 10, 15, and 20 dB. A sketch of the CIMF selection step follows.
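The minimal sketch below assumes the pitch estimate f0 from the ZFF stage is already available from a separate module. scipy.signal.iirpeak is used here as a stand-in for the second-order peaking resonator of Ref. [51]; the function name, the Q value, and this particular filter design are assumptions rather than the exact ZFFPR configuration of Ref. [61].

```python
import numpy as np
from scipy.signal import iirpeak, lfilter

def select_cimf(imfs, f0, fs, q=5.0):
    """Pick the characteristic IMF (CIMF): the IMF transferring maximum
    energy through a resonator centered at the ZFF-estimated pitch f0."""
    b, a = iirpeak(f0, Q=q, fs=fs)     # second-order peaking resonator at f0 Hz
    energies = [np.sum(lfilter(b, a, c) ** 2) for c in imfs]
    return imfs[int(np.argmax(energies))], energies
```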

4 SI System

4.1 Block Diagram of the Proposed SI System

In an SI task, speech from a human individual is used to identify who that individual is. The block diagram of the proposed SI task is shown in Figure 4.

Figure 4 Block Diagram of a Proposed Speaker Identification System [65].

The main components of the SI system are the following:

  1. Data acquisition system,

  2. Voice activity detection module,

  3. Feature extraction,

  4. Speaker modeling,

  5. Speaker’s model database.

There are two distinct operational phases: (i) the training (enrolment) phase and (ii) the testing phase. The training or enrolment phase is part of the system configuration before the system is deployed in the field of application. In the enrolment phase, speech utterances from each verified speaker in the pool of the known speaker population are used to build or train the model. The testing phase constitutes the true operation of the system. When the speech utterance of an unknown speaker, who may come from the general population rather than the pool of known speakers, is compared with each trained speaker model in the model database, the task is commonly referred to as open-set SI. Closed-set identification differs in that the unknown individual is assumed to belong to a preexisting pool or database of speakers (speaker models); the problem then becomes choosing the speaker in the pool from whom the unknown speech is derived. The closed-set SI task is commonly employed in an organizational setup with a fixed set of known speakers. Thus, the task of open-set identification is to determine whether the speaker belongs to the group of known speakers: the speaker is rejected if he/she does not belong to the group; otherwise, the closed-set SI task is performed. The performance of the SI system is usually measured as the percentage of correct identifications averaged across all speakers in the pool, referred to as the percentage identification rate (%IDR). The proposed system in this study, shown in Figure 4, uses EMD-based VAD at the front end of the SI task. The baseline system is similar to the proposed system, except that the EMD-based VAD is replaced by the traditional short-time energy-based VAD.

The data acquisition system consists of a signal-conditioning circuit with a preamplifier that collects the speech signal from the sensor microphone; the signal is digitized using an analog-to-digital converter or a voice-coding module that samples the analog speech at a sampling rate of at least twice the highest frequency component present in the original speech, usually 8000 Hz. The digitized speech data are then input to the EMD-based VAD in the proposed SI task, or to the short-time energy-based VAD in the baseline SI task, which extracts the voice activity regions from the input speech utterances and distinguishes them from unvoiced, silence, and noisy regions. The extracted voiced regions are then used in the feature extraction process. The extracted features are used to build the speaker model using any of various speaker modeling techniques, such as vector quantization (VQ) [40], learning vector quantization (LVQ) [15], and the Gaussian mixture model (GMM) [57].

4.2 Feature Extraction

In the feature extraction process, each voiced speech segment extracted in the time domain by the energy-based VAD in the baseline SI task, or by the EMD-based VAD in the proposed SI task, is divided into overlapping fixed-duration segments called frames; this process is called frame blocking. The length of a frame is called the frame size. Usually, the frame size (in sample points) is a power of two in order to facilitate the use of the fast Fourier transform (FFT). The offset between successive frames is called the frame shift. In our study, a frame duration of 20 ms and a frame shift of 10 ms are employed for feature extraction. Cepstrum analysis, suggested by Bogert et al. in 1963, was originally used to process reverberative signals [11]. The Mel frequency cepstral coefficients (MFCCs), introduced by Davis and Mermelstein [19], are the most popular acoustic features, widely accepted in speech recognition, SR, and audio analysis. MFCCs take human perceptual sensitivity with respect to frequency into consideration and are therefore well suited for speech recognition and SR. MFCCs provide a compact representation of the spectral envelope of a frame of speech, which accounts for the frame’s perceived timbre. Figure 5 shows the MFCC feature extraction procedure. To maintain the continuity of the first and last points in the frame and prevent undesirable effects in the frequency response, each frame of speech is multiplied by the Hamming window

w(n, a) = (1 - a) - a \cos\!\left( \frac{2\pi n}{N - 1} \right), \quad 0 \le n \le N - 1.   (14)

In practice, the value of a is set to 0.46. With the Hamming window, the peaks in the frequency response are sharper and more distinct. Spectral analysis of speech reveals that different timbres in a speech signal correspond to different energy distributions over frequency. This can be visualized in the magnitude spectrum of the Hamming-windowed frame of speech by using the FFT. When performing the FFT of the windowed frame, the speech signal within the frame is assumed to be periodic and stationary. The magnitude spectrum is then squared to obtain the power spectrum, or short-time power spectral density (PSD), of the speech frame. The PSD of the speech frame is then filtered using a series of M overlapping triangular filters centered on the Mel scale, a nonlinear, perceptually motivated frequency scale that approximates the frequency weighting of the human auditory system, derived from perception experiments [63]. Typically, 24 filters are used for the range of 0–8 kHz.

The Mel scale is roughly linear below 1 kHz and logarithmic above it, meaning that the 24 Mel filters, if measured on a linear Hertz scale, become broader with increasing frequency. This corresponds well with the finding from psychoacoustics that timbre corresponds to the relative level in each of the 27 critical bands, compared across filters in a process called profile analysis. Each critical band has a breadth of approximately one-third of an octave. The positions of these filters are equally spaced along the Mel frequency f_mel, which is related to the common linear natural frequency f_linear by the following equation:

f_{\mathrm{mel}} = 2595 \log_{10}\!\left( 1 + \frac{f_{\mathrm{linear}}}{700} \right).   (15)

The output amplitude of each filter is then measured by multiplying each spectral component by the height of the filter triangle at its position and summing the weighted components. The logarithm of the resulting 24-dimensional vector is taken, giving the log filter-bank energy vector, which still contains both source and filter information. A discrete cosine transform is then applied to the 24 log energies E_k obtained from the 24 triangular band-pass filters to yield 12 MFCCs. The energy within a frame is also important and can be obtained easily; augmenting the 12 MFCCs with the frame energy gives a 13-dimensional feature vector. To attain better recognition performance, in practice the time derivatives of (energy + MFCC), which represent the velocity and acceleration of (energy + MFCC), are appended as new features. This yields the 39-dimensional feature vector shown in Figure 5; a sketch of the feature extraction chain follows.

Figure 5 MFCC Feature Extraction Procedure.
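The sketch below illustrates this 39-dimensional feature extraction (13 MFCCs plus their velocity and acceleration) for 8-kHz speech with 20-ms frames and 10-ms shift, using librosa. The 24 Mel filters and the Hamming window follow the text; the FFT length of 256 and the remaining librosa defaults are assumptions.

```python
import numpy as np
import librosa

def extract_features(y, sr=8000):
    """39-dim vectors: 13 MFCCs (c0 acts as a log-energy-like term here)
    plus delta and delta-delta, 20-ms frames with 10-ms shift."""
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        n_fft=256,            # power of two covering the 20-ms frame (160 samples)
        win_length=160,       # 20 ms at 8 kHz
        hop_length=80,        # 10 ms shift
        n_mels=24,            # 24 triangular Mel filters, eq. (15)
        window="hamming")     # eq. (14)
    d1 = librosa.feature.delta(mfcc)            # velocity coefficients
    d2 = librosa.feature.delta(mfcc, order=2)   # acceleration coefficients
    return np.vstack([mfcc, d1, d2]).T          # shape (frames, 39)
```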

4.3 Speaker Modeling

A D-dimensional feature vector is extracted from each frame of the speech utterance through the feature extraction procedure described in Section 4.2. If the utterance of a specific speaker k is T frames long, the feature extraction procedure yields feature vectors x_i ∈ R^D, 1 ≤ i ≤ T. A statistical model is the best candidate for modeling speaker k, given that the utterance from the speaker consists of random sequences of time samples covering all possible spoken words. The GMM is a stochastic model widely used in text-independent SR tasks [58]. An important characteristic of the GMM is that it represents the mean (i.e., the center of the distribution) and the variance (i.e., the scattering around the mean) of the feature vectors in a multidimensional space; it assumes the distribution of the data to be Gaussian and adopts the multivariate Gaussian probability density for parameterization. Pattern matching is then simply formulated as measuring the probability density (or the likelihood) of an observation vector given the speaker model. The likelihood of an input feature vector given a specific GMM is the weighted sum over the likelihoods of the M unimodal Gaussian densities:

p(x_i \mid \lambda) = \sum_{j=1}^{M} \omega_j\, b(x_i \mid \lambda_j),   (16)

where b(x_i | λ_j) is the likelihood of x_i for the jth Gaussian mixture of the given model λ,

b(x_i \mid \lambda_j) = \frac{1}{(2\pi)^{D/2} \left| \Sigma_j \right|^{1/2}} \exp\!\left[ -\frac{1}{2} (x_i - \mu_j)^{T} \Sigma_j^{-1} (x_i - \mu_j) \right],   (17)

where D is the dimension of the vector, and μ_j and Σ_j are the mean vector and covariance matrix of the jth mixture, respectively. The mixture weights ω_j are constrained to be positive and sum to one. The GMM parameters λ = {ω_j, μ_j, Σ_j} are estimated from the training feature vectors using a maximum likelihood criterion through the expectation maximization (EM) algorithm [8, 19]. A key feature of the EM algorithm is that it can guarantee monotonic convergence to the set of optimal parameters (in the maximum-likelihood sense) in only a few (five or so) iterations [65]. A complete description of the GMM speaker modeling technique is provided in Refs. [57, 59].
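As a sketch of the training and matching steps, one GMM per speaker can be fit with scikit-learn's EM implementation, and an unknown utterance assigned to the model with the highest average log-likelihood of eq. (16). The diagonal-covariance choice and the function names are assumptions; the original experiments do not state these details.

```python
from sklearn.mixture import GaussianMixture

def train_speaker_models(features_by_speaker, n_mix=64):
    """Fit one diagonal-covariance GMM per speaker via the EM algorithm."""
    return {spk: GaussianMixture(n_components=n_mix,
                                 covariance_type="diag",
                                 max_iter=100).fit(feats)
            for spk, feats in features_by_speaker.items()}  # feats: (frames, D)

def identify(models, test_feats):
    """Closed-set SI: the speaker whose model gives the highest average
    log-likelihood of the test feature vectors, cf. eq. (16)."""
    scores = {spk: gmm.score(test_feats) for spk, gmm in models.items()}
    return max(scores, key=scores.get)
```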

5 Experiment, Results, and Discussion

5.1 Database

The speech database for the experiments was collected from 30 speakers: 14 male and 16 female. All 30 speakers were recorded in English, Hindi, and Kannada. Voice recording was done in different locations, such as a market place, a college, nearby roads, a home, and a laboratory. The speakers were students, members of the general public, and faculty members, with ages varying from 18 to 45 years. The speakers were asked to read short stories in the three languages. The training and testing data were recorded in different sessions with a minimum gap of 2 days; the approximate training and testing data length is 4 min. Recording was done using an Edirol R-09HR electronic device, and the freely downloadable WaveSurfer 1.8.5 software was used for editing and analysis of the speech files. The sampling rate was kept at 96 kHz in a two-channel, Lin 24 format. High-sampling-rate audio files, e.g., at 96 kHz, can be dropped down to a lower rate for distribution without losing much of the original data while still maintaining good fidelity; nevertheless, they take up a large amount of storage space. The Edirol is preferred for the recording process owing to its unique features: the R-09HR has an internal stereo microphone; a USB 2.0 port; and 1/8-in. stereo jacks for line in, mic in (with plug-in power), and line out/headphone. The R-09HR is a good, easy-to-use, general-purpose recorder that comes with a wireless remote. The speech files are stored in “.wav” format, and the speech corpus design follows a format similar to TIMIT. For each session, five different folders are created, one for each environmental region shown in Table 1. Each speaker has a corresponding folder within each environmental region folder that stores his/her speech files. The speech files are separated according to sentences, and their corresponding text files are also documented. Each speaker has 10 speech files, of which eight are used for training and two for testing in the SI process. The experiments are conducted using different sizes of training and testing data to study the effectiveness of the SI system. The detailed specifications of the database are shown in Table 1.

Table 1

Description of the Speech Corpus.

Item                            Description
No. of speakers                 30
Sessions                        Training and testing
Sampling rate                   96 kHz
Sampling format                 Two-channel, Lin 24
Languages covered               English, Hindi, and Kannada
Device                          Edirol R-09HR
Software                        WaveSurfer 1.8.5
Maximum duration                150 s/story/language
Minimum duration                Depends on the speaker
Environments                    Home, roads, market, college, laboratory
Ethnic background of speakers   Students, faculty members, general public

5.2 Experiment

The experiment in this study focuses on the development of a text-independent, language-independent SI system under realistic conditions such as street, home, laboratory, and college campus. This initial study focuses on a population size of 30 speakers (14 men and 16 women). The experience gained from this study is fruitful and motivates further studies with larger population sizes under adverse conditions.

The experiment is carried out with a baseline SI task that incorporates traditional energy-based VAD, using the database of speech utterances recorded at a sampling rate of 96 kHz for 30 speakers (14 men and 16 women) in a realistic environment. There are 10 speech utterances per speaker, collected at five different places in three different languages. Of the 10 speech utterances, eight are randomly chosen for the training session for each speaker, and the remaining two utterances of each speaker are used in the testing phase. During the training phase, speech utterances are down-sampled from 96 to 8 kHz. The down-sampled speech utterances are passed through the energy-based VAD at the front end of the SI task to extract the voiced segments of speech. The voiced segments are then processed with a frame size of 20 ms and a frame shift of 10 ms to extract a 39-dimensional feature vector from each speech frame using the popular MFCC feature extraction procedure described in Section 4.2. The extracted feature vectors for each speech utterance are used for modeling the speaker with the GMM technique described in Section 4.3, for Gaussian mixture sizes of 16, 32, 64, 128, and 256; this procedure is repeated for each of the speaker’s eight training utterances, and the speaker model database is thus created. During the testing phase, the remaining two speech utterances of each of the 30 speakers are considered; the voiced segments are similarly extracted using the energy-based VAD, and the feature vectors of each speech frame are compared against each speaker model in the database to determine the matching score and obtain the %IDR. This experiment is carried out separately for 10-speaker, 20-speaker, and 30-speaker sets, and the %IDR is determined. A sketch of this evaluation pipeline is given below.
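The sketch shows the down-sampling from 96 to 8 kHz and the %IDR computation over a labeled test set, reusing identify() from the sketch in Section 4.3. The helper names and the polyphase resampler choice are assumptions, not the authors' exact tooling.

```python
from scipy.signal import resample_poly

def downsample_96k_to_8k(y):
    """Down-sample from 96 kHz to 8 kHz (factor 12) with the built-in
    anti-aliasing polyphase filter."""
    return resample_poly(y, up=1, down=12)

def identification_rate(models, test_set):
    """%IDR: percentage of test utterances assigned to the correct speaker.
    test_set is a list of (true_speaker, feature_matrix) pairs."""
    correct = sum(identify(models, feats) == spk for spk, feats in test_set)
    return 100.0 * correct / len(test_set)
```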

A similar experiment is carried out with the energy-based VAD replaced by the adaptive EMD-based VAD at the front end of the SI task. The EMD-based VAD at the front end of the SI task is expected to outperform the energy-based VAD, and thus to enhance recognition performance by reducing the mismatch between training and testing. Using the knowledge of epochs (the instants of glottal closure at which significant excitation of the vocal tract system takes place) to determine the fundamental frequency [3, 48], and detecting the specific IMF called the CIMF among the set of IMFs obtained through EMD of realistic speech data to make voiced/unvoiced decisions using the ZFFPR [61], provides a novel approach under degraded or realistic conditions. Furthermore, owing to the ability of EMD to separate the frequency components existing in multicomponent data like speech, as shown in Figure 1, the high-frequency system components reside in the first few IMFs, and the source excitation information mostly resides in the lower-order IMFs, i.e., IMF4–IMF7. The CIMF, which preserves most of the source excitation information, mostly falls in these lower-order IMFs; detecting it for VAD purposes enables voiced/unvoiced decisions using traditional signal-processing methods on the CIMF, which is attractive and noise robust.

Figure 6 illustrates the performance of an adaptive VAD algorithm using ZFFPR and EMD for realistic speech data.

Figure 6 Performance of EMD-Based VAD for Realistic Speech. (A) Original realistic speech. (B) Voiced–unvoiced classification. (C) Output of EMD-based VAD.

5.3 Result

The performance, in terms of %IDR, of the proposed SI task employing the adaptive EMD-based VAD and of the baseline SI task employing energy-based VAD is shown in Table 2. For a 64-mixture GMM speaker model, the proposed SI method with EMD-based VAD consistently provided improved recognition performance over the energy-based VAD in the baseline system for the sets of 10, 20, and 30 speakers, as shown in Table 2. For the 128-mixture GMM, the proposed method also shows improvement over the energy-based VAD for the 20-speaker set.

Table 2

Experimental results illustrating the performance comparison between energy-based VAD and EMD-based VAD in a speaker identification scenario in terms of percentage identification rate (%IDR).

Speakers   Technique            Gaussian mixtures
                                16    32    64    128   256
10         Energy-based VAD     37    45    53    45    45
           EMD-based VAD        35    45    56    45    40
20         Energy-based VAD     40    45    52    50    48
           EMD-based VAD        38    42    53    52    41
30         Energy-based VAD     37    38    47    48    37
           EMD-based VAD        30    33    50    47    37

The change in %IDR with population size is shown in Figure 7. It is evident from these results that the VAD module at the front end of the SI task plays a significant role under degraded or mismatched conditions. Furthermore, the results show that the adaptive EMD-based VAD, when employed at the front end of the SI task, provides an encouraging improvement in recognition performance compared with the energy-based VAD.

Figure 7 Performance Comparison of Energy-Based VAD and EMD-Based VAD in Terms of %IDR in a Text-Independent Language-Independent SI Scenario Under Realistic Conditions.

5.4 Discussion

In this article, a text-independent, language-independent (multilingual text-independent) SI task with EMD-based VAD at its front end is proposed for realistic or mismatched conditions. The performance of the proposed task is compared with that of a baseline SI task in which energy-based VAD is used at the front end. The improvement in recognition performance when EMD-based VAD is employed at the front end of the proposed SI task under mismatched conditions is mostly due to the abilities of EMD to decompose nonlinear nonstationary data without making any a priori assumptions, to separate the different frequency components existing in the data with the help of a basis derived from the data itself, and, through its dyadic filter property, to filter white noise and fractional Gaussian noise at the front end of the SI task. Furthermore, the detection of a CIMF, among the set of IMFs obtained through EMD of speech, that preserves the characteristics of the original data and the significant source excitation information even under realistic or degraded conditions, using the ZFFPR at the front end of the SI task for VAD, largely provides the enhanced recognition performance over energy-based VAD.

6 Conclusion

The study of SI tasks with EMD-based VAD employed at the front end showed improved recognition performance, as discussed earlier. However, as the EMD-based VAD is new among its kind, further study is necessary to prove its strength under realistic or mismatched conditions with large population sizes. Furthermore, fine tuning of the ZFFPR and EMD algorithmic parameters, and study of the effect of the sampling frequency on the performance of EMD in applications such as speech recognition and SR under realistic conditions with large population sizes, are necessary. Recently, the significance of vowel-like regions for SV under degraded conditions has been studied in Refs. [42, 43]. In this line of study, further exploration is needed to adaptively extract the vowel-like regions in speech utterances under realistic conditions, using the voiced regions extracted by the adaptive VAD algorithm with ZFFPR and EMD, for the SV scenario. Furthermore, adaptive segmentation into vowel-like and non-vowel-like regions using EMD-based VAD under realistic conditions may provide further enhancement of recognition performance in speech-processing applications. In this study, preemphasis of speech is not carried out at the front end of either the baseline or the proposed method, as this is only an initial study of EMD-based VAD in an SI scenario under realistic conditions, intended to gain insight into the efficacy of EMD-based VAD.


Corresponding author: M.S. Rudramurthy, Department of Information Science and Engineering, Siddaganga Institute of Technology, B.H. Road SIT Extension, Tumkur 572103, Karnataka, India, Phone: 09611207552, Fax: 0816-2282994, e-mail: ;

Bibliography

[1] E. Ambikairajah, Emerging features for speaker recognition, in: Proceedings of 6th International Conference on Information, Communication and Signal Processing, pp. 1–7, Singapore, 2007. doi:10.1109/ICICS.2007.4449889.

[2] T. V. Ananthapadmanabha and B. Yegnanarayana, Epoch extraction of voiced speech, IEEE T. Acoust. Speech. 23 (1975), 562–570. doi:10.1109/TASSP.1975.1162745.

[3] T. V. Ananthapadmanabha and B. Yegnanarayana, Epoch extraction from linear prediction residual for identification of closed glottis interval, IEEE T. Acoust. Speech. 27 (1979), 309–319. doi:10.1109/TASSP.1979.1163267.

[4] B. Atal, Automatic recognition of speakers from their voices, P. IEEE 64 (1976), 460–475. doi:10.1109/PROC.1976.10155.

[5] V. Bajaj and R. B. Pachori, Classification of seizure and nonseizure EEG signals using empirical mode decomposition, IEEE T. Inf. Technol. B. 16 (2012), 1135–1142. doi:10.1109/TITB.2011.2181403.

[6] V. Bajaj and R. B. Pachori, Epileptic seizure detection based on the instantaneous area of analytic intrinsic mode functions of EEG signals, Biomed. Eng. Lett. 3 (2013), 17–21. doi:10.1007/s13534-013-0084-0.

[7] S. Bao, L. J. Pietrafesa, N. E. Huang, Z. Wu, D. A. Dickey, P. T. Gayes and T. Yan, An empirical study of tropical cyclone activity in the Atlantic and Pacific Oceans: 1851–2005, Adv. Adaptive Data Anal. 3 (2011), 291–307.

[8] L. E. Baum and T. Petrie, Statistical inference for probabilistic functions of finite state Markov chains, Ann. Math. Stat. 37 (1966), 1554–1563. doi:10.1214/aoms/1177699147.

[9] H. Beigi, Fundamentals of Speaker Recognition, p. 197, Springer, New York, ISBN 978-0-387-77591-3.

[10] A. Berstein and I. Shallom, A hypothesized Wiener filtering approach to noisy speech recognition, in: ICASSP, pp. 913–916, 1991. doi:10.1109/ICASSP.1991.150488.

[11] B. P. Bogert, M. J. R. Healy and J. W. Tukey, The quefrency alanysis of time series for echoes: cepstrum, pseudo autocovariance, cross-cepstrum and saphe cracking, in: Proceedings of the Symposium on Time Series Analysis, M. Rosenblatt, ed., Chapter 15, pp. 209–243, Wiley, New York, 1963.

[12] S. Boll, Suppression of acoustic noise in speech using spectral subtraction, IEEE T. Acoust. Speech. 27 (1979), 113–120. doi:10.1109/TASSP.1979.1163209.

[13] J.-F. Bonastre, F. Bimbot, L.-J. Boe, J. P. Campbell, D. A. Reynolds and I. Magrin-Chagnolleau, Person authentication by voice: a need for caution, in: Proceedings of Eurospeech, pp. 33–36, Geneva, Switzerland, ISCA, 1–4 September 2003.

[14] A. O. Boudra and J. C. Cexus, EMD-based signal filtering, IEEE T. Instrum. Meas. 56 (2007), 2196–2202. doi:10.1109/TIM.2007.907967.

[15] P. Burrascano, Learning vector quantization for the probabilistic neural network, IEEE T. Neural Networ. 2 (1991), 458–461. doi:10.1109/72.88165.

[16] J. P. Campbell, Speaker recognition – a tutorial, P. IEEE 85 (1997), 1437–1462. doi:10.1109/5.628714.

[17] J. C. Chan and P. W. Tse, A novel, fast, reliable data transmission algorithm for wireless machine health monitoring, IEEE T. Reliab. 58 (2009), 295–304. doi:10.1109/TR.2009.2020479.

[18] P. M. Crowley and T. Schildt, An analysis of the embedded frequency content of macroeconomic indicators and their counterparts using the Hilbert–Huang transform, Bank of Finland Research Discussion Papers 33, 2009. doi:10.2139/ssrn.1513266.

[19] S. B. Davis and P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE T. Acoust. Speech. 28 (1980), 357–366. doi:10.1109/TASSP.1980.1163420.

[20] A. de la Torre, A. Peinado, J. Segura, J. Perez-Cordoba, M. Benitez and A. Rubio, Histogram equalization of speech representation for robust speech recognition, IEEE T. Speech Audi. P. 13 (2005), 355–366. doi:10.1109/TSA.2005.845805.

[21] G. R. Doddington, Speaker recognition – identifying people by their voices, P. IEEE 73 (1985), 1651–1664. doi:10.1109/PROC.1985.13345.

[22] A. Einstein, Sidelights on Relativity, 56 pp., Dover, Mineola, NY, 1983, ISBN 9780486245119.

[23] M. Faundez-Zanuy and E. Monte-Moreno, State-of-the-art in speaker recognition, IEEE A&E Syst. Mag. 20 (2005), 7–12. doi:10.1109/MAES.2005.1432568.

[24] P. Flandrin, G. Rilling and P. Goncalves, Empirical mode decomposition as a filter bank, IEEE Signal Proc. Lett. 11 (2004), 112–114. doi:10.1109/LSP.2003.821662.

[25] S. Furui, Cepstral analysis technique for automatic speaker verification, IEEE T. Acoust. Speech. 29 (1981), 254–272. doi:10.1109/TASSP.1981.1163530.

[26] S. Furui, 50 years of progress in speech and speaker recognition research, ECTI Transact. Computer Inform. Technol. 1 (2005), 64–74.

[27] H. Gish and M. Schmidt, Text independent speaker identification, IEEE Signal Proc. Mag. 11 (1994), 18–32.

[28] T. Hasan and J. H. L. Hansen, Robust speaker recognition in non-stationary room environments based on empirical mode decomposition, in: Proceedings of InterSpeech, Florence, Italy, 2011. doi:10.21437/Interspeech.2011-150.

[29] L. He, M. Lech, N. C. Maddage and N. B. Allen, Study of empirical mode decomposition and spectral analysis for stress and emotion classification in natural speech, Biomedical Signal Processing and Control 6 (2011), 139–146. doi:10.1016/j.bspc.2010.11.001.

[30] L. Hou and J. Xie, A new approach to extract formant instantaneous characteristics for speaker identification, International Journal of Computer Information Systems and Industrial Management Applications 1 (2009), 95–302.

[31] H. Huang and J. Pan, Speech pitch determination based on Hilbert–Huang transform, Signal Proc. 86 (2006), 792–803. doi:10.1016/j.sigpro.2005.06.011.

[32] N. E. Huang and S. S. P. Shen, Hilbert–Huang Transform and Its Applications, Interdisciplinary Mathematical Sciences, vol. 5, World Scientific Co. Pvt. Ltd., Singapore, 2005. doi:10.1142/5862.

[33] N. E. Huang and Z. Wu, A review on Hilbert–Huang transform: method and its applications to geophysical studies, Rev. Geophys. 46 (2008), 1–23. doi:10.1029/2007RG000228.

[34] N. E. Huang, L. W. Salvino, Y.-Y. Nieh, G. Wang and X. Chen, HHT-based structural health monitoring, in: Health Assessment of Engineered Structures, pp. 203–240, 2013. doi:10.1142/9789814439022_0008.

[35] N. E. Huang, Nonlinear evolution of water waves: Hilbert’s view, in: Proceedings of the International Symposium on Experimental Chaos, 2nd ed., W. Ditto et al., eds., pp. 327–341, World Scientific, Scotland, UK, 1995.

[36] N. E. Huang, Z. Shen and S. R. Long, A new view of nonlinear water waves: the Hilbert spectrum, Annu. Rev. Fluid Mech. 31 (1999), 417–457. doi:10.1146/annurev.fluid.31.1.417.

[37] N. E. Huang, Z. Shen, S. R. Long, M. L. Wu, H. H. Shih, Q. Sheng, N. C. Yen, C. C. Tung and H. H. Liu, The empirical mode decomposition and Hilbert spectrum for nonlinear and nonstationary time series analysis, P. Roy. Soc. Lond. A 454 (1998), 903–995. doi:10.1098/rspa.1998.0193.

[38] T. Kinnunen, E. Karpov and P. Franti, Real-time speaker identification and verification, IEEE T. Speech Audi. P. 14 (2006), 277–288. doi:10.1109/TSA.2005.853206.

[39] Y. Kopsinis and S. McLaughlin, Development of EMD-based denoising methods inspired by wavelet thresholding, IEEE T. Signal Proces. 57 (2009), 1351–1362. doi:10.1109/TSP.2009.2013885.

[40] Y. Linde, A. Buzo and R. M. Gray, An algorithm for vector quantizer design, IEEE T. Commun. 28 (1980), 84–95. doi:10.1109/TCOM.1980.1094577.

[41] T. Lizhen, Z. Ping and W. Xing, A speaker verification system based on EMD, in: 3rd International Conference on Genetic and Evolutionary Computing, October 2009, WGEC 09, pp. 553–556, 2009.

[42] S. R. Mahadeva Prasanna and G. Pradhan, Significance of vowel-like regions for speaker verification under degraded conditions, IEEE Trans. Audio, Speech Lang. Proces. 19 (2011), 2552–2565. doi:10.1109/TASL.2011.2155061.

[43] S. R. Mahadeva Prasanna and G. Pradhan, Speaker verification by vowel and non vowel like segmentation, IEEE Trans. Audio, Speech Lang. Proces. 21 (2013), 854–867. doi:10.1109/TASL.2013.2238529.

[44] S. Mellone, L. Palmerini, A. Cappello and L. Chiari, Hilbert–Huang-based tremor removal to assess postural properties from accelerometers, IEEE T. Bio. Eng. 58 (2011), 1752–1761. doi:10.1109/TBME.2011.2116017.

[45] J. Ming, T. J. Hazen, J. R. Glass and D. A. Reynolds, Robust speaker recognition in noisy conditions, IEEE Trans. Audio, Speech Lang. Proces. 15 (2007), 1711–1723. doi:10.1109/TASL.2007.899278.

[46] K. I. Molla, K. Hirose, N. Minematsu and K. Hasan, Voiced/unvoiced detection of speech signals using empirical mode decomposition model, in: International Conference on Information and Communication Technology, ICICT-07, pp. 311–314, 2007. doi:10.1109/ICICT.2007.375400.

[47] M. K. I. Molla, K. Hirose and N. Minematsu, Robust voiced/unvoiced speech classification using empirical mode decomposition and periodic correlation mode, in: Proceedings of InterSpeech 2008, pp. 2530–2533, Brisbane, Australia, 2008.

[48] K. S. R. Murty and B. Yegnanarayana, Epoch extraction from speech signals, IEEE T. Audio, Speech Lang. Proces. 16 (2008), 1602–1613. doi:10.1109/TASL.2008.2004526.

[49] M. Nalina, M. S. Rudramurthy and R. Kumaraswamy, EMD based VAD as preprocessing for speech recognition in noisy environment, in: National Conference on Recent Advances in Electronics and Communication Engineering (NCRAECE-13), pp. 344–348, 2013.

[50] S. Ong and C.-H. Yang, A comparative study of text-independent speaker identification using statistical features, Int. J. Comput. Eng. Manage. 6 (1998), 40–51.

[51] S. J. Orfanidis, Introduction to Signal Processing, Prentice Hall, International Edition, Upper Saddle River, NJ, ISBN 0-13-209172-0, 1995.

[52] J. Ortega-Garcia and J. Gonzalez-Rodriguez, Overview of speaker enhancement techniques for automatic speaker recognition, in: Proceedings of Fourth International Conference on Spoken Language Processing (ICSLP), vol. 2, October 1996, pp. 929–932, 1996.

[53] R. B. Pachori, Discrimination between ictal and seizure-free EEG signals using empirical mode decomposition, Research Letters in Signal Processing 2008, 5 pp., Article ID 293056. doi:10.1155/2008/293056.

[54] R. B. Pachori and V. Bajaj, Analysis of normal and epileptic seizure EEG signals using empirical mode decomposition, Comput. Meth. Prog. Bio. 104 (2011), 373–381. doi:10.1016/j.cmpb.2011.03.009.

[55] R. B. Pachori and S. Patidar, Epileptic seizure classification in EEG signals using second-order difference plot of intrinsic mode functions, Comput. Meth. Prog. Bio. 113 (2014), 494–502. doi:10.1016/j.cmpb.2013.11.014.

[56] S. T. G. Raghukanth and S. Sangeetha, Empirical mode decomposition of earthquake accelerograms, Adv. Adaptive Data Anal. 4 (2012), 1250022. doi:10.1142/S1793536912500227.

[57] D. A. Reynolds, Automatic speaker recognition using Gaussian mixture speaker models, The Lincoln Laboratory Journal 8 (1995), 173–192.

[58] D. Reynolds, Speaker identification and verification using Gaussian mixture speaker models, Speech Commun. 17 (1995), 91–108. doi:10.1016/0167-6393(95)00009-D.

[59] D. Reynolds and R. C. Rose, Robust text-independent speaker identification using Gaussian mixture speaker models, IEEE T. Speech Audi. P. 3 (1995), 72–83. doi:10.1109/89.365379.

[60] G. Rilling and P. Flandrin, One or two frequencies? The empirical mode decomposition answers, IEEE T. Signal Proces. 56 (2008), 85–95. doi:10.1109/TSP.2007.906771.

[61] M. S. Rudramurthy, V. Kamakshi Prasad and R. Kumaraswamy, Voice activity detection algorithm using zero frequency filter assisted peaking resonator and empirical mode decomposition, J. Intell. Syst. 22 (2013), 269–282. doi:10.1515/jisys-2013-0036.

[62] S. Senapati and G. Saha, Speaker identification by joint statistical characterization in the Log Gabor wavelet domain, Int. J. Intell. Sys. Technol. 2 (2007), 69–77.

[63] S. S. Stevens and J. Volkman, The relation of pitch to frequency: a revised scale, Am. J. Psychol. 53 (1940), 353. doi:10.2307/1417526.

[64] S. Suhadi, S. Stan, T. Fingscheidt and C. Beaugeant, An evaluation of VTS and IMM for speaker verification in noise, in: Eurospeech-2003, pp. 1669–1672, 2003.

[65] R. Togneri and D. Pullella, An overview of speaker identification: accuracy and robustness issues, IEEE Circuits Syst. Mag. 11 (2011), 23–61. doi:10.1109/MCAS.2011.941079.

[66] O. Viikki, D. Bye and K. Laurila, A recursive feature vector normalization approach for robust speech recognition in noise, in: Proceedings of ICASSP, pp. 733–736, 1998.

[67] S. Wang, L. Yu and K. K. Lai, Forecasting crude oil price with an EMD-based neural network ensemble learning paradigm, Energy Econ. 30 (2008), 2623–2635. doi:10.1016/j.eneco.2008.05.003.

[68] Z. Wu and N. E. Huang, A study of the characteristics of white noise using the empirical mode decomposition method, P. Roy. Soc. Lond. A 460 (2004), 1597–1611. doi:10.1098/rspa.2003.1221.

[69] Z. Wu and N. E. Huang, Ensemble empirical mode decomposition: a noise-assisted data analysis method, Adv. Adaptive Data Anal. 1 (2009), 1–41. doi:10.1142/S1793536909000047.

[70] J.-D. Wu and Y.-J. Tsai, Speaker identification system using empirical mode decomposition and an artificial neural network, Expert Syst. Appl. 38 (2011), 6112–6117. doi:10.1016/j.eswa.2010.11.013.

[71] K.-H. Wu, C.-P. Chen and B.-F. Yeh, Noise robust speech feature processing with empirical mode decomposition, Eurasip J. Audio Speech Music P. 2011 (2011), 1–9.

[72] N. E. Huang, M.-L. Wu, W. Qu, S. R. Long and S. S. P. Shen, Applications of Hilbert–Huang transform to nonstationary financial time series analysis, Appl. Stoch. Model. Bus. 19 (2003), 245–268. doi:10.1002/asmb.501.

[73] Z. Wu, N. E. Huang, J. M. Wallace, B. Smoliak and X. Chen, On the time-varying trend in global-mean surface temperature, Clim. Dynam. 37 (2011), 759–773.

[74] X. Wu, E. K. Schneider, B. P. Kirtman, E. S. Sarachik, N. E. Huang and C. J. Tucker, The modulated annual cycle: an alternative reference frame for climate anomalies, Clim. Dynam. 31 (2008), 823–841. doi:10.1007/s00382-008-0437-z.

[75] H. Xie and Z. Wang, Mean frequency derived via Hilbert–Huang transform with application to fatigue EMG signal analysis, Comput. Meth. Prog. Bio. 82 (2006), 114–120. doi:10.1016/j.cmpb.2006.02.009.

[76] R. Yan and R. X. Gao, Hilbert–Huang transform-based vibration signal analysis for machine health monitoring, IEEE T. Instrum. Meas. 55 (2006), 2320–2339. doi:10.1109/TIM.2006.887042.

[77] D.-Y. Zhang, W.-M. Zuo, D. Zhang, H.-Z. Zhang and N.-M. Li, Wrist blood flow signal-based computerized pulse diagnosis using spatial and spectrum features, J. Bio. Sci. Eng. 3 (2010), 361–366. doi:10.4236/jbise.2010.34050.

Received: 2013-11-2
Published Online: 2014-4-2
Published in Print: 2014-12-1

©2014 by De Gruyter

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
