BY-NC-ND 3.0 license Open Access Published by De Gruyter April 24, 2013

Vowel Phoneme Segmentation for Speaker Identification Using an ANN-Based Framework

  • Mousmita Sarma and Kandarpa Kumar Sarma

Abstract

Vowel phonemes are a part of any acoustic speech signal. Vowel sounds occur in speech more frequently and with higher energy. Therefore, vowel phonemes can be used to extract different amounts of speaker-discriminative information in situations where the acoustic information is noise corrupted. This article presents an approach to identifying a speaker using the vowel sound segmented out from words spoken by the speaker. The work uses a combined self-organizing map (SOM)- and probabilistic neural network (PNN)-based approach to segment the vowel phoneme. The segmented vowel is later used to identify the speaker of the word by matching the patterns with a learning vector quantization (LVQ)-based code book. The LVQ code book is prepared by taking features of clean vowel phonemes uttered by the male and female speakers to be identified. The proposed work formulates a framework for the design of a speaker-recognition model of the Assamese language, which is spoken by ∼3 million people in the Northeast Indian state of Assam. The experimental results show that the segmentation success rates obtained using the SOM-based technique provide an increase of at least 7% compared with the discrete wavelet transform-based technique. This increase contributes to an improvement in the overall performance of speaker identification of ∼3% compared with earlier related works.

1 Introduction

Phonemes are a linguistic abstraction with a high degree of variation in length in the acoustic speech signal and are therefore difficult to separate into distinct segments. The acoustic appearance of a phoneme varies according to its context as well as from speaker to speaker. Vowel phonemes are a part of any acoustic speech signal. Vowel sounds occur in speech more frequently and with higher energy. Therefore, vowel phonemes can be used to extract different amounts of speaker-discriminative information in situations where the acoustic information is noise corrupted. The use of vowel sounds as a basis for speaker identification was initiated long ago by the Speech Processing Group of the University of Auckland [25]. Since then, phoneme-recognition algorithms and related techniques have received considerable attention in the problem of speaker recognition and have even been extended to the linguistic domain. The role of the vowel phoneme is still an open issue in the field of speaker verification or identification. This is because vowel phoneme-based pronunciation varies with regional and linguistic diversity. Hence, segmented vowel speech slices can be used to track regional variation in the way the speaker speaks the language. This is even more so in the case of a language such as Assamese, which is spoken by ∼3 million people in the Northeast Indian state of Assam and has huge linguistic and cultural diversity, which influences the way people speak the language. Therefore, an efficient vowel segmentation technique will be effective in speaker identification as well as in applications such as speech-to-text conversion and voice-activated systems.

This article presents a vowel phoneme segmentation-based speaker-identification technique. A self-organizing map (SOM)- and probabilistic neural network (PNN)-based vowel segmentation technique is used to separate the vowel phoneme from words spoken by some trained Assamese male and female speakers. The set is further extended by recording the speech of the same group of persons under at least three to five different recording environments. The proposed technique is novel because such a composite ANN framework provides different segments of underlying structures and takes the most probable decision with the help of prior knowledge of the vowel pattern and characteristics. This work uses a set of samples formed by several speakers with gender, utterance, and recording background variation. The set is first used for training the framework and then for testing it. A separate database of clean vowel phonemes is created by the same set of speakers. This clean vowel database is used to design a learning vector quantization (LVQ)-based code book by means of linear predictive coding (LPC) residual features. The linear prediction error sequence provides the speaker source information by subtracting the vocal tract effect, and therefore, it can be used as an effective feature for speaker recognition. Thus, the LVQ code book contains a unique code for each speaker in terms of the vowel sounds’ source pattern. Speaker identification is carried out by first segmenting the vowel sound from the speaker’s speech signal and then matching the vowel pattern with the LVQ code book. Assamese is a widely spoken language in Northeast India with vast linguistic diversity across different regions of the state and provides a sound area for research in phoneme-based speaker recognition. The objective of this work is to generate a platform for designing a speaker-recognition model using such regional sound variation between speakers. Very few works have explored the use of such ANN-based segmentation [18]. Moreover, in the case of Devanagari-based speech recognition, no such report has been observed. An earlier work [17] reported the role of vowel onset time (VOT) as a discriminative feature for speaker verification. Similarly, another [1] used the formant frequencies of vowels to differentiate speakers from one another. Another recent work [16] of a similar nature but in different Indian languages has focused on source excitation and related features. There are also other similar studies [8, 11, 14, 15, 20] reported on other Indian languages, but to the best of our knowledge, no speaker-recognition work has been reported on Assamese or other Indian languages that is based on a vowel phoneme segmentation technique whereby the individual speaker pattern of the vowel is used to differentiate speakers.

The description included here is organized as follows. Section 2 provides a brief account of the regional and phonemic variations of the Assamese language. The role of the linear prediction (LP) residual features in speaker identification is explained in Section 3. The proposed SOM- and PNN-based vowel segmentation technique and the LVQ code book-based speaker identification are explained in Sections 4.1 and 4.2, respectively. The results and related discussion are included in Section 5. Section 6 concludes the description.

2 Regional and Phonemic Diversity in Assamese Language

Assamese is an Indo-Aryan language that originated from the Vedic dialects and is therefore a sister of all the northern Indian languages. Although the exact nature of the origin and growth of the language is yet to be ascertained, it is supposed that, like other Aryan languages, Assamese was derived from Apabhraṁśa dialects that developed from Mãgadhi Prakrit of the eastern group of Sanskritic languages [7]. While retaining certain features of its parent Indo-European family, it has many characteristic phonological forms that make Assamese speech unique and hence require an exclusive study to develop a speech-recognition or speaker-recognition system in Assamese (courtesy of Prof. Gautam Baruah [2]).

There are 23 consonant and 8 vowel phonemes in standard colloquial Assamese. The Assamese vowel phoneme chart, obtained from the work of Goswami [7], is shown in Figure 1.

Figure 1 Chart of Assamese Vowel Phonemes [7].

The eight vowels present three different types of contrasts [7]. First, they show an eight-way contrast in closed syllables, and in open syllables when /i u/ do not follow in the next immediate syllable with the intervention of a single consonant other than the nasals. Second, they show a six-way contrast in open syllables with /i/ occurring in the immediately following syllable with the intervention of any single consonant except the nasals, or except with nasalization. Finally, they show a five-way contrast in open syllables when /u/ occurs in the immediately following syllable with a single consonant intervening [7].

Along with the distinctive phonemic diversity of the Assamese language, several regional dialects are recognized in different regions of the state. Key dialects are standard (central), eastern, Kamrupi, Goalparia, Mayang, and Jharwa (pidgin) Assamese. Dialects vary primarily with respect to phonology, most often due to variation in the occurrence of the vowel sounds. For example, the Assamese word for the vegetable gourd is pronounced differently in the above dialects: in the Kamrupi dialect, it is pronounced /kumra/, whereas in standard Assamese, it is pronounced /komora/. Similarly, in standard Assamese, king is pronounced /raoza/, whereas in the Kamrupi and Goalparia dialects, it is pronounced /raza/ [6]. Thus, from region to region and speaker to speaker, the language shows notable variations, which clearly reflects the importance of designing an Assamese speaker-recognition system capable of dealing with such regional variation in the occurrence of vowel sounds in the same word.

3 Certain Theoretical Considerations of the ANN Framework and Linear Prediction Coding

As the distinctive feature of this work, we have used SOM, PNN, and LVQ to formulate the ANN framework for the segmentation part, together with linear prediction residuals as features. In the following subsections, we briefly highlight the relevant notions of these techniques.

3.1 SOM, PNN, and LVQ

A brief treatment of the three soft computational tools is presented here, with descriptions modified to make it as relevant as possible for the present work.

3.1.1 SOM

This is a special form of ANN trained with an unsupervised paradigm of learning. It follows a competitive method of learning that enables it to work effectively as a feature map, classifier, and, at times, filter. SOM has the special property of effectively creating spatially organized “internal representations” of various features of input signals and their abstractions [12]. SOM can be considered a data visualization technique, i.e., it provides an underlying structure of the data [9]. This idea is used in our vowel segmentation technique. If we train the same SOM with different epochs or iteration numbers, the SOM provides a weight vector consisting of the winning neuron along with its neighbors. Thus, with different epochs, different internal segments or patterns of the speech signal can be obtained. Suppose we have a one-dimensional field of neurons, and the LPC samples of a spoken word uttered by the speaker form an input vector

Xk = [x1, x2, …, xn].

When such an input vector is presented to the field of neurons, the algorithm starts to search for the best matching weight vector Wi and thereby identifies a neighborhood around the winning neuron i. While doing so, it tries to minimize the Euclidean distance ‖Xk − Wi(k)‖. The adaptation of the algorithm takes place according to the relation

Wi(k + 1) = Wi(k) + ηk(Xk − Wi(k)),

where the learning rate ηk is a decreasing function of the iteration index k.

Thus, with the change of epoch number, different Wi will be obtained.
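As an illustration only (not the authors' implementation), a minimal Python sketch of this one-dimensional SOM update is given below; the decay schedules for the learning rate and neighborhood radius, the map size, and the initialization are assumptions.

import numpy as np

def train_1d_som(inputs, n_neurons, epochs, eta0=0.5, radius0=2):
    """Train a one-dimensional SOM; inputs is a 2-D array with one feature vector per row."""
    rng = np.random.default_rng(0)
    dim = inputs.shape[1]
    W = rng.uniform(-0.01, 0.01, size=(n_neurons, dim))      # small random initial weights
    for k in range(epochs):
        eta = eta0 * (1.0 - k / epochs)                       # decreasing learning rate (assumed schedule)
        radius = max(0, int(round(radius0 * (1.0 - k / epochs))))
        for x in inputs:
            dists = np.linalg.norm(x - W, axis=1)             # Euclidean distance to every neuron
            win = int(np.argmin(dists))                       # winning neuron
            lo, hi = max(0, win - radius), min(n_neurons, win + radius + 1)
            W[lo:hi] += eta * (x - W[lo:hi])                  # update winner and its neighbors
    return W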

3.1.2 PNN

This is based on statistical principles from the Bayes decision strategy and non-parametric kernel-based estimators of probability density functions. PNNs handle data that have spikes and points outside the norm better than other ANNs. Therefore, the PNN is suitable for problems such as phoneme classification [23]. The PNN used in this work has the structure shown in Figure 2.

Figure 2 Structure of PNN.

Each pattern unit of the PNN forms a dot product of the input clean phoneme vector X with a weight vector Wi, Zi = X·Wi, and then performs a non-linear operation on Zi before directing its activation level to the summation layer. Thus, the dot product is followed by a non-linear neuron activation function of the form

g(Zi) = exp[(Zi − 1)/σ²],

where σ is the smoothing parameter. The summation units simply add the inputs from the pattern units that correspond to the category from which the training pattern was selected. The output, or decision, units of the PNN are two-input neurons that produce a binary decision between the two phoneme patterns obtained from the summation layer. The smoothing parameter σ plays a very important role in the proper classification of the input phoneme patterns. Because it controls the scale factor of the exponential activation function, its value should be the same for every pattern unit. As described by Specht [23], a small value of σ causes the estimated parent density function to have distinct modes corresponding to the locations of the training samples. A larger value of σ produces a greater degree of interpolation between points. A very large value of σ would cause the estimated density to be Gaussian regardless of the true underlying distribution. However, it has been found that in practical problems, it is not difficult to find a good value of σ, and the misclassification rate does not change dramatically with small changes in σ.
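A minimal Python sketch of a single pattern unit (an illustration, not the authors' code; the example vectors are arbitrary and are unit-normalized so that Zi = 1 for a perfect match) shows how σ controls the degree of interpolation:

import numpy as np

def pattern_unit_activation(x, w, sigma):
    x = x / np.linalg.norm(x)             # input phoneme feature vector (normalized)
    w = w / np.linalg.norm(w)             # stored training pattern, i.e., the weight vector Wi
    z = float(np.dot(x, w))               # Zi = X . Wi, equal to 1 for a perfect match
    return np.exp((z - 1.0) / sigma**2)   # g(Zi): small sigma -> sharp peaks, large sigma -> smooth

x = np.array([0.2, 0.4, 0.1])
w = np.array([0.25, 0.35, 0.15])
for sigma in (0.05, 0.111, 0.5):
    print(sigma, pattern_unit_activation(x, w, sigma))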

3.1.3 LVQ

This is a supervised version of vector quantization that can be used to create a code book. It models the discrimination function defined by the set of labeled code book vectors and a nearest-neighbor search between the code book and the data [3]. An LVQ network has a first competitive layer and a second linear layer. The competitive layer learns to classify the input vectors in much the same way as the competitive layer of a SOM. The linear layer transforms the competitive layer’s classes into target classifications defined by the user. The classes learned by the competitive layer are referred to as subclasses, and the classes of the linear layer as target classes. The training algorithm involves an iterative gradient update of the winner unit [13].
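For concreteness, a minimal sketch of the LVQ1 update referred to above is shown below (an illustration under assumed learning rate and epoch settings, not the package implementation of Kohonen et al. [13]): the nearest code vector is pulled toward a correctly labeled input and pushed away from a wrongly labeled one.

import numpy as np

def lvq1_train(codebook, code_labels, data, data_labels, eta=0.05, epochs=20):
    """Refine labeled codebook vectors with the LVQ1 rule."""
    codebook = codebook.astype(float).copy()
    for _ in range(epochs):
        for x, y in zip(data, data_labels):
            i = int(np.argmin(np.linalg.norm(codebook - x, axis=1)))  # nearest code vector
            sign = 1.0 if code_labels[i] == y else -1.0               # attract if correct, repel if not
            codebook[i] += sign * eta * (x - codebook[i])
    return codebook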

3.2 Linear Prediction Coding

The source filter model of the vocal tract system can be represented by a discrete-time linear time-invariant filter, which is characterized by a system function of the form given in Eq. (5):

H(z) = (b0 + b1z^−1 + … + bqz^−q) / (1 − a1z^−1 − a2z^−2 − … − apz^−p),   (5)

where the filter coefficients ak and bk change at a rate on the order of 50–100 times per second [19].

The LPC model is based on a mathematical approximation of the vocal tract. At a particular time t, the speech sample s(t) is represented as a linear sum of the p previous samples [19]. Thus, the linear prediction error sequence, or LP residual, provides the source information by suppressing the vocal tract information obtained in the form of the LPC coefficients [16]. The prediction error sequence has the form of the output of an FIR linear system whose system function is

A(z) = 1 − a1z^−1 − a2z^−2 − … − apz^−p.   (6)

This is the prediction error filter, A(z), which is an inverse filter for the vocal tract system, H(z), i.e.,

H(z) = 1/A(z).   (7)

According to Eq. (7), the zeros of A(z) are the poles of H(z) [19]. The source information obtained from the error sequence can be used as a relevant feature for speaker identification. The representation of source information in the LP residual depends upon the order of prediction. According to Pati and Prasanna [16], for a speech signal sampled at 8 kHz, the LP residual extracted using an LP order in the range of 8–20 best represents the speaker-specific source information. This work uses a prediction order of 20 to extract speaker-specific features for both clean vowels and words, which has provided a satisfactory success rate.
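A minimal Python sketch of this residual extraction (an illustration, not the authors' code) computes the 20th-order predictor by the autocorrelation method and filters the signal with A(z); the normalization step is an assumption.

import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lp_residual(s, order=20):
    """Return the LP residual e(n) = s(n) - sum_k a_k s(n-k) of a speech frame s."""
    s = np.asarray(s, dtype=float)
    s = s / (np.max(np.abs(s)) + 1e-12)                   # normalize (assumed preprocessing)
    r = np.correlate(s, s, mode='full')[len(s) - 1:]       # autocorrelation sequence
    a = solve_toeplitz(r[:order], r[1:order + 1])          # predictor coefficients a_1 ... a_p
    A = np.concatenate(([1.0], -a))                        # prediction error filter A(z)
    return lfilter(A, [1.0], s)                            # residual = output of the inverse filter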

4 Proposed SOM-, PNN-, and LVQ-Based Speaker Identification Using Vowel-Segmentation Approach

Here we propose a novel ANN framework for speaker identification using vowel phoneme segmentation. The proposed technique has two broad components; hence, we describe these two aspects separately in the following sections.

4.1 Proposed SOM- and PNN-Based Vowel Segmentation

In most approaches, the speech signals are demarcated using constant-time segmentation, for example, into 25-ms blocks [4]. Static segmentation carries the risk of losing information on phonemes – different sounds may be merged into single blocks, and individual phonemes may be lost completely. The discrete wavelet transform (DWT) is frequently used for this kind of phoneme segmentation [24]. However, DWT has the drawback that the signal must be reconstructed at each level. We approached this issue using an ANN framework that does not need reconstruction of the signal. The SOM- and PNN-based vowel segmentation technique proposed here is considered novel because such a composite ANN framework provides different segments of underlying structures and takes the most probable decision with the help of prior knowledge of the vowel pattern and characteristics. Very few works using such ANN-based segmentation have been explored previously [18], and in the case of Devanagari-based speech recognition, no such report has been observed. The present work builds a speaker-recognition system for the Assamese language that is robust enough to deal with variations resulting from linguistic, regional, and cultural backgrounds. The work uses a SOM- and PNN-based technique to segment the vowel phoneme from any three-phoneme word that the speaker utters. The validation of the vowel segmentation is checked by matching the first formant frequency (F1) of the vowel with the predetermined F1 value. The first formant frequencies of all the Assamese vowels are estimated as explained by Sarma and Sarma [21].

The weight vector obtained by training a one-dimensional SOM with the LP features of a word is used in the work to perform the segmentation. By training the same SOM block for various numbers of iterations, we get different weight vectors, each of which is considered a segment of a different phoneme constituting the word. From these segments, the relevant speech portions are recognized by pattern matching using PNNs that are trained to learn the patterns of all Assamese vowel phonemes. The SOM weight vector extraction algorithm can be summarized in the form of the block diagram shown in Figure 3.

Figure 3 SOM Segmentation Block.

The algorithm for a particular iteration can be stated mathematically as in Table 1. The SOM weight vectors thus extracted are stored as SW1, SW2, SW3, SW4, SW5, and SW6. The SOMs’ role is to provide segmentation boundaries for the phonemes. Here, six different segmentation boundaries were obtained, each with its own set of weight vectors. Figure 4 shows how the SOM weight vectors change with the change of iteration number.

Table 1

SOM Weight Vector Extraction Algorithm.

1. Input: spoken word S of size m × n, sampling frequency fs, duration T seconds
2. Preprocess the signal using the preprocessing algorithm described in Section 5.2
3. Initialize P = 20, the order of linear prediction
4. Find the coefficients of a Pth-order linear predictor FIR filter,
Q̂(n) = −a(2)Q(n−1) − a(3)Q(n−2) − … − a(P+1)Q(n−P),
which predicts the current value of the real-valued preprocessed time series Q based on past samples
5. Store a = [1, a(2), a(3), …, a(P+1)]
6. Take a topology map whose neurons are arranged in a 1×(P+1)-dimensional hexagonal pattern
7. Initialize the weights Wi(k) to small random numbers
8. Initialize the learning parameter η and the neighborhood ℵi(k)
9. for k = 1 to (P+1)

do pick a(k)

find the winning neuron as

‖a(k) − Wi(k)‖ = min1≤i≤(P+1) {‖a(k) − Wi(k)‖}

Update the synaptic vectors of the winning cluster,

Wi(k) = Wi(k) + ηk(a(k) − Wi(k))

Update ηk, ℵi(k)
10. Store the updated weights Wi(k) as SWj, where j = 1:6
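A minimal Python sketch of this extraction (an illustration only; the iteration counts, decay schedules, and initialization are assumptions, not values from the article) trains the 1×(P+1) map on the stored coefficient vector a for six different iteration counts to obtain SW1–SW6:

import numpy as np

def som_weights(a, n_iter, eta0=0.5, radius0=2, seed=0):
    """Train a 1-D SOM on the LPC coefficient sequence a and return its weight vector."""
    rng = np.random.default_rng(seed)
    n = len(a)                                       # one neuron per coefficient: 1 x (P+1)
    W = rng.uniform(-0.01, 0.01, size=n)
    for k in range(n_iter):
        eta = eta0 * (1.0 - k / n_iter)              # decaying learning rate
        rad = max(0, int(round(radius0 * (1.0 - k / n_iter))))
        for ak in a:
            win = int(np.argmin(np.abs(ak - W)))     # winning neuron (minimum distance)
            lo, hi = max(0, win - rad), min(n, win + rad + 1)
            W[lo:hi] += eta * (ak - W[lo:hi])        # update winner and neighbors
    return W

# Six candidate segmentations from six different iteration counts (counts are illustrative).
a = np.random.randn(21)                              # stands in for [1, a(2), ..., a(P+1)]
SW = [som_weights(a, n_iter) for n_iter in (50, 100, 200, 400, 800, 1600)]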
Figure 4 SOM Weight Vectors.

Four PNNs trained with clean vowel phonemes were used for identifying the patterns of Assamese vowel speech. Here, a two-class PNN-based classification is performed, where each of the four PNNs, named PNN1, PNN2, PNN3, and PNN4, is trained with two clean vowel phonemes, i.e., the output classes of PNN1 are /i/ and /u/, the output classes of PNN2 are /e/ and /o/, and so forth. These four PNNs are used sequentially to identify the segmented vowel phonemes. Clean vowels recorded from five male and five female speakers are subsequently presented to the input layer of the PNN, giving each neuron the scope to learn the patterns and to group them according to the derived decision. This happens in the pattern layer of the PNN. These speakers are again asked to record five variations of eight different vowel utterances under at least three to five types of background noise content. The details of the data sets used are provided in Section 5.1. These samples are used for training and testing purposes. The PNN learning algorithm can be stated as in Table 2.

Table 2

PNN Learning Algorithm.

1. Statement: classify input vowel patterns X into two categories of vowel, VOWEL-A and VOWEL-B
2. Initialize: smoothing parameter σ = 0.111 (determined from observation of successful learning)
3. Output of each pattern unit,
Zi = X·Wi
(Wi is the weight vector)
4. Find the neuron activation function by performing a non-linear operation of the form
g(Zi) = exp[(Zi − 1)/σ²]
5. Sum all the g(Zi) for category VOWEL-A and do the same for category VOWEL-B
6. Take a binary decision for the two summation outputs with variable weight given by
C = −(hB lB nA)/(hA lA nB),
where
hA and hB = a priori probabilities of occurrence of patterns from VOWEL-A and VOWEL-B, respectively,
lA and lB = losses associated with a wrong decision for VOWEL-A and VOWEL-B, respectively,
nA and nB = numbers of training patterns in VOWEL-A and VOWEL-B, respectively, which is 10 for both categories
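A minimal Python sketch of this two-class decision (an illustration, not the authors' code; equal priors and losses, unit-normalized pattern vectors, and the helper names are assumptions) follows Table 2:

import numpy as np

def pnn_decide(x, patterns_a, patterns_b, sigma=0.111,
               h_a=0.5, h_b=0.5, l_a=1.0, l_b=1.0):
    """Classify x as 'VOWEL-A' or 'VOWEL-B' with a two-class PNN."""
    x = x / np.linalg.norm(x)
    def summation(patterns):
        ws = patterns / np.linalg.norm(patterns, axis=1, keepdims=True)
        z = ws @ x                                           # pattern-unit outputs Zi = X . Wi
        return float(np.sum(np.exp((z - 1.0) / sigma**2)))   # summation-unit output
    pa, pb = np.asarray(patterns_a, float), np.asarray(patterns_b, float)
    c = (h_b * l_b * len(pa)) / (h_a * l_a * len(pb))        # decision weight after Specht [23]
    return 'VOWEL-A' if summation(pa) > c * summation(pb) else 'VOWEL-B'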

A two-step decision is taken by the recognition algorithm. First, the vowel segment is matched against the PNN patterns. Next, its first formant frequency F1 is checked to determine whether it lies in the predetermined range. The PNN- and F1-based recognition algorithm for a particular vowel /i/ can be stated as in Table 3. This process is repeated for the complete set of samples. Testing is done after training, as well as after validating the extent of learning that takes place in the framework within the stipulated time.

Table 3

PNN- and F1-Based Recognition Algorithm.

1. Input: speech S of size m × n, sampling frequency fs, duration T seconds
2. Preprocess the signal using the preprocessing algorithm described in Section 5.2
3. Obtain SW1, SW2, SW3, SW4, SW5, and SW6 using the SOM weight vector extraction algorithm described in Table 1
4. Find F1 of SW1, SW2, SW3, SW4, SW5, and SW6, and store them as FSW1, FSW2, FSW3, FSW4, FSW5, and FSW6
5. Load PNN1
6. Decide VOWEL-A
if SW1 = VOWEL-A and FSW1 = F1 of vowel /i/,
else if SW2 = VOWEL-A and FSW2 = F1 of vowel /i/,
else if SW3 = VOWEL-A and FSW3 = F1 of vowel /i/,
else if SW4 = VOWEL-A and FSW4 = F1 of vowel /i/,
else if SW5 = VOWEL-A and FSW5 = F1 of vowel /i/,
else if SW6 = VOWEL-A and FSW6 = F1 of vowel /i/;
else decide “Not Assamese Vowel Phoneme /i/”.
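A minimal Python sketch of this two-step check (an illustration only: the F1 range for /i/ is a placeholder, the candidate segments are assumed to be speech-like frames long enough for LPC analysis, and the PNN labels are assumed to have been computed beforehand, e.g., with the pnn_decide sketch above):

import numpy as np
from scipy.linalg import solve_toeplitz

def first_formant(seg, fs=8000, order=20):
    """Rough F1 estimate (Hz) from the angles of the LPC inverse-filter roots."""
    seg = np.asarray(seg, float) * np.hamming(len(seg))
    r = np.correlate(seg, seg, mode='full')[len(seg) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]                     # one root per conjugate pair
    freqs = np.sort(np.angle(roots) * fs / (2 * np.pi))
    freqs = freqs[freqs > 90]                             # discard near-DC roots
    return float(freqs[0]) if len(freqs) else 0.0

def select_vowel_segment(segments, pnn_labels, f1_range=(250, 400)):   # range is illustrative
    """Return the first SWj labeled VOWEL-A whose F1 lies in the expected range, else None."""
    for sw, label in zip(segments, pnn_labels):
        if label == 'VOWEL-A' and f1_range[0] <= first_formant(sw) <= f1_range[1]:
            return sw
    return None   # corresponds to "Not Assamese Vowel Phoneme /i/"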

4.2 Proposed LVQ Codebook-Based Speaker Identification

The speaker-specific information obtained from the LP method is used to design the LVQ-based code book. Clean vowels recorded from every speaker were first passed to the feature extraction block, and the resulting feature vectors were used to train the LVQ network. LVQ is a method for training competitive layers in a supervised manner. In an LVQ network, a competitive layer learns to classify input vectors into target classes chosen by the user, unlike the strictly competitive layer of the SOM [10]. The code book training uses the LVQ1 algorithm described in Section 3.1 and given by Kohonen et al. [13]. The LVQ code book provides a unique code for every speaker based on LP residual features. The speaker-recognition process uses features of the human vocal tract, which differ from individual to individual. The LPC parameters can model the characteristics of the human vocal tract through the mathematical approximation of the vocal tract transfer function. Therefore, the LPC features were found to be suitable for the work. Clean vowel phonemes such as /i/, /u/, etc., recorded from the five male and five female speakers, were used to train the LVQ code book. Testing samples were created by recording the speech of the selected speakers under at least three to five background noise variations.
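A minimal Python sketch of this codebook idea (an illustration only; the number of code vectors per speaker, the learning rate, and the seeding strategy are assumptions) trains speaker-labeled code vectors on LP-residual feature vectors of clean vowels and assigns an unknown segmented vowel to the speaker of its nearest code vector:

import numpy as np

def build_codebook(features, labels, codes_per_speaker=4, eta=0.05, epochs=30):
    """features: residual feature vectors of clean vowels; labels: speaker identities."""
    features, labels = np.asarray(features, float), np.asarray(labels)
    codebook, code_labels = [], []
    for spk in np.unique(labels):
        sel = features[labels == spk]
        codebook.extend(sel[:codes_per_speaker])          # seed codes from the speaker's own data
        code_labels.extend([spk] * len(sel[:codes_per_speaker]))
    codebook = np.array(codebook, dtype=float)
    for _ in range(epochs):                               # LVQ1 refinement
        for x, y in zip(features, labels):
            i = int(np.argmin(np.linalg.norm(codebook - x, axis=1)))
            sign = 1.0 if code_labels[i] == y else -1.0
            codebook[i] += sign * eta * (x - codebook[i])
    return codebook, code_labels

def identify_speaker(codebook, code_labels, x):
    """Nearest-code-vector speaker decision for a segmented vowel's feature vector x."""
    return code_labels[int(np.argmin(np.linalg.norm(codebook - x, axis=1)))]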

5 Experimental Details and Results

The experimental work was carried out per the flow diagram of Figure 5. First, clean vowels were used to design the LVQ-based speaker code book that stores the speaker source feature. Speaker identification was carried out by segmenting the vowel phoneme from any word uttered by the speaker and then matching the speaker patterns with the LVQ code book. The following sections provide a brief description of the speaker database and the preprocessing operation along with the experimental results.

Figure 5 Process Logic of the Proposed Algorithm.

5.1 Speaker Database and Speech Samples Collection

The database used for this work has 10 speakers (five male and five female) with the associated variations as reported earlier. In the first phase, clean vowel phonemes such as /i/, /u/, etc., uttered by the 10 different speakers were recorded, LPC analysis was performed on them, and the prediction error sequence was codified in the form of an LVQ codebook. The second phase of recording covered three-phoneme words containing the vowel phonemes, such as /xit/, /xir/, /xik/, /xis/, /dukh/, /dur/, /dut/, /dub/, /bex/, /bed/, /bes/, /bel/, /mon/, /mok/, /mor/, /mot/, /khεl/, /khεp/, /khεd/, /khεr/, /raon/, /raokh/, /raoth/, /raox/, /labh/, /laz/, /lath/, /lakh/, /nool/, /nood/, /nookh/, /noom/, etc. These words were recorded from the same male and female speakers because they were used as testing words for the speaker-identification process. We have covered five variations of words for each of the eight vowels of the Assamese language. This forms a data set of ∼400 samples with three to five different background noise variations. The set is further subdivided into training (30%), validation (30%), and testing (the rest). Thus, a total of at least 40 sample words were used for the identification of one speaker with multiple recording background noise variations. For recording the speech signals, a PC headset and the sound-recording software Gold Wave were used. The recorded speech samples have the following specifications:

  • Duration = 2 s

  • Sampling rate = 8000 samples per second

  • Bit resolution = 16 bits per sample.

5.2 Preprocessing

The preprocessing of the speech signals consists of two operations – smoothing of the signal by median filtering and removal of the silent part using a threshold method. Although the speech signals were recorded in a noise-free environment, the presence of some unwanted spikes was observed. Therefore, a median filtering operation is performed on the raw speech signals so that the vowel segmentation does not suffer from any unwanted frequency components [19, 22, 24].

The smoothed signal Ssmooth contains both speech and non-speech parts. The non-speech, or silent, part occurs in the speech signal before and after the utterance, and this time information is considered redundant for the vowel segmentation purpose. The silent part ideally has zero intensity. However, in practical cases, it is observed that even after smoothing, the silent part of the speech signal has an intensity of ∼0.04. Our preprocessing algorithm uses this intensity value as the threshold, as in the algorithm shown in Table 4. Thus, a pure signal containing only the necessary speech part is obtained.

Table 4

Preprocessing Algorithm.

1. Input: speech S of size m × n, sampling frequency fs, duration T seconds
2. Output: the speech part of the signal Q of new size 1×qsize, duration t
3. for i = 1 to n do
   for j = 1 to m do
4.   yi = (Si + Sj+1)/2
5.   end for
6. end for
7. Initialize J = 2
8. Ssmooth = median[y(i−(J/2)), y(i−(J/2)+1), …, y(i+(J/2)−1)]
9. z = Ssmooth/max(Ssmooth)
10. Initialize Th = 0.04
11. for i = 1 to n do
    if |z| > Th then q = z
    if |z| ≤ Th then q = 0
    end if
    end for
12. Initialize Q = 0
13. for i = 1 to n do
    if qi ≠ 0 then
    Q = q
    else Q = Q
    end if
    end for
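A minimal Python sketch of this preprocessing (an illustration, not the authors' code; the median-filter window length is an assumption) performs the smoothing and then drops the samples below the 0.04 silence threshold:

import numpy as np
from scipy.signal import medfilt

def preprocess(s, threshold=0.04, kernel=3):
    """Median-smooth a speech signal and remove its silent part by thresholding."""
    s_smooth = medfilt(np.asarray(s, dtype=float), kernel_size=kernel)  # remove unwanted spikes
    z = s_smooth / (np.max(np.abs(s_smooth)) + 1e-12)                   # normalize to [-1, 1]
    return z[np.abs(z) > threshold]                                      # keep only the speech part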

5.3 Vowel Segmentation Results

The vowel phoneme from any word spoken by the speaker to be identified is segmented and recognized by the SOM-based segmentation and the PNN- and F1-based vowel-recognition algorithms described in Section 4. The experiments were repeated several times, and the success rates were calculated. The vowel-recognition success rates for the various vowels are summarized in Table 5.

Table 5

SOM Segmentation Success Rate.

Sl no.  Vowel  Success rate of SOM (%)
1  /i/  98
2  /u/  95
3  /e/  96
4  /o/  97
5  /ε/  94
6  /ao/  99
7  /a/  99
8  /oo/  98

As can be seen from Table 5, the success rate obtained from SOM-based segmentation is satisfactory. To check the superiority of the present segmentation technique, we also performed the segmentation using the conventional DWT-based technique. The input spoken word is decomposed into six levels, which cover the frequency band of the human voice. It was observed that if the decomposed parts of the speech signal are reconstructed at various levels, different parts of the signal are obtained. We used the Daubechies wavelet as the mother wavelet function with four decomposition and reconstruction orthogonal wavelet filters. For a lower-order Daubechies wavelet, we obtain a shorter wavelet and better time resolution. However, the frequency response of low-order wavelets has many sidelobes. By increasing the order, we get a smoother version of the mother wavelet, which is better for analyzing the voiced signal. Therefore, following Tang et al. [24], we chose a 10th-order wavelet. It was observed that the segmentation success rate for the various vowels increased markedly (from 89.7% to 97%) with the SOM-based segmentation technique. Table 6 summarizes this performance difference.

Table 6

DWT vs. SOM Segmentation Success Rate.

Sl no.  Segmentation technique  Success rate (%)
1  DWT  89.3
2  SOM  96
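For reference, a minimal Python sketch of the DWT baseline used for comparison (an illustration only; it assumes the PyWavelets package, a db10 mother wavelet, and a six-level decomposition) reconstructs one sub-band at a time so that candidate phoneme segments can be inspected:

import numpy as np
import pywt

def dwt_subband_reconstructions(s, wavelet='db10', level=6):
    """Return per-sub-band reconstructions of a six-level DWT decomposition of s."""
    coeffs = pywt.wavedec(np.asarray(s, dtype=float), wavelet, level=level)
    bands = []
    for i in range(len(coeffs)):
        kept = [c if j == i else np.zeros_like(c) for j, c in enumerate(coeffs)]
        bands.append(pywt.waverec(kept, wavelet))   # signal portion carried by one sub-band
    return bands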

5.4 Speaker Identification Results

The segmented vowels were next applied for pattern matching with the trained LVQ code book, which discriminates between speakers. For the 10 subjects (five male and five female) in this study, each of the eight Assamese vowels was segmented from five different words recorded under three to five different background noises as described in Section 5.1. The speaker identification success rates for two different predictor sizes are summarized in Table 7. The experimental results show that with a predictor size of 20, the success rate shows an ∼3% improvement in comparison with previously reported results [5]. Tables 8 and 9 show the correct speaker identification of segmented vowels from five different words containing the vowel in the case of Boy Speaker 1 and Girl Speaker 1, respectively. The correct identification of the speakers depends directly on the segmentation process. If an error occurs in the segmentation part, then obviously, speaker identification will go wrong.

Table 7

Speaker Identification Success Rate Over all the Vowel and Word Variations.

Sl no.  Speaker  Success rate (predictor size = 15) (%)  Success rate (predictor size = 20) (%)
1  Boy speaker 1  88  95.2
2  Boy speaker 2  83.2  92.3
3  Boy speaker 3  87  99
4  Boy speaker 4  90  95
5  Boy speaker 5  80.6  95.1
6  Girl speaker 1  85  98
7  Girl speaker 2  87.8  93.7
8  Girl speaker 3  89  91.9
9  Girl speaker 4  91  94
10  Girl speaker 5  79.4  97.2
Total  Average  86.1  95.14
Table 8

Correct Identification of Boy Speaker 1.

Sl no.  Segmented vowel  Word I  Word II  Word III  Word IV  Word V
1  /i/
2  /u/
3  /e/
4  /o/
5  /ε/
6  /ao/
7  /a/
8  /oo/
Table 9

Correct Identification of Girl Speaker 1.

Sl no.  Segmented vowel  Word I  Word II  Word III  Word IV  Word V
1  /i/
2  /u/
3  /e/
4  /o/
5  /ε/
6  /ao/
7  /a/
8  /oo/

The disadvantage of the proposed speaker-identification method is that the computational time is large because the SOM has to be trained six times. Because the success rate reaches a satisfactory level, the increase in computation time is tolerable. It should be noted that if the decision is wrong, a longer time is needed because the algorithm passes through each and every decision level. Table 10 shows the computational time for correct and wrong decisions. The success rate attained, and the efficiency of recognition with varied recording backgrounds, establish the framework as an effective mechanism for speaker identification in Assamese.

Table 10

Computational Time.

Correct decision (s)  Wrong decision (s)
50–60  70–80

6 Conclusion

Here, we have proposed a prototype model for Assamese speaker identification using a combined technique of vowel segmentation and a vowel-based speaker code book. The accuracy of speaker identification depends directly on the vowel segmentation; that is, the segmentation must be correct to obtain proper speaker discrimination. We have shown a novel SOM-based approach to vowel segmentation. This proposed vowel segmentation technique was found to be superior to the conventional DWT-based technique in terms of success rate. The segmentation boundaries obtained from the SOM block are reinforced by a PNN unit using a pattern-matching technique. The speaker identification part is performed by an LVQ network, whose performance depends on the ability of the SOM-PNN combination to properly place the segmentation boundaries of the vowels. The success rate obtained for speaker identification is well above previously reported results. The marginal computational deficiency observed in the proposed approach can be removed with better hardware and design optimization. The work can be further improved by investigating certain speaker-dependent acoustic–phonetic features from vowels and fricatives of the language. It may also be extended to include the regional variations of native speakers with continuous speech inputs so that a complete speaker identification or verification system in Assamese can be designed.


Corresponding author: Kandarpa Kumar Sarma, Department of Electronics and Communication Technology, Gauhati University, Gopinath Bordoloi Nagar, Guwahati 781014, Assam, India, Phone: +91-361-2671262

Bibliography

[1] M. Alfaouri, K. Daqrouq and J. Al-Nabulsi, K-mean clustering and Arabic vowels formants based speaker identification system, Eur. J. Sci. Res. 42 (2010), 420–431.

[2] G. Baruah, Department of CSE, IIT Guwahati, India. tdil.mit.gov.in/AssameseCodeChartOct02.pdf.

[3] H. Demuth, M. Beale and M. Hagan, Neural network toolbox 6 user’s guide (2010). www.mathworks.com. Accessed on January 23, 2010.

[4] K. O. E. Elenius and H. G. C. Traven, Multi-layer perceptrons and probabilistic neural networks for phoneme recognition, in: Proc. Eurospeech, 3rd European Conference on Speech Communication and Technology, pp. 1237–1240, Berlin, Germany, 1993.

[5] N. Fakotakis, A. Tsopanoglou and G. Kokkinakis, A text-independent speaker recognition system based on vowel spotting, Speech Commun. 12 (1993), 57–68. doi:10.1016/0167-6393(93)90018-G.

[6] U. N. Goswami, An introduction to Assamese, Mani-Manik Prakash, Panzabar, Guwahati, Assam, India, 1978.

[7] G. C. Goswami, Structure of Assamese, 1st ed., Department of Publication, Gauhati University, Guwahati, Assam, India, 1982.

[8] B. C. Haris, G. Pradhan, A. Misra, S. R. M. Prasanna, R. K. Das and R. Sinha, Multivariability speaker recognition database in Indian scenario, Int. J. Speech Technol. 15 (2012), 441–453. doi:10.1007/s10772-012-9140-x.

[9] S. Haykin, Neural networks and learning machines, 3rd ed., PHI Learning Private Limited, New Delhi, India, 2009.

[10] J. Hollmen, V. Tresp and O. Simula, A learning vector quantization algorithm for probabilistic models, in: Proceedings of EUSIPCO-2000, II, Tampere, Finland, 2000, 721–724.

[11] H. S. Jayanna and S. R. M. Prasanna, Limited data speaker identification, Sadhana, Ind. Acad. Sci. 35 (2010), 525–546. doi:10.1007/s12046-010-0043-8.

[12] T. Kohonen, The self-organizing map, Proc. IEEE 78 (1990), 1464–1480. doi:10.1109/5.58325.

[13] T. Kohonen, J. Hynninen, J. Kangas, J. Laaksonen and K. Torkkola, LVQ PAK: the learning vector quantization program package, version 3.1, LVQ Programming Team of the Helsinki University of Technology, Laboratory of Computer and Information Science, Finland, 1995.

[14] R. Kumar, R. Ranjan, S. K. Singh, R. Kala, A. Shukla and R. Tiwari, Multilingual speaker recognition using neural network (2009). http://www.academia.edu/411825/.

[15] V. L. Lajish, R. K. Sunil Kumar and P. Vivek, Speaker identification using a nonlinear speech model and ANN, Int. J. Adv. Inform. Technol. 2 (2012), 15–24. doi:10.5121/ijait.2012.2502.

[16] D. Pati and S. R. M. Prasanna, Speaker verification using excitation source information, Int. J. Speech Technol. 15 (2012), 241–257. doi:10.1007/s10772-012-9137-5.

[17] G. Pradhan and S. R. M. Prasanna, Significance of vowel onset point information for speaker verification, Int. J. Comput. Commun. Technol. (IJCCT) 2 (2011), 60–66.

[18] B. Qian, Z. Tang, Y. Li, L. Xu and Y. Zhang, Neural network ensemble based on vowel classification for Chinese speaker recognition, in: Proceedings of the 3rd International Conference on Natural Computation, Washington, DC, USA, 2007, 3. doi:10.1109/ICNC.2007.495.

[19] L. R. Rabiner and R. W. Schafer, Digital processing of speech signals, third impression, Pearson Education, New Delhi, 2009.

[20] R. Ranjan, S. K. Singh, A. Shukla and R. Tiwari, Text-dependent multilingual speaker identification for Indian languages using artificial neural network, in: Proceedings of the 3rd International Conference on Emerging Trends in Engineering and Technology, ABV-IIITM, Gwalior, India, 2010, 632–635. doi:10.1109/ICETET.2010.23.

[21] M. Sarma and K. K. Sarma, Formant frequency estimation of phonemes of Assamese speech, in: Proceedings of the 2nd IEEE National Conference on Computational Intelligence and Signal Processing, Guwahati, Assam, India, 2012, 119–124. doi:10.1109/NCCISP.2012.6189691.

[22] M. Sarma and K. K. Sarma, Segmentation of Assamese phonemes using SOM, in: Proceedings of the 3rd IEEE National Conference on Emerging Trends and Applications in Computer Science, St. Anthony’s College, Shillong, Meghalaya, India, 2012, 121–125. doi:10.1109/NCETACS.2012.6203310.

[23] D. F. Specht, Probabilistic neural networks, Neural Networks 3 (1990), 109–118. doi:10.1016/0893-6080(90)90049-Q.

[24] B. T. Tang, R. Lang, H. Schroder, A. Spray and P. Dermody, Applying wavelet analysis to speech segmentation and classification, in: Wavelet Applications, Proc. SPIE 2242, pp. 750–761, Orlando, Florida, 1994. doi:10.1117/12.170075.

[25] T. G. Templeton and B. J. Gullemin, Speaker identification based on vowel sounds using neural networks, in: Proceedings of the 3rd International Conference on Speech Science and Technology, Melbourne, Australia, 1990, 280–285.

Received: 2012-12-24
Published Online: 2013-04-24
Published in Print: 2013-06-01

©2013 by Walter de Gruyter Berlin Boston

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
