Phone classification via manifold learning based dimensionality reduction algorithms

Heyun Huang, Louis ten Bosch, Bert Cranen, Lou Boves
CLST/CLS, Radboud University, Nijmegen, Netherlands

Speech Communication 76 (2016) 28–41. Received 8 May 2014; received in revised form 30 September 2015; accepted 29 October 2015; available online 7 November 2015.

Abstract

Mechanical limitations imposed on the articulators during speech production lead to a limitation of the intrinsic dimensionality of speech signals. This limitation leads to a specific neighborhood structure of speech sounds when they are represented in a high-dimensional feature space. We investigate whether phone classification can be improved by exploiting this neighborhood structure, by means of extended variants of conventional Linear Discriminant Analysis (LDA) based on manifold learning. In this extended LDA approach, the within-class and between-class scatter matrices are defined in terms of adjacency graphs. We compare extensions of LDA that use either a full adjacency graph or an adjacency graph defined in the neighborhood of the training observations. In addition, we apply different kernels for weighting the distances in the graphs, of which the Adaptive Kernel is proposed in this paper. Experiments with TIMIT show that while LDA algorithms that use the full adjacency graph do not outperform traditional LDA, the algorithms that exploit only local information provide significantly better results than traditional LDA. These improvements are not uniform across different broad phonetic classes, which suggests that the added value of the neighborhood structure is phone class dependent. The structure is represented by locally different densities in the neighborhood of feature vectors that are representative of a specific phone in a specific context.

© 2015 Elsevier B.V. All rights reserved.

Keywords: Phone classification; TIMIT; Manifold learning; Graph embedding framework; LDA-based dimensionality reduction

1. Introduction

The movements of articulators in the human speech production system are subject to mechanical and ballistic constraints. Due to these constraints the effective ‘intrinsic’ dimensionality of the set of acoustic features of speech signals is limited, even when these signals are represented in a high dimensional space. During the last decade several different attempts have been made to develop acoustic
representations of speech signals that benefit from the low intrinsic dimensionality, based on the insight that the local structure is dependent on the speech sound and its acoustic context, as determined by the temporal and spatial limitations imposed by the articulatory system. A number of approaches aimed at reestimating the movements of the vocal tract from the speech signals in the form of articulatory features (Frankel et al., 2007). Another research direction uses explicit parametric trajectories to capture the articulatory dynamics (Gish and Ng, 1996; Gong, 1997; Illina and Gong, 1997; Han et al., 2007; Zhao and Schultz, 2002), especially for vowels. The authors in Kim and Un (1997), Paliwal (1993), Wellekens (1987), Pinto et al. (2008), Russell (1993), Ostendorf et al. (1995), and

Yun and Oh (2002) attempted to model the temporal dynamics by using conditional probability distributions. All approaches mentioned above try to express the information about articulatory continuity explicitly. And most of these approaches, if not all, mainly or exclusively aim at improving the performance of some Automatic Speech Recognition (ASR) system. Other research directions aim at using machine learning approaches to benefit from the fact that the intrinsic dimensionality of speech signals is limited, instead of directly attempting to obtain explicit parametric representations of the articulatory dynamics. These approaches take a (very) high-dimensional representation as a starting point, due to the fact that they capture temporal dynamics by stacking a number of 10 ms frames of spectral features (MFCCs, PLPs, Mel energy spectra, etc.) (e.g., De Wachter et al., 2007; Gemmeke et al., 2011; Tahir et al., 2011). In order to appropriately represent articulatory dynamics at the level of a syllable, feature representations must span at least 250 ms, i.e. 25 frames with a rate of 100 frames per second (Hermansky, 2010). Using 13dimensional MFCCs, this yields a feature space of dimension 25  13 ¼ 325. To exploit the fact that the intrinsic dimensionality of the speech signals is much lower than 325, and to avoid the ‘curse of dimensionality’ (Beyer et al., 1999), some form of dimensionality reduction is required. For example, in conventional ASR, Linear Discriminant Analysis (LDA) (Fisher, 1936) (also known as Fisher Discriminant Analysis, FDA) has often been used to map high-dimensional stacks of MFCC features to lower-dimensional feature vectors, while maximizing the information that discriminates between phone models (e.g., Haeb-Umbach and Ney, 1992; Erdogan, 2005; Pylkko¨nen, 2006). However, while most of the previous research into exploiting the effects of the low dimensionality of the articulatory system was aimed at improving ASR, recently an interest has emerged in harnessing the results of machine learning approaches to establish links with the large store of phonetic and phonological knowledge (Jansen and Niyogi, 2013). Interestingly, the authors of Jansen and Niyogi (2013) point out that the machine learning community has developed multiple algorithms that aim to discover the underlying low-dimensional structure in data, but that with the exception of ISOMAP (Tenenbaum et al., 2000; ten Bosch et al., 2011) none of these algorithms has been tested on a realistic speech task. While the authors in Jansen and Niyogi (2013) focus attention on the class of machine learning approaches based on the Graph Laplacian and the Laplace–Beltrami operator (see e.g. Singer, 2006), we here focus on extensions of LDA that allow for manifold learning in relation to the use of adjacency graphs. As in Jansen and Niyogi (2013), the goal of our research is to advance knowledge about the underlying structure in speech signals; a corollary goal is to understand the degree to which LDA algorithms that preserve the local neighborhood relations in the speech data can uncover and exploit structure. In other words,

the main objective of our study is to investigate to what extent knowledge about the distributions of the acoustic representation of phones – expressed in the form of neighborhood structure or manifolds – might be exploited for speech signal processing. Our goal is to gain understanding, rather than developing a particular step in an ASR processing cascade with minimization of error rates as single aim. For that reason we focus on a task that is closely related to general classification problems, namely phone classification. We use the TIMIT corpus as the test platform (Garofolo, 1988). It is because of these goals that we decided not to pursue the extremely fruitful research line of using Deep Neural Networks (DNN), e.g. (Seide et al., 2011; Hinton et al., 2012). Although (Huang et al., 2014) showed that the relative phone error rate decreased by phone-dependent proportions between 15.6% and 39.8% when they replaced a GMM-based posterior probability estimator by a DNN-based system, the results do not provide insight in the phonetic structure. In this paper, our aim is to better understand the phonetic structure by investigating the local structure in the adjacency graph representation of extensions of LDA, which is difficult to achieve by using DNNs. Classical LDA assumes that all classes that must be distinguished obey a single and homoscedastic normal distribution. In the phone classification task this assumption is highly unlikely to be true: the high degree of variation in the speech production process, in combination with the coarticulation with surrounding phones, will make the distributions within the phone classes much more complex (Jurafsky et al., 2001). Therefore, it appears useful to extend the traditional LDA by taking into account the resulting substructure in the acoustic space. Because a substantial part of the variation is systematic, rather than random, the acoustic space occupied by the speech signal is likely to be structured along (possibly several) lowerdimensional manifolds. This manifold structure in the acoustic space (the space defined by the feature representation) is likely to result from the locally different densities in the neighborhood of feature vectors that are representative of a specific phone in a specific context. In Yan et al. (2007) it was demonstrated that the neighborhood structure can be expressed in terms of adjacency graphs, and that different extensions of classical LDA can be unified in a general graph embedding framework. In this paper we investigate whether and to what extent the LDA algorithms subsumed by the framework of adjacency graphs can harness the neighborhood structure to the benefit of the TIMIT phone classification task. In addition, we will propose a novel adaptive kernel (based on older, well-known kernels, see e.g. Abramson, 1982; Kim and Scott, 1992) to extend one of the most promising LDA algorithms, i.e. heteroscedastic linear discriminant analysis (HLDA) (Kumar and Andreou, 1998; Burget, 2004; Sakai et al., 2009). Feature frames are represented by a single high-dimensional vector created by stacking 23 consecutive 13-dimensional MFCC vectors,

which represent stretches of 230 ms of the speech signal. As in Neighborhood Components Analysis (NCA, Goldberger et al., 2004; Singh-Miller et al., 2007; SinghMiller and Collins, 2009), we use LDA for reducing the dimensionality of the feature frames, after which a k-nearest neighbor (k-NN) classifier is used to determine the most probable phone class for each feature frame. NCA derives a single transformation matrix that is applied to the feature frames irrespective of the neighborhood of the frames. In our approach the neighborhood structure plays a decisive role in determining the transformation matrix. However, NCA finds the transformation that optimizes the accuracy of the k-NN classifier in a leave-one-out experiment, while LDA optimizes a criterion that is surely related to, but not strictly dependent on, the accuracy of a k-NN classifier. The insight that speech (and image) data are characterized by manifolds, rather than by homoscedastic distributions in acoustic space has also been used in developing efficient coders. In Kambhatla and Leen (1997) it is shown that local Principal Component analysis (LPCA) yields excellent results in terms of reconstruction error for speech and image data. In this approach the training data are clustered, and PCA is performed in separate clusters. The combination of local linear PCAs can be seen as an accurate approximation of the non-linear manifold that support the data. That non-linear manifolds can be approximated by a mixture of linear models was already shown in Tipping and Bishop (1999), where probabilistic, rather than deterministic PCA analyzers are used. That a hard clustering of the data is not necessary was also shown in Chen et al. (2010), where the number of mixture components and their rank are inferred automatically from the data in the context of a compressive sensing application. However, all these papers are about coding, rather than about classification. Finally, it must be mentioned that graph-based LDA is part of a broader family of data processing methods that aim at preserving local neighborhood structure. These methods represent observations in terms of their distance to a (possibly small) number of neighbors. Local Linear Embedding (LLE) (Roweis and Saul, 2000) accomplishes a non-linear dimension reduction, similar to PCA. However, LLE maps all input observations into a single, lower-dimensional coordinate system. Thus, the method proposed in Roweis and Saul (2000) is not suitable for the purpose of this investigation. The remainder of this paper is organized as follows. In Section 2, we summarize the extensions of the classical LDA that make it possible to introduce neighborhood structure for representing the speech data. In Section 3 we briefly describe the overall architecture of the cascaded classifier used in the research. In Section 4 we describe the design of the experiments, the results of which are presented in Section 5. General discussions and conclusions of our work are presented in Sections 6 and 7.
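To make the front end just described concrete, the following is a minimal sketch (not the authors' code) of stacking 23 consecutive 13-dimensional MFCC frames, roughly 230 ms of signal at a rate of 100 frames per second, into a single feature vector. The array layout and the edge handling via clipping are illustrative assumptions.

```python
import numpy as np

def stack_frames(mfcc, center, context=11):
    """Stack 2*context + 1 consecutive MFCC frames around `center` into one vector.

    With 13-dimensional MFCCs and context=11 this gives 23 frames (~230 ms at a
    100 frames/s rate) and a 23 * 13 = 299-dimensional feature vector.
    """
    n_frames, _ = mfcc.shape
    # clipping at the utterance edges is an assumption made for this sketch
    idx = np.clip(np.arange(center - context, center + context + 1), 0, n_frames - 1)
    return mfcc[idx].reshape(-1)

# toy example: an utterance of 300 frames with 13 MFCCs each
mfcc = np.random.randn(300, 13)
print(stack_frames(mfcc, center=150).shape)   # (299,)
```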

2. Manifold learning based dimensionality reduction

2.1. The starting point: conventional LDA

The starting point of Linear Discriminant Analysis is a data set comprising n observations x_i ∈ R^D (i = 1, 2, ..., n); each observation x_i has a label c_{x_i} ∈ {1, 2, ..., C}, which denotes the class of x_i. The number of observations in a set that belong to class c is denoted by n_c. The conventional Fisher Discriminant Analysis (FDA) (Fisher, 1936) aims to find the projection matrix W ∈ R^{D×d} (d ≤ min(D, C)) such that the low-dimensional representations z_i ∈ R^d obtained by z_i = W^T x_i have maximum discriminative power between the classes. W is the matrix that maximizes the Fisher ratio (1):

$$\frac{\mathrm{tr}(W^T S_b W)}{\mathrm{tr}(W^T S_w W)} \qquad (1)$$

where S_w and S_b denote the within-class and between-class scatter matrices, respectively. These matrices can be obtained by accumulating pairwise scatter matrices, i.e., (x_i − x_j)(x_i − x_j)^T, to obtain (Sugiyama and Roweis, 2007; Yan et al., 2007):

$$S_w = \frac{1}{2}\sum_i \sum_j a^w_{ij}\,(x_i - x_j)(x_i - x_j)^T, \qquad a^w_{ij} = \begin{cases} 1/n_c & \text{if } c_{x_i} = c_{x_j} = c \text{ (same class)} \\ 0 & \text{if } c_{x_i} \neq c_{x_j} \text{ (different class)} \end{cases} \qquad (2)$$

$$S_b = \frac{1}{2}\sum_i \sum_j a^b_{ij}\,(x_i - x_j)(x_i - x_j)^T, \qquad a^b_{ij} = \begin{cases} 1/n - 1/n_c & \text{if } c_{x_i} = c_{x_j} = c \\ 1/n & \text{if } c_{x_i} \neq c_{x_j} \end{cases} \qquad (3)$$
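As an illustration of Eqs. (1)–(3), the sketch below (a hedged reconstruction, not the authors' implementation) accumulates the pairwise within- and between-class scatter matrices and obtains W as the leading generalized eigenvectors of (S_b, S_w), the usual way of maximizing a Fisher-type ratio; the small ridge term is an assumption added for numerical stability.

```python
import numpy as np
from scipy.linalg import eigh

def lda_scatter(X, y):
    """S_w and S_b of Eqs. (2)-(3), accumulated from pairwise affinities a_ij."""
    n, D = X.shape
    counts = {c: int(np.sum(y == c)) for c in np.unique(y)}
    Sw, Sb = np.zeros((D, D)), np.zeros((D, D))
    for i in range(n):
        for j in range(n):
            d = (X[i] - X[j])[:, None]
            S = d @ d.T
            if y[i] == y[j]:
                Sw += 0.5 * (1.0 / counts[y[i]]) * S            # a^w_ij = 1/n_c
                Sb += 0.5 * (1.0 / n - 1.0 / counts[y[i]]) * S  # a^b_ij = 1/n - 1/n_c
            else:
                Sb += 0.5 * (1.0 / n) * S                       # a^b_ij = 1/n
    return Sw, Sb

def lda_projection(X, y, d_out):
    Sw, Sb = lda_scatter(X, y)
    # maximize tr(W^T Sb W)/tr(W^T Sw W) via generalized eigenvectors of (Sb, Sw);
    # the small ridge keeps Sw positive definite (an assumption of this sketch)
    vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(Sw.shape[0]))
    order = np.argsort(vals)[::-1]
    return vecs[:, order[:d_out]]

X = np.random.randn(200, 10)
y = np.random.randint(0, 4, size=200)
W = lda_projection(X, y, d_out=3)
Z = X @ W   # low-dimensional representations z_i = W^T x_i
```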

The elements a^w_ij and a^b_ij make up the affinity matrices A_w and A_b, which determine the within-class and between-class scatter matrices. The entries in A_w and A_b are then interpreted in terms of weights in an adjacency graph (He and Niyogi, 2004). From Eqs. (2) and (3) it follows that all pairs of observations contribute to the affinity matrices and that the weights of these contributions only depend on whether the data points belong to the same class or not. In terms of adjacency graphs, all samples from the same class are fully connected to form a complete global neighborhood graph for the within-class scatter. Similarly, pairs of observations from different classes are fully connected to define the between-class scatter matrix. Since the classical FDA approach assigns equal weights to all pairs of data points (irrespective of their distance), it can be considered as a global approach. From the perspective of phone classification, however, it is not evident that a fully global approach is optimal. As mentioned in Section 1, different phones may have a different neighborhood structure. The different neighborhoods
can be accounted for by modifying the affinity matrices A_w and A_b in Eqs. (2) and (3) such that they can better capture and preserve the local structure. This can be accomplished by modifying the affinity matrices and the corresponding adjacency graphs.

2.2. Local variants of LDA

The idea of preserving local structure in linear dimensionality reduction by using an affinity matrix was first proposed in He and Niyogi (2004), and generalized in the graph-embedding framework in Yan et al. (2007). Local structure can be captured by assigning relatively larger weights to the connections of closer pairs, for example by defining the weights a^w_ij, a^b_ij in Eqs. (2) and (3) to be monotonically decreasing with the distance between the members of a pair (instead of being constant as in LDA). With ||x_i − x_j|| denoting the distance between a pair of observations this yields:

$$S_w = \frac{1}{2}\sum_i \sum_j a^w_{ij}\,(x_i - x_j)(x_i - x_j)^T, \qquad a^w_{ij} = \begin{cases} f^w(\|x_i - x_j\|)/m_c & \text{if } c_{x_i} = c_{x_j} = c \\ 0 & \text{if } c_{x_i} \neq c_{x_j} \end{cases} \qquad (4)$$

$$S_b = \frac{1}{2}\sum_i \sum_j a^b_{ij}\,(x_i - x_j)(x_i - x_j)^T, \qquad a^b_{ij} = \begin{cases} 0 & \text{if } c_{x_i} = c_{x_j} \\ f^b(\|x_i - x_j\|)/n & \text{if } c_{x_i} \neq c_{x_j} \end{cases} \qquad (5)$$

in which f^w(·) and f^b(·) are monotonically decreasing functions. For the purpose of normalization, the denominator m_c in Eq. (4) is introduced, which stands for the number of graph-connected observations within the same class c. In global LDA variants m_c becomes identical to n_c, the number of all observations in class c. The coefficients a^b_ij are set to zero when x_i and x_j are from the same class. This is different from classical LDA (cf. Eq. (3)). The definition in Eq. (5) is based on the idea that the distances of two points from the same class should not impact the estimate of the between-class scatter.¹ In the following subsections, we discuss the neighborhood properties of different definitions of the adjacency graphs and of the weights of the connections in these graphs.

¹ It can be shown mathematically that the setting a^b_ij = 0 in Eq. (5) leads to the same solution as conventional LDA with Eq. (3), under the condition that the class sizes are the same. Differences between the conventional LDA and the local LDA based on Eq. (5) are due to unequal class sizes. This is confirmed by the results presented in Section 5.

2.2.1. Connectivity around each point

In constructing the adjacency graphs, one can basically choose between two options:

- Complete graph: Both Local Fisher Discriminant Analysis (LFDA) (Sugiyama and Roweis, 2007) and its extension in the form of Globality-Locality Consistent Discriminant Analysis (GLCDA) (Huang et al., 2011) use a complete adjacency graph. Thus, the functions f^w and f^b in Eqs. (4) and (5) always yield a (possibly small) positive weight for each pair of observations.

- Nearest neighbor (NN) graph: Local Discriminant Embedding (LDE) (Chen et al., 2005) constructs a local (partial) adjacency graph. Each data point x_i is only connected to the Lw nearest neighbors from the same class and the Lb nearest neighbors in the other classes. Thus, the entire neighborhood of each point x is the union of two disjoint subsets N^w(x) and N^b(x):

$$N^w(x) = \{\, z \mid c_z = c_x,\ \|z - x\| < \|x^{L^w} - x\| \,\} \qquad (6)$$

$$N^b(x) = \{\, z \mid c_z \neq c_x,\ \|z - x\| < \|x^{L^b} - x\| \,\} \qquad (7)$$

where x^{Lw} denotes the Lw-th nearest neighbor of x in the subset of data points from the same class and x^{Lb} the Lb-th nearest neighbor of x in the subset of data points that belong to a different class. The functions f^w and f^b in Eqs. (4) and (5) are defined such that f^w(||z − x||) = 0 when z ∉ N^w(x) and f^b(||z − x||) = 0 when z ∉ N^b(x), effectively eliminating the corresponding edges.

Comparing the two approaches to construct the graph, the former uses the functions f^w and f^b to weigh all edges in a fully connected graph, while the latter directly defines the local structure by keeping only the connections between observations in a limited neighborhood. Both approaches deal with a trade-off between local and global structure, albeit in different ways, using different kernels.

2.2.2. Weighting the edges of the adjacency graph

The functions f^w and f^b in Eqs. (4) and (5) can be defined by means of different kernels.

- The trivial kernel (also called "Simple-Minded Kernel", He and Niyogi, 2004). This kernel assigns equal weights to all edges in the adjacency graph:

$$f^w(\|x_i - x_j\|) = 1 \qquad (8)$$

$$f^b(\|x_i - x_j\|) = 1 \qquad (9)$$

When applied to a fully-connected graph, this definition of the functions f^w and f^b yields an adjacency graph similar to the one in classical FDA: all points are connected and all connected pairs are considered equally important (Yan et al., 2007).

- Exponentially-Decaying Kernels. Many exponentially-decaying kernels can be defined. The authors in He and Niyogi (2004) proposed the so-called Heat Kernel as follows:

$$f^w(\|x_i - x_j\|) = \exp\!\left(-\|x_i - x_j\|^2 / t^w\right) \qquad (10)$$

$$f^b(\|x_i - x_j\|) = \exp\!\left(-\|x_i - x_j\|^2 / t^b\right) \qquad (11)$$

Two parameters tw, tb (tw, tb > 0) are used to balance the influence of the global and local structure. Smaller values of these parameters result in larger weights for close pairs, making the graph more "local". When one of these parameters goes to +∞, the corresponding function (f^w or f^b) will approximate 1, which means that all data pairs are considered equally important. If both f^w and f^b approximate 1, the resultant kernel approximates the Simple-Minded Kernel.

- The authors in Zelnik-Manor and Perona (2004) replaced the class-independent "t" parameters in Eqs. (10) and (11) by class-dependent scaling parameters σi and σj, arriving at:

$$f^w(\|x_i - x_j\|) = \exp\!\left(-\|x_i - x_j\|^2 / (\sigma^w_i \sigma^w_j)\right) \qquad (12)$$

$$f^b(\|x_i - x_j\|) = \exp\!\left(-\|x_i - x_j\|^2 / (\sigma^b_i \sigma^b_j)\right) \qquad (13)$$

$$\sigma^w_p = \|x_p - x_p^{k^w}\| \qquad (14)$$

$$\sigma^b_p = \|x_p - x_p^{k^b}\| \qquad (15)$$

where x_p^{kw} and x_p^{kb}, p = i or j, denote the kw-th nearest neighbor of x_p from the same class and the kb-th nearest neighbor of x_p in any other class. The normalization by the σ's makes the weights of the edges dependent on the density of the neighborhood of x_i. However, Eqs. (12) and (13) do not allow weighting the local and global structure differently, as in the Heat Kernel. To enable a different weighting of local and global structure, we introduce a novel Adaptive Kernel, as a generalization of the Heat Kernel, by introducing two exponents γw and γb:

$$f^w(\|x_i - x_j\|) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{(\sigma^w_i \sigma^w_j)^{\gamma^w}}\right) \qquad (16)$$

$$f^b(\|x_i - x_j\|) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{(\sigma^b_i \sigma^b_j)^{\gamma^b}}\right) \qquad (17)$$

where the parameters γw and γb play a similar role as the tw and tb in the Heat Kernel. Both the parameters γw, γb (in Eqs. (16) and (17)) as well as the parameters kw, kb (in Eqs. (14) and (15)) control the balance of local and global structure. If the values of k or γ increase, the balance is shifted from the local structure towards the global structure of the data, and vice versa. There is no value of kw and kb for which the products σ^w_i σ^w_j and σ^b_i σ^b_j approximate 1 in Eqs. (12) and (13). The reduction to the Heat Kernel with tw = tb = 1 can only be achieved by setting both γ's to zero.
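The following sketch shows one way Eqs. (6), (7) and (14)–(17) could be combined into within- and between-class affinity matrices. It is a hedged reconstruction, not the authors' implementation: the function name, the per-row neighbor selection and the default parameter values are illustrative assumptions, and the normalization by m_c and n of Eqs. (4) and (5) would be applied afterwards, when the scatter matrices are accumulated.

```python
import numpy as np
from scipy.spatial.distance import cdist

def adaptive_affinities(X, y, Lw=12, Lb=60, kw=7, kb=50, gamma_w=1.0, gamma_b=1.0):
    """Within/between-class affinities using the NN graphs of Eqs. (6)-(7)
    and the Adaptive Kernel weights of Eqs. (14)-(17)."""
    D = cdist(X, X)                                  # pairwise Euclidean distances
    same = (y[:, None] == y[None, :])
    far = D.max() + 1.0

    def kth_dist(mask, k):
        # distance from each point to its k-th nearest neighbor inside `mask`
        Dm = np.where(mask, D, far)
        np.fill_diagonal(Dm, far)
        return np.sort(Dm, axis=1)[:, k - 1]

    sig_w = kth_dist(same, kw)                       # sigma^w_p, Eq. (14)
    sig_b = kth_dist(~same, kb)                      # sigma^b_p, Eq. (15)
    rad_w = kth_dist(same, Lw)                       # radius of the within-class NN graph
    rad_b = kth_dist(~same, Lb)                      # radius of the between-class NN graph

    Aw = np.exp(-D**2 / (sig_w[:, None] * sig_w[None, :])**gamma_w)   # Eq. (16)
    Ab = np.exp(-D**2 / (sig_b[:, None] * sig_b[None, :])**gamma_b)   # Eq. (17)
    Aw *= same & (D <= rad_w[:, None])               # keep only the Lw same-class neighbors
    Ab *= (~same) & (D <= rad_b[:, None])            # keep only the Lb other-class neighbors
    np.fill_diagonal(Aw, 0.0)
    np.fill_diagonal(Ab, 0.0)
    return Aw, Ab

X = np.random.randn(300, 20)
y = np.random.randint(0, 5, size=300)
Aw, Ab = adaptive_affinities(X, y)
```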

2.2.3. An artificial example

The impact of the different kernels is difficult to present in a visual way. Fig. 1 shows these differences in a simplified artificial example for different combinations of the graph construction method (fully connected vs. Nearest Neighbor) and the kernel ("Simple-Minded" kernel vs. exponentially decaying kernel). To that end, a set of 100 points was randomly drawn from a two-dimensional Gaussian distribution. The number i on the horizontal axis in Fig. 1 refers to the i-th nearest neighbor measured from the mean of the set; along the vertical axis, the output of the four different kernel combinations is displayed. The graph construction methods and kernels shown are:

Fig. 1. Local weights as a function of between-token distance, according to four different combinations of graph construction methods and kernels. The data set is an artificial set consisting of 100 randomly generated observations from a two-dimensional normal distribution. For an explanation of the four different curves see the text. (For interpretation of color in this figure, the reader is referred to the web version of this article.)

- Complete Graph with Simple-Minded Kernel: This combination is depicted by the dotted line (black) in Fig. 1. In this setting all pairs obtain the same weight, regardless of the distance between the members. For the within-class scatter this setting is identical to classical LDA (Fisher, 1936).

- Complete Graph with Exponentially-Decaying Kernel: The behavior of the Heat Kernel with tw = 1.5 in Eq. (10) is shown by the dashed-dotted line (red). It can be observed that Exponentially-Decaying Kernels may result in a relatively heavy tail.

- Nearest Neighbor Graph with Simple-Minded Kernel: The dashed line (black) represents the output of the Simple-Minded Kernel after putting Lw in Eq. (6) (determining the size of the within-class adjacency graph) to 40. As a result the weights in the NN graph differ from those in the complete graph after the discontinuous transition at the 40th sample.

- Nearest Neighbor Graph with Exponentially-Decaying Kernel: this combination (solid line, red) emphasizes the importance of the closer neighbors. The weights start decreasing monotonically, and drop to zero at the 40th sample.

Fig. 2. Illustration of the integrated classifier. For all classifiers, the cascade consists of three different steps. The first step is the PCA step, which is present for all classifiers. The second step is either the conventional LDA, one out of six different dimensionality reduction algorithms (extensions of LDA/FDA), or the baseline in which there is no LDA step. The third (final) step consists of a weighted kNN (WkNN). The parameters on the arrows show the model parameters that must be optimized. In all cases, the parameters k, s of the WkNN in the third step must be optimized as well.
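In the spirit of the Fig. 1 example above, the short script below draws 100 two-dimensional Gaussian points and prints the weights assigned to the sorted neighbors of the sample mean under the four combinations just listed; tw = 1.5 and Lw = 40 follow the text, the remaining choices (random seed, printed positions) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))                         # 100 points from a 2-D Gaussian
d = np.sort(np.linalg.norm(X - X.mean(axis=0), axis=1))   # distance to the i-th neighbor of the mean
rank = np.arange(1, d.size + 1)

weights = {
    "complete graph, Simple-Minded": np.ones_like(d),
    "complete graph, Heat (t_w = 1.5)": np.exp(-d**2 / 1.5),
    "NN graph (L_w = 40), Simple-Minded": (rank <= 40).astype(float),
    "NN graph (L_w = 40), Heat (t_w = 1.5)": np.exp(-d**2 / 1.5) * (rank <= 40),
}
for name, w in weights.items():
    print(f"{name:40s} w[1]={w[0]:.3f}  w[40]={w[39]:.3f}  w[41]={w[40]:.3f}")
```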

3. The integrated classifier The classifier used for assigning a phone class to each input vector is shown in Fig. 2. The classifier is composed of three modules, connected in a pipeline. First, as in the classical approach in classification tasks with very high-dimensional features (Fidler et al., 2006; Sha, 2007; Halberstadt, 1998), Principal Component Analysis (PCA) (Hotelling, 1933) is performed on the input vectors for orthogonalization and removal of redundancy. We keep the 150 dimensions that together account for almost all of the variance in the data. In Yang and Yang (2003) it is shown that the PCA transformation does not eliminate information that helps to discriminate between the classes. Recent classification techniques, based on Deep Neural Networks (DNN) (Hinton et al., 2012), avoid this PCA step. This might question the importance or necessity of the PCA step. In our approach, the most important goal of the PCA step is dimension reduction and avoidance of redundancy. In general, it depends on the data whether a PCA step before an LDA is desirable. If the number of points is comparable to the number of dimensions, or smaller, a single-step LDA will usually overfit. Reducing the

dimensionality with PCA before LDA may prevent overfitting (the PCA step thereby acting as a regularization step) and may therefore increase the LDA stability and performance. Importantly, the PCA step increases the scalability of the classification problem when the embedding dimension gets very large; the class separating LDA-step can then be formulated in terms of a lower and more functional dimensionality due to the PCA step. It is useful to realize that also in the DNN approaches it is important to effectively deal with redundancy and efficiency in feature extraction. In these network approaches, the use of bottleneck features constitute an alternative way to serve this purpose. In the second step, the dimension of the PCA output feature vectors is further reduced (to 47, i.e. the number of phone classes minus 1) by means of one of the six dimensionality reduction algorithms (all FDA/LDA-extensions, two types of connectivity, and three kernels) described in Section 2.2.2 and shown in Fig. 2. The dimensionality reduction by means of LDA is supposed to enhance the separation between the 47 phone classes in the features space. To provide a baseline, we include a seventh option, which consists of not using any LDA transformation or other dimensionality reduction in the second step. In the third and final step, the reduced feature vectors are classified using a common back end, which is a weighted k-Nearest Neighbor classifier. The k-NN classifier is commonly used to evaluate the performance of dimensionality reduction algorithms (e.g. Sugiyama and Roweis, 2007; Yan et al., 2007). The weights in the k-NN classifier are determined by a Heat Kernel, which is used

to reduce the impact of points that are not in the direct neighborhood of an unlabeled data point. The label that is assigned to an unlabeled new point x is determined by a weighted majority vote among the labels in the neighborhood of x. If x_1, x_2, ..., x_k denote the k nearest neighbors of the observation x, then the weights of these k neighbors are specified as follows:

$$w_i = \exp\!\left(-\frac{\|x_i - x\|^2}{s}\right), \qquad i = 1, 2, \ldots, k \qquad (18)$$

with k denoting the number of nearest neighbors considered and s a scaling parameter that determines the impact of the distance between points on the weight. After evaluating Eq. (18) on all k points in the neighborhood of x, x is assigned the label of the class whose members yield the highest accumulated score (weighted majority score).

4. Data and acoustic processing

4.1. Data: TIMIT

We compare the local and global dimensionality reduction approaches by means of a phone classification task using the TIMIT (Garofolo, 1988) corpus. Although TIMIT is over 20 years old, it currently is still one of the most frequently used speech corpora for the purpose of phone classification and identification, as evidenced by the large number of papers at recent speech conferences that use TIMIT as the primary test bed for phone classification. TIMIT uses 61 different labels that cover all phones of American English. To evaluate the acoustic models, we mapped the 61 phone labels to 48 phones (in the same way as in Lee and Hon (1989)). The data set used to train all the classifiers is the standard NIST training set, which includes 462 speakers, 3696 utterances, and 139,852 phones. We also use the development set exactly as proposed in Halberstadt (1998), which comprises 50 speakers, 400 utterances, and 15,038 phone tokens. This set is referred to as set D. The core test set, containing 24 speakers, 192 utterances, and 7196 phone tokens, will be referred to as set C. Most papers on TIMIT phone classification report on the performance on set C, while using set D as development for tuning purposes. As will be explained in more detail in Section 5, we will also report results in which the roles of sets C and D are swapped, so that set C is used as a development set and set D is used as the independent test set. By doing so, we investigate the sensitivity of the different LDA-extensions to a specific data set.

4.2. Data preprocessing

To generate the basic feature vectors a Short-Time Fourier Transformation is performed every 10 ms with a 25 ms Hamming window. The spectra are then converted to 13 MFCCs (c_0, c_1, ..., c_12). To capture information from

the articulatory context, 23 frames are concatenated around the center frame of each phone. As a result, each segment is represented by the 11 preceding frames, the center frame itself, and 11 succeeding frames, which results in 13 × 23 = 299-dimensional feature vectors. We determined the maximum and minimum values of the 13 coefficients across the full training database and then used these values to map the coefficients into the interval [0, 1]. Next, the 299-dimensional features are orthogonalized and part of the redundancy is removed by means of Principal Component Analysis (PCA) (Fidler et al., 2006). The first 150 eigenvectors account for 97% of the variance in the original 299-dimensional vectors. The MFCC frames in the development and test sets were mapped into the (approximate) interval [0, 1] using the normalization coefficients obtained with the training set. Subsequently, the 299-dimensional stacks of normalized frames, centered around the middle of the segments, were projected into a 150-dimensional space by means of the PCA matrix obtained with the training data. The resulting 150-dimensional feature vectors, together with their corresponding labels offered by TIMIT, were then used to evaluate the eight dimensionality reduction algorithms shown in Fig. 2.

5. Experiments

In this section we present a number of experiments. In Section 5.1 we discuss an explorative grid search that was applied to determine the optimal parameter setting for each classifier. The criterion in the search was the eventual classification performance. In Section 5.2 we present and discuss the classification results obtained with the different dimensionality reduction methods. In Section 5.3 we investigate possible differences of the impact of dimensionality reduction on classification performance in different broad phonetic classes.

5.1. Explorative grid search

As can be seen from Fig. 2, most of the classification approaches used here depend on two or more parameters. In order to optimize these, we tuned the parameters of the LDA extensions summarized in Fig. 2, in conjunction with the parameters (k, s) of the WkNN classifier. For this parameter optimization, we used a two-step strategy. In the first step, a coarse grid was applied to determine the upper and lower bounds of each of the parameters between which the optimal parameter values could be expected. In the second step, we performed a search on a finer grid based on the lower-/upper-bounds found in the first step. The fine grid search was skipped if the results for the coarse grid points did not show significantly different values.

Traditional LDA and Complete Graph with Simple-Minded Kernel: except for k, s in the weighted k-NN, these methods do not require any parameter optimization. For the optimization of k and s, see below.

Complete Graph with Heat Kernel: in order to balance the importance of the local and global structure, the parameters tw and tb were first coarsely investigated using a range 0 ≤ tw, tb ≤ 5 with a step size 0.25 for both parameters. The fine grid was defined as 1.5 ≤ tw ≤ 2.5, 0.8 ≤ tb ≤ 2.2, using a step size 0.05 for both parameters.

Complete Graph with Adaptive Kernel: The coarse grid search was done for γw and γb in the range 0 ≤ γw, γb ≤ 2 with a step size 0.05 for both parameters. Next, the fine grid search focused on the intervals 0.9 ≤ γw ≤ 1.5 (using a step size of 0.02) and 0.3 ≤ γb ≤ 1.3 (using a step size of 0.05). As mentioned in Section 2.2.2, the Adaptive Kernel contains two additional parameters kw and kb, which also affect the trade-off between local and global structure. In Huang et al. (2011) we found that the best performance could be obtained by optimizing the γ's with fixed values of the two k parameters. Therefore, we set kw = 7 (as recommended in Zelnik-Manor and Perona (2004)) and kb = 50, based on the ratio of the total number of tokens and individual classes.

Nearest Neighbor Graph with Simple-Minded Kernel: The parameters Lw and Lb in Eqs. (6) and (7) specify the number of nearest neighbors to be included in the adjacency graph for the within-class and between-class scatter, respectively. To be able to discover the impact of small local neighborhoods, we explored the range of Lw, Lb starting from small natural numbers and it was found that 5 ≤ Lw ≤ 16, 10 ≤ Lb ≤ 100 (using step sizes of 1 and 5 respectively) were the relevant ranges.

Nearest Neighbor Graph with Heat Kernel: A coarse grid search was defined by 0 ≤ tw, tb ≤ 5 (with a step size 0.25 for both parameters). No significant differences were obtained, so the fine grid search was skipped.

Nearest Neighbor Graph with Adaptive Kernel: In this case, the coarse grid was defined by 0 ≤ γw, γb ≤ 2 with a step size 0.05, while kw = 7 and kb = 50 were fixed. Also in this case, the coarse grid did not give rise to a fine search in a restricted interval.

5.1.1. Weighted k-NN classifier

For the weighted k-NN classifier two parameters (k, s) must be optimized. For this classifier, the classification performance was explored on a coarse grid defined by the parameter ranges 1 ≤ k ≤ 60 and 1 ≤ s ≤ 8, using a step size of 1 for both k and s. From this search we concluded that the parameter ranges 15 ≤ k ≤ 40 and 3 ≤ s ≤ 7 were the most promising intervals. With smaller step size for s equal to 0.25 the performance of all LDA variants under study appeared not to vary significantly. Therefore, we chose k = 25 and s = 4.5 for the weighted k-NN classifier in the remainder of this paper. The optimal parameter values obtained from this grid search are summarized in Table 1 for optimization using set D and in Table 2 for optimization using set C.

5.2. Performance comparison: phone classification accuracy

In this section, we compare the classification methods shown in Fig. 2. We will do so by comparing the classification accuracies that are obtained by using the different methods in combination with the parameter settings as specified in Tables 1 and 2. The results obtained with optimizing on set D and testing on the core test set C are shown in the column "Set C" in Table 3; the results obtained with optimization on set C and testing on set D are in the column "Set D". The top row in that table can be considered as a baseline performance, which is obtained with the 150-dimensional PCA vectors – that is, without any discriminative dimensionality reduction. From this table, a number of conclusions can be drawn. Firstly, it can be seen that all dimensionality reduction methods yield much higher accuracy than a plain weighted k-NN classification based on the 150-dimensional vectors retained after PCA. This shows that reducing the dimension of the high-dimensional feature vector of MFCC stacks leads to improved results in all cases. Second, the difference between conventional LDA and Local LDA with the complete graph and Simple-Minded Kernel is negligible. Thus, the difference between the definition of the between-class scatter in Eqs. (3) and (5) does not seem to have a substantial impact in our data sets. This suggests that the graph-based LDA is primarily determined by the local acoustic structure. Third, it appears that the methods using exponentially-decaying kernels outperform LDA with the Simple-Minded Kernel, again suggesting that the use of local structure of the data is beneficial. Fourth, when swapping the roles of set C and set D, that is, testing on set D and tuning on set C, the performance of all three NN Graph methods is significantly better than the best Complete Graph method. Despite the fact that the differences between the Complete Graph and the NN Graph methods on set C are not statistically significant, we see a similar trend: Limiting the adjacency graph to a relatively small

Table 1
Eventual optimal parameter settings (based on using a fine grid search) for each local method obtained by optimization on set D.

Algorithm                              | tw   | tb   | γw   | γb   | Lw | Lb
Complete Graph with Heat Kernel        | 1.85 | 1.45 | –    | –    | –  | –
Complete Graph with Adaptive Kernel    | –    | –    | 0.94 | 1.22 | –  | –
NN Graph with Simple-Minded Kernel     | –    | –    | –    | –    | 12 | 60
NN Graph with Heat Kernel              | 1.50 | 2.50 | –    | –    | 10 | 60
NN Graph with Adaptive Kernel          | –    | –    | 1.10 | 1.30 | 12 | 55

Table 2
Eventual optimal parameter settings (based on using a fine grid search) for each local method obtained by optimization on set C.

Algorithm                              | tw   | tb   | γw   | γb   | Lw | Lb
Complete Graph with Heat Kernel        | 2.15 | 1.35 | –    | –    | –  | –
Complete Graph with Adaptive Kernel    | –    | –    | 1.16 | 1.02 | –  | –
NN Graph with Simple-Minded Kernel     | –    | –    | –    | –    | 8  | 75
NN Graph with Heat Kernel              | 1.25 | 2.50 | –    | –    | 11 | 55
NN Graph with Adaptive Kernel          | –    | –    | 1.40 | 1.25 | 12 | 85

Table 3
Classification accuracies. The column "Set C" shows the accuracy on the core test set C, using set D for development. The column "Set D" shows the accuracy when testing on set D, and using set C for development.

Algorithm                                  | Set D        | Set C
150 dimensions after PCA                   | 68.92 ± 0.74 | 68.12 ± 1.08
Traditional LDA                            | 74.26 ± 0.70 | 73.66 ± 1.02
Complete Graph with Simple-Minded Kernel   | 74.03 ± 0.70 | 73.49 ± 1.02
Complete Graph with Heat Kernel            | 75.01 ± 0.69 | 74.82 ± 1.00
Complete Graph with Adaptive Kernel        | 75.30 ± 0.69 | 74.97 ± 1.00
NN Graph with Simple-Minded Kernel         | 76.93 ± 0.67 | 75.58 ± 0.99
NN Graph with Heat Kernel                  | 76.99 ± 0.67 | 75.56 ± 0.99
NN Graph with Adaptive Kernel              | 76.91 ± 0.67 | 75.61 ± 0.99
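The ± margins in Table 3 behave like half-widths of 95% binomial confidence intervals (normal approximation) over the number of test tokens; a quick check, assuming the 15,038 tokens of set D and the 7196 tokens of set C reported in Section 4.1:

```python
import math

def ci95(acc_percent, n_tokens):
    """Half-width (in percentage points) of a 95% normal-approximation binomial CI."""
    p = acc_percent / 100.0
    return 100.0 * 1.96 * math.sqrt(p * (1.0 - p) / n_tokens)

print(round(ci95(74.26, 15038), 2))   # ~0.70 (Traditional LDA, set D)
print(round(ci95(73.66, 7196), 2))    # ~1.02 (Traditional LDA, set C)
```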

Fig. 3. Surface plot pertaining to set D. The figure shows the performance gain of the NN graph method with the Simple-Minded Kernel over the LDA method as a function of Lw and Lb. Numbers on the z-axis are proportions. For a more detailed description see the text.

neighborhood both for within and between class data points is a more effective way of capturing local structure than exponentially decreasing the weights of the distances in the complete graph. Also, the fact that in both test sets the performance with the Simple-Minded Kernel in the NN Graph methods does not differ significantly from the performance with the exponentially decaying kernels corroborates the finding that limiting the graph is more effective than weighting the distances. 5.2.1. The impact of Lw and Lb in the Nearest Neighbor Graph Table 3 shows that the NN-graph methods provide the best performance. More interesting than the optimal results

themselves may be the insight in how sensitive the performance of the cascaded classification systems is to changes in the trade-off between local and global information. The decisive parameters in these methods are Lw and Lb , that are used to construct the nearest neighbor sets around each point. Therefore, we investigated the impact of these parameters on the performance in more detail. Fig. 3 shows a surface plot of the performance on set D (the condition in which the advantage of the NN method over classical LDA and the Complete Graph methods was largest) as a function of Lw (on the X-axis) and Lb (on the Y-axis). On the Z-axis (vertical) is shown the difference in classification accuracy between the NN-method with Simple-Minded Kernel and the conventional LDA

method. To aid interpretation, the figure includes two additional transparent horizontal planes. The lower one is located at z = 0, and represents the performance level of the conventional LDA method. The upper plane is introduced to make the regions with superior performance visually conspicuous. In addition, contour plots are shown on the basis plane. Both the contour plots and the regions where the performance is significantly better than classical LDA in Fig. 3 show that the best performance is reached for values of Lw and Lb that are roughly specified by the relation Lb ≈ 5 × Lw, with a minimum value of Lw = 7. This relation can be interpreted in an interesting manner: it is not surprising that the between-class neighborhood should be substantially larger than the within-class neighborhood. It can be seen that the performance of the NN Graph method drops below classical LDA when either neighborhood becomes very small (meaning that only very few data points are considered) or when the size of the between-class neighborhood is no longer substantially larger than the within-class neighborhood.

5.2.1.1. Stability of the optimal parameters. Tables 1 and 2 suggest both similarities and differences between the parameter values that yield optimal results when using set D or set C for parameter tuning. As can be seen from the intervals in which no significant performance differences were observed for the parameters tw, tb, γw, γb in the exponentially decaying kernels, the differences between the optimal values for these parameters are probably due to idiosyncratic numerical properties of the data sets. With respect to the values of Lw, Lb in the NN Graph methods it might seem that there is a real difference between the two sets, especially for the NN Graph with Simple-Minded Kernel, where the best result suggests that Lb
should be almost 10 times as large as Lw for set C. However, as can be seen from Fig. 4, there seem to be two regions in which a performance that is significantly better than the classical LDA can be obtained: One region is similar to what we have seen in set D (i.e., optimal performance if Lb ≈ 5 × Lw, and another region at relatively high values of Lb, combined with small values of Lw). We suspect that the finding that the absolute maximum happened to be in the region with high values of Lb is caused by the relatively small size of the test set. To check this, we performed a grid search for the optimal values of Lw, Lb and the Simple-Minded Kernel for the combined test sets. The results (not shown here) indeed clearly confirmed that overall the best performance is achieved for Lb ≈ 5 × Lw.

5.3. Performance comparison: broad phonetic class confusion

The overall accuracy is a rather crude way to assess classification methods. Specifically, overall performance does not allow us to draw conclusions about the potential role of manifolds in the data space that might be modeled more accurately by LDA variants which do take local structure into account. Therefore, we investigate how a local method performs in distinguishing broad phonetic classes and in separating phones within the individual broad classes. This analysis should show how local methods trade between-broad-class accuracy (a problem in which local manifolds are probably not very important) for within-broad-class accuracy (where local structure might well be important). As a representative of the local methods, the NN-graph combined with the Simple-Minded Kernel is chosen, because of its simplicity and superiority over other local methods. To enhance comparability with previous results

Fig. 4. Surface plot pertaining to set C. The figure shows the performance gain of the NN graph method with the Simple-Minded Kernel over the LDA method as a function of Lw and Lb . For a more detailed description see the text.

Table 4
Confusion matrix defined in terms of broad phonetic classes. Seven broad classes have been defined: plosives (PL), strong fricatives (SF), weak fricatives (WF), nasals/flap (NS), semi-vowels (Se-V), short vowels (Sh-V) and long vowels (Lo-V). The accuracies are provided in terms of percentages. Each cell contains two accuracies: the first result is based on the local method (NN graph and Simple-Minded Kernel), the second result (between parentheses) is based on conventional LDA (the number with ± refers to the 95% confidence interval). The columns refer to the 'real' reference label while the rows refer to the hypothesized label. In each diagonal cell, the bold number refers to the best classifier in that cell. All silences have been excluded from the analysis.

      | PL              | SF              | WF              | NS              | Se-V            | Sh-V            | Lo-V
PL    | 96.1 (96.2±1.2) | 1.6 (1.8)       | 4.1 (4.3)       | 0.2 (0.3)       | 0.6 (0.5)       | 0.2 (0.2)       | 0.5 (0.4)
SF    | 1.8 (1.3)       | 95.2 (95.2±1.6) | 2.4 (4.3)       | 0.0 (0.0)       | 0.0 (0.0)       | 0.0 (0.0)       | 0.0 (0.0)
WF    | 1.1 (1.3)       | 3.1 (2.8)       | 88.4 (84.2±4.4) | 1.4 (2.1)       | 0.8 (0.7)       | 0.2 (0.2)       | 0.5 (1.0)
NS    | 0.8 (1.8)       | 0.0 (0.0)       | 0.7 (1.4)       | 95.1 (92.6±1.7) | 2.3 (2.7)       | 1.4 (1.9)       | 1.8 (1.8)
Se-V  | 0.1 (0.0)       | 0.0 (0.0)       | 3.8 (4.7)       | 2.2 (2.9)       | 85.2 (81.9±2.4) | 2.0 (3.0)       | 10.8 (11.4)
Sh-V  | 0.0 (0.1)       | 0.0 (0.0)       | 0.3 (0.7)       | 0.5 (1.7)       | 4.7 (5.4)       | 85.8 (83.9±1.9) | 17.0 (18.1)
Lo-V  | 0.1 (0.1)       | 0.0 (0.2)       | 0.3 (0.4)       | 0.7 (0.3)       | 6.4 (8.3)       | 11.1 (11.5)     | 69.3 (67.3±2.9)
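A minimal sketch of how per-token phone decisions could be collapsed into a broad-class confusion matrix of the kind shown in Table 4, with columns (reference classes) normalized to percentages. The partial phone-to-class mapping below is only illustrative; the full 48-phone mapping follows Halberstadt and Glass (1997).

```python
import numpy as np

# illustrative (partial) mapping from phone labels to broad classes; the full
# 48-phone mapping follows Halberstadt and Glass (1997)
BROAD = {"p": "PL", "t": "PL", "k": "PL", "s": "SF", "z": "SF", "f": "WF",
         "m": "NS", "n": "NS", "l": "Se-V", "r": "Se-V", "ih": "Sh-V",
         "eh": "Sh-V", "iy": "Lo-V", "aa": "Lo-V"}
CLASSES = ["PL", "SF", "WF", "NS", "Se-V", "Sh-V", "Lo-V"]

def broad_confusion(ref_phones, hyp_phones):
    """Confusion matrix over broad classes: columns = reference, rows = hypothesis,
    each column normalized to percentages as in Table 4."""
    idx = {c: i for i, c in enumerate(CLASSES)}
    M = np.zeros((len(CLASSES), len(CLASSES)))
    for ref, hyp in zip(ref_phones, hyp_phones):
        M[idx[BROAD[hyp]], idx[BROAD[ref]]] += 1.0
    col_totals = np.maximum(M.sum(axis=0, keepdims=True), 1.0)
    return 100.0 * M / col_totals

ref = ["p", "s", "m", "ih", "iy", "l", "p"]
hyp = ["p", "z", "n", "eh", "aa", "r", "t"]
print(np.round(broad_confusion(ref, hyp), 1))
```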

in the literature, we perform the analysis for the core test set (set C), using the optimal parameter values Lw ¼ 12, Lb ¼ 60 as in Table 1. The classification results are presented in two tables. In Table 4 the 48 phone classes are categorized into seven broad phonetic classes (Halberstadt and Glass, 1997; Reynolds and Antoniou, 2003): plosives (PL), strong fricatives (SF), weak fricatives (WF), nasals (NS), semi-vowels (Se-V), short vowels (Sh-V) and long vowels (Lo-V). In Table 4, each cell contains the result of the NN Graph method and (in parentheses) the result of the traditional LDA. The diagonal also shows the 5% two-sided confidence interval in the binomial distribution. Columns represent the reference label (true label) and the rows represent the label assigned by the classifiers. The cells on the diagonal represent correct classification rates of the broad phonetic classes, while the off-diagonal cells refer to the confusions between different broad phone classes. The table shows that for nearly all broad phone classes, the NN Graph method outperforms the conventional LDA method, with the plosives as an exception. The improvements for NS, Se-V and Sh-V are significant, while the improvement for WF and Lo-V approach significance. The three classes that show significant improvement have in common that they contain voiced phonemes which are characterized by substantial dynamic changes in the 230 ms interval covered by the 23-frame blocks. Nasals are characterized by changes in the vocal tract shape as well as changes in the degree of acoustic coupling with the nasal tract. Semi-vowels are also characterized by opening/closing movements of the vocal tract, even if no full closure is obtained. Short vowels, which have an average duration of no more than 100 ms, will have full transitions from the preceding and into the succeeding consonant in the 230 ms interval. The absence of a significant improvement for the long vowels, which include the inherently dynamic diphthongs, may be due to the fact that the diphthongs make up less than half of the tokens in that class. If the gain in the NN-Graph method is related to a more accurate representation of the dynamics in a 230 ms interval, one would also expect a substantial improvement for the plosives. However, if anything, the performance for the plosives

deteriorates (albeit non-significantly). It remains to be seen whether this is due to the fact that the plosives do not form ‘compact’ manifolds; it might also be because MFCC coefficients, which are derived from 25 ms windows, do not capture the details of the extremely rapid transitions in the releases. MFCCs may also miss part of the relevant detail in fricatives, where a larger proportion of the information is captured by the residual signal that is no longer available. Table 5 contains the results for within-class confusions. The error rate (in percentage) within the individual broad phonetic classes obtained with the NN-Graph classifier is shown in the upper row; the classification performance obtained with the classical LDA is shown in the bottom row, together with the p-values obtained from a McNemar test of the significance of the differences between the two LDA variants. There are marginal and non-significant differences in both directions. The only significant difference is found for the strong fricatives. A detailed look at the data showed that the voiced/voiceless confusion between /s/ and /z/ is reduced when using the NN graph: specifically /z/ is less often classified as /s/. 6. Discussion The main objective of our study was to investigate to what extent knowledge about the distributions of the acoustic representation of phones – expressed in the form of neighborhood structure or manifolds – might be exploited for speech signal processing. Our goal is to gain understanding, rather than developing a particular step in an ASR processing cascade with minimization of error rates as single aim. To that end, we applied a frequently used, admittedly small-scale phone classification task. In addition, we investigated whether a newly designed extension of a Linear Discriminant Analysis (LDA) variant outperforms the less elaborated LDA variant on which it was based. It appears that reducing the original 299-dimensional acoustic space to 47 dimensions by means of a cascade of PCA and conventional LDA yields a significant improvement over reducing the original acoustic space to 150

Table 5
Phone error rates (in percentages) per broad phonetic class. Top row: NN graph with Simple-Minded Kernel; bottom row: conventional LDA. The numbers in parentheses in the bottom row are p-values obtained by a McNemar test.

Method   | PL          | SF          | WF         | NS          | Se-V       | Sh-V        | Lo-V
NN graph | 18.9        | 14.4        | 4.7        | 17.9        | 7.7        | 29.2        | 7.2
LDA      | 17.4 (0.22) | 17.4 (0.04) | 4.4 (0.75) | 20.5 (0.15) | 9.8 (0.06) | 28.8 (0.48) | 8.2 (0.13)
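For reference, a self-contained sketch of an exact (binomial) McNemar test on the per-token correctness of two classifiers, the kind of test reported in Table 5. The two-sided tail computation and the toy data are illustrative assumptions, not the authors' exact procedure.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value from per-token correctness of two classifiers."""
    n10 = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    n01 = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n, k = n10 + n01, min(n10, n01)
    # binomial tail with p = 0.5 on the discordant pairs, doubled for a two-sided test
    p = 2.0 * sum(comb(n, i) for i in range(k + 1)) / 2.0**n
    return min(1.0, p)

# toy example: 100 tokens, the two classifiers disagree on 14 of them
corr_a = [True] * 90 + [False] * 10
corr_b = [True] * 80 + [False] * 10 + [True] * 4 + [False] * 6
print(round(mcnemar_exact(corr_a, corr_b), 3))   # ~0.18
```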

dimensions by means of PCA alone. In addition, the graphbased LDA algorithms used in this paper were able to obtain a substantial additional gain in performance compared to using conventional LDA. The difference between conventional LDA and graph-based LDA with the fullyconnected graph and Simple-Minded Kernel is negligible, but with the two exponentially-decaying kernels (Heat and Adaptive) performance did improve. The best results were obtained with the LDA variants that use only small local neighborhoods. However, there is no difference between the performance with the three kernels used for weighting the neighborhood distances: the ‘SimpleMinded’ Kernel, the Heat Kernel and the Adaptive Kernel. Thus, it seems that the added complexity involved in the newly proposed Adaptive kernel cannot be justified by an improved accuracy. However, we can argue that the stability of the improvement for the small local-neighborhood variant of the LDA across different kernels now convincingly shows that the exact type of metric used in graphbased LDA is less relevant than the distinction between global and local information. Fig. 1 provides a clue for explaining this result. When using relatively small neighborhoods, as in this paper, the difference between equal weights and exponentiallydecaying weights may turn out to be too small to have a significant effect. It remains to be seen whether the Heat kernel and/or the Adaptive kernel can improve performance in tasks where the optimal neighborhood size is much larger. Another way for formulating the explanation for the small difference between the different kernels in the case of NN graphs is that the locality information is already sufficiently captured by using partial neighborhoods in the NN graph construction. An additional cause for the failure of the exponentially decaying kernels to yield improvements may be the density of the population of the samples. The value of the decaying exponentials as a function of distance from a specific point x decreases smoothly and slowly for the optimal values of the parameters (cf. Tables 1 and 2). Therefore, the edge weights obtained by using an Exponentially-Decaying kernel (Heat Kernel and Adaptive Kernel) do not differ much from those based on a SimpleMinded Kernel, especially if the neighborhood is densely populated. We investigated the performance of algorithms in two cases, in which the role of the set C and set D were swapped. The performance gain achieved for set D is larger than that for the TIMIT core test set (set C). This difference is most likely due to the differences in design of these sets. All sentences in the core test (set C) are different from the

sentences in the training set and the development (set D) (Halberstadt, 1998). This makes for optimal independence between the test material on the one hand and the training/tuning material on the other. However, there is some overlap between the sentences in the training set and the development set. Therefore, the test set is not completely independent of the training material (even if it does not overlap with the material in set C used for tuning). Moreover, set D is twice as large as set C, leading to more narrow confidence intervals in the statistical analyses. At the same time, it is encouraging to see that the optimal values of the parameters in the dimensionality reduction methods do not differ greatly between the two sets. This strengthens the belief that the local dimensionality reduction methods model structure in the acoustic space that is related to speech production, rather than simply better modeling the details of a specific data set. An analysis of the phone-phone confusions indicates that preserving the neighborhood structure of a small number of nearest neighbors helps to achieve a higher classification accuracy. This finding is potentially useful for an alternative approach for discriminative acoustic modeling in automatic speech recognition systems. Discriminative training in ASR mostly refers to modeling techniques in which acoustic models obtained by means of EM training are updated to minimize recognition errors. The local LDA approaches presented in this paper show how neighborhood structure in the feature space can be exploited to obtain posterior probabilities of the sub-word units, which might replace the likelihoods obtained with more conventional discriminative acoustic models. Our attempts to harness the manifold structure in a high-dimensional acoustic space method by means of LDA extensions is related to approaches based on Neighborhood Components Analysis (NCA). From the results in Singh-Miller et al. (2007), it might be inferred that NCA can yield a slightly better classification performance than manifold learning by means of graph-based LDA. However, NCA leaves us with a single transformation matrix that is optimized in terms of class separation, but it is difficult to interpret this matrix in terms of the neighborhood structure in the data, which was one of the goals of our research. In addition, our analysis of the confusions between and within broad phonetic classes suggest that a single transformation matrix is not an optimal solution. Our findings are in line with findings about the neighborhood structure in the acoustic data (and the phonedependency thereof) as reported in Jansen and Niyogi (2013).

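The NCA alternative mentioned above is readily available in scikit-learn; the toy fragment below (with random stand-in data rather than the TIMIT features) merely illustrates the point that NCA delivers a single global transformation matrix rather than an interpretable neighborhood structure. The output dimensionality of 40 is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.neighbors import NeighborhoodComponentsAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 299))      # stand-in for stacked acoustic feature vectors
y = rng.integers(0, 48, size=500)    # stand-in for the 48 phone labels

nca = NeighborhoodComponentsAnalysis(n_components=40, random_state=0)
X_low = nca.fit_transform(X, y)      # low-dimensional features
print(nca.components_.shape)         # (40, 299): one global linear transformation
```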

From the very start we intended to use phone classification accuracy as the criterion measure. This explains the focus on LDA-based algorithms for dimensionality reduction. However, there is substantial overlap between the 48 phones in the acoustic space. As a consequence, attempts to discriminate between indiscriminable labels may not be the best way of discovering the underlying manifold structure in acoustic representations of speech signals. Therefore, it is worth investigating whether unsupervised methods for discovering lower-dimensional manifolds, such as mixtures of principal component analyzers (Tipping and Bishop, 1999) or factor analyzers (Chen et al., 2010), might in the end provide a more accurate approximation of the manifold structure. The resulting low-dimensional representations might provide a more versatile input for a classifier trained to perform phone classification.

7. Conclusions

Graph-based extensions of LDA are able to model the local structure of data embedded in a high-dimensional feature space. We showed the impact of several extensions of LDA that make it possible to capture local structure in the acoustic feature space for the purpose of phone classification; the local structure is captured by using neighborhood graphs in the dimensionality reduction algorithms. We compared two methods for constructing the graphs (fully or partially connected) and three methods for weighting the edges (uniform weights and two ways of making the weights inversely dependent on the distance between observations). The recognition tokens (phones in the TIMIT database) were represented by stacks of 23 consecutive 13-dimensional feature vectors centered around the phone's midpoint as specified by the phone segmentation.

We found that local dimensionality reduction yields a small but significant improvement over conventional LDA. On the one hand, local methods using a nearest neighbor graph (NN-graph) outperformed methods that account for local structure by applying exponentially decaying weights to the edges in a fully connected graph. On the other hand, the performance of the NN-graph methods could not be improved by replacing the uniform kernel by an exponentially decaying kernel.

Further analysis of the NN-graph approach showed that near-optimal performance can be achieved for several combinations of the two key parameters, the numbers of nearest neighbors used for the within-class and between-class scatters. In addition, the results show that the number of between-class neighbors should be about five times larger than the number of within-class neighbors.

The improvement obtained by local dimensionality reduction is not equal for all broad phonetic classes. This suggests that the local manifold structure depends on the phonetic class, and that class-dependent dimensionality reduction methods can most probably further improve phone classification performance.
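To make the graph-based construction summarized above concrete, the following sketch builds a within-class and a between-class k-NN graph with uniform edge weights, forms the corresponding graph-weighted scatter matrices, and solves the resulting generalized eigenvalue problem. It is an illustration under stated assumptions rather than the authors' implementation: the names k_within and k_between, the ridge term, and the Laplacian-based formulation of the scatters are ours, and the target dimensionality is an arbitrary example value.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def local_graph_lda(X, y, k_within=10, k_between=50, n_dims=40):
    """X: (n, d) stacked feature vectors (e.g. 23 frames x 13 coefficients = 299 dims);
    y: (n,) phone labels; k_between is roughly five times k_within, as found above.
    Assumes every class contains at least k_within examples."""
    n, d = X.shape
    d2 = cdist(X, X, metric="sqeuclidean")
    np.fill_diagonal(d2, np.inf)
    same = y[:, None] == y[None, :]

    W_w = np.zeros((n, n))                        # within-class (intrinsic) graph
    W_b = np.zeros((n, n))                        # between-class (penalty) graph
    for i in range(n):
        d_same = np.where(same[i], d2[i], np.inf)
        d_diff = np.where(same[i], np.inf, d2[i])
        W_w[i, np.argsort(d_same)[:k_within]] = 1.0   # uniform (Simple-Minded) weights
        W_b[i, np.argsort(d_diff)[:k_between]] = 1.0
    W_w = np.maximum(W_w, W_w.T)                  # symmetrize both graphs
    W_b = np.maximum(W_b, W_b.T)

    def graph_scatter(W):
        # sum_ij W_ij (x_i - x_j)(x_i - x_j)^T  =  2 X^T (D - W) X
        L = np.diag(W.sum(axis=1)) - W
        return 2.0 * X.T @ L @ X

    S_w = graph_scatter(W_w)
    S_b = graph_scatter(W_b)
    S_w += 1e-6 * (np.trace(S_w) / d) * np.eye(d)     # ridge keeps S_w positive definite

    vals, vecs = eigh(S_b, S_w)                   # generalized eigenvalue problem
    order = np.argsort(vals)[::-1][:n_dims]       # directions with the largest ratio
    return vecs[:, order]                         # (d, n_dims) projection matrix
```

With fully connected graphs and uniform weights this construction behaves essentially like conventional LDA, consistent with the negligible difference reported in the discussion; the local character comes entirely from restricting the graphs to small neighborhoods.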

Acknowledgements

The authors would like to thank Jort F. Gemmeke (then affiliated to Katholieke Universiteit Leuven) and Yang Liu (Yale University) for their helpful suggestions on this paper. The research of Heyun Huang received funding from the European Community's Seventh Framework Programme [FP7] Initial Training Network SCALE, under Grant agreement No. 213850. Louis ten Bosch received funding from the FP7-SME project OPTI-FOX, project reference 262266.

References

Abramson, I.S., 1982. On bandwidth variation in kernel estimates – a square root law. Ann. Statist. 10 (4), 1217–1223.
Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U., 1999. When is nearest neighbor meaningful? In: Proceedings of the International Conference on Database Theory, pp. 217–235.
Burget, L., 2004. Combination of speech features using smoothed heteroscedastic linear discriminant analysis. In: Proceedings of Interspeech, pp. 2549–2552.
Chen, H.-T., Chang, H.-W., Liu, T.-L., 2005. Local discriminant embedding and its variants. In: Proceedings of Computer Vision and Pattern Recognition, pp. 846–853.
Chen, M., Silva, J., Paisley, J., Wang, C., Dunson, D., Carin, L., 2010. Compressive sensing on manifolds using a nonparametric mixture of factor analyzers: algorithm and performance bounds. IEEE Trans. Signal Process. 58 (12), 6140–6155.
De Wachter, M., Matton, M., Demuynck, K., Wambacq, P., Cools, R., Van Compernolle, D., 2007. Template-based continuous speech recognition. IEEE Trans. Audio, Speech, Lang. Process. 15, 1377–1390.
Erdogan, H., 2005. Regularizing linear discriminant analysis for speech recognition. In: Proceedings of Interspeech, pp. 3021–3024.
Fidler, S., Skocaj, D., Leonardis, A., 2006. Combining reconstructive and discriminative subspace methods for robust classification and regression by subsampling. IEEE Trans. Patt. Anal. Mach. Intell. 28, 337–350.
Fisher, R.A., 1936. The use of multiple measurements in taxonomic problems. Ann. Eugen. 7, 179–188.
Frankel, J., Wester, M., King, S., 2007. Articulatory feature recognition using dynamic Bayesian networks. Comp. Speech Lang. 21 (4), 620–640.
Garofolo, J.S., 1988. Getting Started with the DARPA TIMIT CD-ROM: An Acoustic Phonetic Continuous Speech Database. National Institute of Standards and Technology (NIST), Gaithersburg, MD.
Gemmeke, J., Virtanen, T., Hurmalainen, A., 2011. Exemplar-based sparse representations for noise robust automatic speech recognition. IEEE Trans. Audio, Speech, Lang. Process. 99, 2067–2080.
Gish, H., Ng, K., 1996. Parametric trajectory models for speech recognition. In: Proceedings of the International Conference of Acoustics, Speech and Signal Processing, pp. 466–469.
Goldberger, J., Roweis, S., Hinton, G., Salakhutdinov, R., 2004. Neighbourhood components analysis. In: Proceedings of Neural Information Processing Systems, pp. 13–18.
Gong, Y., 1997. Stochastic trajectory modeling and sentence searching for continuous speech recognition. IEEE Trans. Speech Audio Process. 5, 33–44.
Haeb-Umbach, R., Ney, H., 1992. Linear discriminant analysis for improved large-vocabulary continuous-speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing, San Francisco, CA, pp. 9–12.
Halberstadt, A.K., Glass, J.R., 1997. Heterogeneous acoustic measurements for phonetic classification. In: Proceedings of Eurospeech, pp. 401–404.

Halberstadt, A.K., 1998. Heterogeneous Acoustic Measurements and Multiple Classifiers for Speech Recognition. Ph.D. thesis, MIT.
Han, Y., de Veth, J., Boves, L., 2007. Trajectory clustering for solving the trajectory folding problem in automatic speech recognition. IEEE Trans. Audio, Speech, Lang. Process. 15 (4), 1425–1434.
He, X., Niyogi, P., 2004. Locality preserving projections. In: Proceedings of Neural Information Processing Systems.
Hermansky, H., 2010. History of modulation spectrum in ASR. In: Proceedings of the International Conference of Acoustics, Speech and Signal Processing, pp. 5458–5461.
Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., Kingsbury, B., 2012. Deep neural networks for acoustic modelling in speech recognition. IEEE Signal Process. Magaz., 82–97.
Hotelling, H., 1933. Analysis of a complex of statistical variables into principal components. J. Edu. Psychol. 24, 417–441, 498–520.
Huang, H., Liu, Y., Gemmeke, J., ten Bosch, L., Cranen, B., Boves, L., 2011. Globality-locality consistent discriminant analysis for phone classification. In: Proceedings of Interspeech.
Huang, Y., Yu, D., Liu, C., Gong, Y., 2014. A comparative analytic study on the Gaussian mixture and context dependent deep neural network hidden Markov models. In: Proceedings of Interspeech.
Illina, I., Gong, Y., 1997. Elimination of trajectory folding phenomenon: HMM, trajectory mixture HMM and mixture stochastic trajectory model. In: Proceedings of the International Conference of Acoustics, Speech and Signal Processing.
Jansen, A., Niyogi, P., 2013. Intrinsic spectral analysis. IEEE Trans. Signal Process. 61, 1698–1710.
Jurafsky, D., Ward, W., Zhang, J., Herold, K., Yu, X., Zhang, S., 2001. What kind of pronunciation variation is hard for triphones to model? In: Proceedings of the International Conference of Acoustics, Speech and Signal Processing, pp. 577–580.
Kambhatla, N., Leen, T., 1997. Dimension reduction by local principal component analysis. Neural Comput. 9, 1493–1516.
Kim, J., Scott, C., 1992. Variable kernel density estimation. Ann. Statist. 20, 1236–1265.
Kim, N., Un, C., 1997. Frame-correlated Hidden Markov Model based on extended logarithmic pool. IEEE Trans. Audio, Speech, Lang. Process. 5, 149–160.
Kumar, N., Andreou, A., 1998. Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition. Speech Commun. 26, 283–297.
Lee, K.F., Hon, H.W., 1989. Speaker-independent phone recognition using HMMs. IEEE Trans. Acoust., Speech Signal Process. 37 (11), 1641–1648.
Ostendorf, M., Digalakis, V., Kimball, O.A., 1995. From HMMs to segment models: a unified view of stochastic modeling for speech recognition. IEEE Trans. Speech Audio Process. 4, 360–378.
Paliwal, K.K., 1993. Use of temporal correlation between successive frames in a hidden Markov model based speech recognizer. In: Proceedings of the International Conference of Acoustics, Speech and Signal Processing, pp. 215–218.
Pinto, J., Yegnanarayana, B., Hermansky, H., Magimai Doss, M., 2008. Exploiting contextual information for improved phoneme recognition. In: Proceedings of the International Conference of Acoustics, Speech and Signal Processing.
Pylkkönen, J., 2006. LDA based feature estimation methods for LVCSR. In: Proceedings of Interspeech, pp. 389–392.


Reynolds, T., Antoniou, C., 2003. Experiments in speech recognition using a modular MLP architecture for acoustic modelling. Inf. Sci., 39–54.
Roweis, S.T., Saul, L.K., 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290 (5500), 2323–2326. http://dx.doi.org/10.1126/science.290.5500.2323.
Russell, M., 1993. A segmental HMM for speech pattern modelling. In: Proceedings of the International Conference of Acoustics, Speech and Signal Processing, pp. 499–502.
Sakai, M., Kitaoka, N., Takeda, K., 2009. Feature transformation based on discriminant analysis preserving local structure for speech recognition. In: Proceedings of the International Conference of Acoustics, Speech and Signal Processing, pp. 3813–3816.
Seide, F., Li, G., Chen, X., Yu, D., 2011. Feature engineering in context-dependent deep neural networks for conversational speech transcription. In: Proceedings of ASRU 2011. IEEE.
Sha, F., 2007. Large Margin Training of Acoustic Models for Speech Recognition. Ph.D. thesis, University of Pennsylvania.
Singer, A., 2006. From graph to manifold Laplacian: the convergence rate. Appl. Comput. Harmon. Anal. 21, 128–134.
Singh-Miller, N., Collins, M., 2009. Learning label embeddings for nearest-neighbor multi-class classification with an application to speech recognition. In: Proceedings of Neural Information Processing Systems.
Singh-Miller, N., Collins, M., Hazen, T.J., 2007. Dimensionality reduction for speech recognition using neighborhood components analysis. In: Proceedings of Interspeech, pp. 1158–1161.
Sugiyama, M., Roweis, S., 2007. Dimensionality reduction of multimodal labeled data by local Fisher Discriminant Analysis. J. Mach. Learn. Res. 8, 1027–1061.
Tahir, M., Schlueter, R., Ney, H., 2011. Log-linear optimization of second-order polynomial features with subsequent dimension reduction for speech recognition. In: Proceedings of Interspeech.
ten Bosch, L., Hämäläinen, A., Ernestus, M., 2011. Assessing acoustic reduction: exploiting local structure in speech. In: Proceedings of Interspeech, Florence, Italy, pp. 2665–2668.
Tenenbaum, J.B., de Silva, V., Langford, J.C., 2000. A global geometric framework for nonlinear dimensionality reduction. Science 290 (5500), 2319–2323. http://dx.doi.org/10.1126/science.290.5500.2319.
Tipping, M.E., Bishop, C.M., 1999. Mixtures of probabilistic principal component analysers. Neural Comput. 11 (2), 443–482.
Wellekens, C., 1987. Explicit time correlation in Hidden Markov Models for speech recognition. In: Proceedings of the International Conference of Acoustics, Speech and Signal Processing.
Yan, S., Xu, D., Zhang, B., Zhang, H.-J., Yang, Q., Lin, S., 2007. Graph embedding and extension: a general framework for dimensionality reduction. IEEE Trans. Patt. Anal. Mach. Intell. 29, 40–51.
Yang, J., Yang, J., 2003. Why can LDA be performed in PCA transformed space? Patt. Recog. 36 (2), 563–566. http://dx.doi.org/10.1016/S0031-3203(02)00048-1.
Yun, Y., Oh, Y., 2002. A segmental-feature HMM for continuous speech recognition based on a parametric trajectory model. Speech Commun. 38, 115–130.
Zelnik-Manor, L., Perona, P., 2004. Self-tuning spectral clustering. In: Proceedings of Neural Information Processing Systems, pp. 1601–1608.
Zhao, B., Schultz, T., 2002. Toward robust parametric trajectory segmental model for vowel recognition. In: Proceedings of the International Conference of Acoustics, Speech and Signal Processing.