
ScienceDirect Speech Communication 76 (2016) 28–41 www.elsevier.com/locate/specom

Phone classification via manifold learning based dimensionality reduction algorithms

Heyun Huang, Louis ten Bosch, Bert Cranen, Lou Boves
CLST/CLS, Radboud University, Nijmegen, Netherlands

Received 8 May 2014; received in revised form 30 September 2015; accepted 29 October 2015; available online 7 November 2015

Abstract

Mechanical limitations imposed on the articulators during speech production limit the intrinsic dimensionality of speech signals. This limitation leads to a specific neighborhood structure of speech sounds when they are represented in a high-dimensional feature space. We investigate whether phone classification can be improved by exploiting this neighborhood structure, by means of extended variants of conventional Linear Discriminant Analysis (LDA) based on manifold learning. In this extended LDA approach, the within-class and between-class scatter matrices are defined in terms of adjacency graphs. We compare extensions of LDA that use either a full adjacency graph or an adjacency graph defined in the neighborhood of the training observations. In addition, we weigh the distances in the graphs via different kernels, among them the Adaptive Kernel proposed in this paper. Experiments with TIMIT show that while LDA algorithms that use the full adjacency graph do not outperform traditional LDA, the algorithms that exploit only local information provide significantly better results than traditional LDA. These improvements are not uniform across the broad phonetic classes, which suggests that the added value of the neighborhood structure is phone-class dependent. The structure is represented by locally different densities in the neighborhood of feature vectors that are representative of a specific phone in a specific context.

© 2015 Elsevier B.V. All rights reserved.

Keywords: Phone classification; TIMIT; Manifold learning; Graph embedding framework; LDA-based dimensionality reduction

1. Introduction

The movements of articulators in the human speech production system are subject to mechanical and ballistic constraints. Due to these constraints the effective 'intrinsic' dimensionality of the set of acoustic features of speech signals is limited, even when these signals are represented in a high-dimensional space. During the last decade several different attempts have been made to develop acoustic

E-mail addresses: [email protected] (H. Huang), l.ten[email protected] ru.nl (L. ten Bosch), [email protected] (B. Cranen), [email protected] (L. Boves).
http://dx.doi.org/10.1016/j.specom.2015.10.005
0167-6393/© 2015 Elsevier B.V. All rights reserved.

representations of speech signals that benefit from the low intrinsic dimensionality, based on the insight that the local structure depends on the speech sound and its acoustic context, as determined by the temporal and spatial limitations imposed by the articulatory system. A number of approaches aimed at re-estimating the movements of the vocal tract from the speech signals in the form of articulatory features (Frankel et al., 2007). Another research direction uses explicit parametric trajectories to capture the articulatory dynamics (Gish and Ng, 1996; Gong, 1997; Illina and Gong, 1997; Han et al., 2007; Zhao and Schultz, 2002), especially for vowels. The authors in Kim and Un (1997), Paliwal (1993), Wellekens (1987), Pinto et al. (2008), Russell (1993), Ostendorf et al. (1995), and


Yun and Oh (2002) attempted to model the temporal dynamics by using conditional probability distributions. All approaches mentioned above try to express the information about articulatory continuity explicitly, and most of them, if not all, aim mainly or exclusively at improving the performance of some Automatic Speech Recognition (ASR) system.

Other research directions use machine learning approaches to benefit from the fact that the intrinsic dimensionality of speech signals is limited, instead of directly attempting to obtain explicit parametric representations of the articulatory dynamics. These approaches take a (very) high-dimensional representation as a starting point, because they capture temporal dynamics by stacking a number of 10 ms frames of spectral features (MFCCs, PLPs, Mel energy spectra, etc.) (e.g., De Wachter et al., 2007; Gemmeke et al., 2011; Tahir et al., 2011). In order to appropriately represent articulatory dynamics at the level of a syllable, feature representations must span at least 250 ms, i.e. 25 frames at a rate of 100 frames per second (Hermansky, 2010). Using 13-dimensional MFCCs, this yields a feature space of dimension 25 × 13 = 325. To exploit the fact that the intrinsic dimensionality of the speech signals is much lower than 325, and to avoid the 'curse of dimensionality' (Beyer et al., 1999), some form of dimensionality reduction is required. For example, in conventional ASR, Linear Discriminant Analysis (LDA) (Fisher, 1936) (also known as Fisher Discriminant Analysis, FDA) has often been used to map high-dimensional stacks of MFCC features to lower-dimensional feature vectors, while maximizing the information that discriminates between phone models (e.g., Haeb-Umbach and Ney, 1992; Erdogan, 2005; Pylkkönen, 2006).
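The frame stacking that produces such a 325-dimensional space can be sketched in a few lines of numpy. This is an illustrative sketch, not code from the paper; in particular, clamping the indices at the utterance edges is our own assumption:

```python
import numpy as np

def stack_frames(mfcc, context=12):
    """Stack 2*context + 1 consecutive frames into one vector per frame.

    mfcc: array of shape (n_frames, 13). Edge frames are handled by
    clamping the indices (our own convention, not taken from the paper).
    """
    n, d = mfcc.shape
    idx = np.arange(n)[:, None] + np.arange(-context, context + 1)[None, :]
    idx = np.clip(idx, 0, n - 1)
    return mfcc[idx].reshape(n, d * (2 * context + 1))

mfcc = np.random.randn(100, 13)   # 1 s of speech at 100 frames per second
stacked = stack_frames(mfcc)      # 25 frames of 13 MFCCs per vector
print(stacked.shape)              # (100, 325)
```

With context=12 this yields the 25-frame, 325-dimensional vectors mentioned above; Section 4.2 instead uses 23 frames (299 dimensions).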
However, while most of the previous research into exploiting the effects of the low dimensionality of the articulatory system was aimed at improving ASR, recently an interest has emerged in harnessing the results of machine learning approaches to establish links with the large store of phonetic and phonological knowledge (Jansen and Niyogi, 2013). Interestingly, the authors of Jansen and Niyogi (2013) point out that the machine learning community has developed multiple algorithms that aim to discover the underlying low-dimensional structure in data, but that, with the exception of ISOMAP (Tenenbaum et al., 2000; ten Bosch et al., 2011), none of these algorithms has been tested on a realistic speech task. While the authors in Jansen and Niyogi (2013) focus attention on the class of machine learning approaches based on the Graph Laplacian and the Laplace–Beltrami operator (see e.g. Singer, 2006), we here focus on extensions of LDA that allow for manifold learning in relation to the use of adjacency graphs. As in Jansen and Niyogi (2013), the goal of our research is to advance knowledge about the underlying structure in speech signals; a corollary goal is to understand the degree to which LDA algorithms that preserve the local neighborhood relations in the speech data can uncover and exploit structure. In other words,


the main objective of our study is to investigate to what extent knowledge about the distributions of the acoustic representation of phones – expressed in the form of neighborhood structure or manifolds – might be exploited for speech signal processing. Our goal is to gain understanding, rather than to develop a particular step in an ASR processing cascade with minimization of error rates as the single aim. For that reason we focus on a task that is closely related to general classification problems, namely phone classification, and we use the TIMIT corpus as the test platform (Garofolo, 1988). It is because of these goals that we decided not to pursue the extremely fruitful research line of Deep Neural Networks (DNN), e.g. Seide et al. (2011) and Hinton et al. (2012). Although Huang et al. (2014) showed that the relative phone error rate decreased by phone-dependent proportions between 15.6% and 39.8% when a GMM-based posterior probability estimator was replaced by a DNN-based system, those results do not provide insight into the phonetic structure. In this paper, our aim is to better understand the phonetic structure by investigating the local structure in the adjacency graph representation of extensions of LDA, which is difficult to achieve with DNNs.

Classical LDA assumes that all classes that must be distinguished obey a single, homoscedastic normal distribution. In the phone classification task this assumption is highly unlikely to be true: the high degree of variation in the speech production process, in combination with the coarticulation with surrounding phones, makes the distributions within the phone classes much more complex (Jurafsky et al., 2001). Therefore, it appears useful to extend traditional LDA by taking the resulting substructure in the acoustic space into account.
Because a substantial part of the variation is systematic, rather than random, the acoustic space occupied by the speech signal is likely to be structured along (possibly several) lower-dimensional manifolds. This manifold structure in the acoustic space (the space defined by the feature representation) is likely to result from the locally different densities in the neighborhood of feature vectors that are representative of a specific phone in a specific context. In Yan et al. (2007) it was demonstrated that the neighborhood structure can be expressed in terms of adjacency graphs, and that different extensions of classical LDA can be unified in a general graph-embedding framework. In this paper we investigate whether and to what extent the LDA algorithms subsumed by the framework of adjacency graphs can harness the neighborhood structure to the benefit of the TIMIT phone classification task. In addition, we will propose a novel adaptive kernel (based on older, well-known kernels, see e.g. Abramson, 1982; Kim and Scott, 1992) to extend one of the most promising LDA algorithms, i.e. heteroscedastic linear discriminant analysis (HLDA) (Kumar and Andreou, 1998; Burget, 2004; Sakai et al., 2009). Feature frames are represented by a single high-dimensional vector created by stacking 23 consecutive 13-dimensional MFCC vectors,



which represent stretches of 230 ms of the speech signal. As in Neighborhood Components Analysis (NCA; Goldberger et al., 2004; Singh-Miller et al., 2007; Singh-Miller and Collins, 2009), we use LDA for reducing the dimensionality of the feature frames, after which a k-nearest neighbor (k-NN) classifier is used to determine the most probable phone class for each feature frame. NCA derives a single transformation matrix that is applied to the feature frames irrespective of the neighborhood of the frames; in our approach the neighborhood structure plays a decisive role in determining the transformation matrix. However, NCA finds the transformation that optimizes the accuracy of the k-NN classifier in a leave-one-out experiment, while LDA optimizes a criterion that is related to, but not strictly dependent on, the accuracy of a k-NN classifier.

The insight that speech (and image) data are characterized by manifolds, rather than by homoscedastic distributions in acoustic space, has also been used in developing efficient coders. In Kambhatla and Leen (1997) it is shown that local Principal Component Analysis (LPCA) yields excellent results in terms of reconstruction error for speech and image data. In this approach the training data are clustered, and PCA is performed in the separate clusters. The combination of local linear PCAs can be seen as an accurate approximation of the non-linear manifold that supports the data. That non-linear manifolds can be approximated by a mixture of linear models was already shown in Tipping and Bishop (1999), where probabilistic, rather than deterministic, PCA analyzers are used. That a hard clustering of the data is not necessary was shown in Chen et al. (2010), where the number of mixture components and their rank are inferred automatically from the data in the context of a compressive sensing application. However, all these papers are about coding, rather than about classification.
Finally, it must be mentioned that graph-based LDA is part of a broader family of data processing methods that aim at preserving local neighborhood structure. These methods represent observations in terms of their distance to a (possibly small) number of neighbors. Local Linear Embedding (LLE) (Roweis and Saul, 2000) accomplishes a non-linear dimension reduction, similar to PCA. However, LLE maps all input observations into a single, lower-dimensional coordinate system. Thus, the method proposed in Roweis and Saul (2000) is not suitable for the purpose of this investigation.

The remainder of this paper is organized as follows. In Section 2, we summarize the extensions of the classical LDA that make it possible to introduce neighborhood structure for representing the speech data. In Section 3 we briefly describe the overall architecture of the cascaded classifier used in the research. In Section 4 we describe the design of the experiments, the results of which are presented in Section 5. General discussions and conclusions of our work are presented in Sections 6 and 7.

2. Manifold learning based dimensionality reduction

2.1. The starting point: conventional LDA

The starting point of Linear Discriminant Analysis is a data set comprising n observations x_i ∈ R^D (i = 1, 2, ..., n); each observation x_i has a label c_{x_i} ∈ {1, 2, ..., C}, which denotes the class of x_i. The number of observations in a set that belong to class c is denoted by n_c. Conventional Fisher Discriminant Analysis (FDA) (Fisher, 1936) aims to find the projection matrix W ∈ R^{D×d} (d ≤ min(D, C)) such that the low-dimensional representations z_i ∈ R^d obtained by z_i = W^T x_i have maximum discriminative power between the classes. W is the matrix that maximizes the Fisher ratio (1):

    tr(W^T S_b W) / tr(W^T S_w W)        (1)

where S_w and S_b denote the within-class and between-class scatter matrices, respectively. These matrices can be obtained by accumulating pairwise scatter matrices (x_i − x_j)(x_i − x_j)^T (Sugiyama and Roweis, 2007; Yan et al., 2007):

    S_w = (1/2) Σ_i Σ_j a^w_ij (x_i − x_j)(x_i − x_j)^T
    a^w_ij = 1/n_c   if c_{x_i} = c_{x_j} = c (same class)        (2)
           = 0       if c_{x_i} ≠ c_{x_j} (different class)

    S_b = (1/2) Σ_i Σ_j a^b_ij (x_i − x_j)(x_i − x_j)^T
    a^b_ij = 1/n − 1/n_c   if c_{x_i} = c_{x_j} = c               (3)
           = 1/n           if c_{x_i} ≠ c_{x_j}
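In code, the affinity weights and scatter construction of Eqs. (2) and (3) look as follows. This is an illustrative numpy sketch, not the authors' implementation, and the toy data are arbitrary:

```python
import numpy as np

def lda_affinities(labels):
    """Affinity matrices A_w, A_b of Eqs. (2) and (3).

    a_ij^w = 1/n_c for a same-class pair (class c), else 0;
    a_ij^b = 1/n - 1/n_c for a same-class pair, else 1/n.
    """
    labels = np.asarray(labels)
    n = len(labels)
    same = labels[:, None] == labels[None, :]
    n_c = np.bincount(labels)[labels]        # class size n_c per observation
    A_w = np.where(same, 1.0 / n_c[:, None], 0.0)
    A_b = np.where(same, 1.0 / n - 1.0 / n_c[:, None], 1.0 / n)
    return A_w, A_b

def scatter(X, A):
    """S = 1/2 * sum_ij A_ij (x_i - x_j)(x_i - x_j)^T."""
    D = X[:, None, :] - X[None, :, :]        # all pairwise differences
    return 0.5 * np.einsum('ij,ijk,ijl->kl', A, D, D)

X = np.random.randn(30, 4)
y = np.random.randint(0, 3, 30)
A_w, A_b = lda_affinities(y)
S_w, S_b = scatter(X, A_w), scatter(X, A_b)
```

With these particular weights, S_w + S_b equals the total scatter of the data, which is a convenient sanity check on the pairwise formulation.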

The elements a^w_ij and a^b_ij make up the affinity matrices A_w and A_b, which determine the within-class and between-class scatter matrices. The entries in A_w and A_b are then interpreted in terms of weights in an adjacency graph (He and Niyogi, 2004). From Eqs. (2) and (3) it follows that all pairs of observations contribute to the affinity matrices and that the weights of these contributions only depend on whether the data points belong to the same class or not. In terms of adjacency graphs, all samples from the same class are fully connected to form a complete global neighborhood graph for the within-class scatter. Similarly, pairs of observations from different classes are fully connected to define the between-class scatter matrix. Since the classical FDA approach assigns equal weights to all pairs of data points (irrespective of their distance), it can be considered a global approach.

From the perspective of phone classification, however, it is not evident that a fully global approach is optimal. As mentioned in Section 1, different phones may have a different neighborhood structure. The different neighborhoods


can be accounted for by modifying the affinity matrices A_w and A_b in Eqs. (2) and (3), and the corresponding adjacency graphs, such that they better capture and preserve the local structure.

2.2. Local variants of LDA

The idea of preserving local structure in linear dimensionality reduction by using an affinity matrix was first proposed in He and Niyogi (2004), and generalized in the graph-embedding framework in Yan et al. (2007). Local structure can be captured by assigning relatively larger weights to the connections of closer pairs, for example by defining the weights a^w_ij, a^b_ij in Eqs. (2) and (3) as monotonically decreasing with the distance between the members of a pair (instead of being constant as in LDA). With ||x_i − x_j|| denoting the distance between a pair of observations, this yields:

    S_w = (1/2) Σ_i Σ_j a^w_ij (x_i − x_j)(x_i − x_j)^T
    a^w_ij = f^w(||x_i − x_j||)/m_c   if c_{x_i} = c_{x_j} = c        (4)
           = 0                        if c_{x_i} ≠ c_{x_j}

    S_b = (1/2) Σ_i Σ_j a^b_ij (x_i − x_j)(x_i − x_j)^T
    a^b_ij = 0                        if c_{x_i} = c_{x_j}            (5)
           = f^b(||x_i − x_j||)/n     if c_{x_i} ≠ c_{x_j}

in which f^w(·) and f^b(·) are monotonically decreasing functions. For the purpose of normalization, the denominator m_c in Eq. (4) is introduced; it stands for the number of graph-connected observations within the same class c. In global LDA variants m_c becomes identical to n_c, the number of all observations in class c. The coefficients a^b_ij are set to zero when x_i and x_j are from the same class, which differs from classical LDA (cf. Eq. (3)). The definition in Eq. (5) is based on the idea that the distance between two points from the same class should not impact the estimate of the between-class scatter.(1)

(1) It can be shown mathematically that setting a^b_ij = 0 in Eq. (5) leads to the same solution as conventional LDA with Eq. (3), under the condition that the class sizes are equal. Differences between conventional LDA and the local LDA based on Eq. (5) are due to unequal class sizes. This is confirmed by the results presented in Section 5.

In the following subsections, we discuss the neighborhood properties of different definitions of the adjacency graphs and of the weights of the connections in these graphs.

2.2.1. Connectivity around each point

In constructing the adjacency graphs, one can basically choose between two options:


Complete graph: Both Local Fisher Discriminant Analysis (LFDA) (Sugiyama and Roweis, 2007) and its extension in the form of Globality-Locality Consistent Discriminant Analysis (GLCDA) (Huang et al., 2011) use a complete adjacency graph. Thus, the functions f^w and f^b in Eqs. (4) and (5) always yield a (possibly small) positive weight for each pair of observations.

Nearest neighbor (NN) graph: Local Discriminant Embedding (LDE) (Chen et al., 2005) constructs a local (partial) adjacency graph. Each data point x_i is only connected to the L^w nearest neighbors from the same class and the L^b nearest neighbors in the other classes. Thus, the entire neighborhood of each point x is the union of two disjoint subsets N^w(x) and N^b(x):

    N^w(x) = {z | c_z = c_x, ||z − x|| ≤ ||x^{L^w} − x||}        (6)
    N^b(x) = {z | c_z ≠ c_x, ||z − x|| ≤ ||x^{L^b} − x||}        (7)

where x^{L^w} denotes the L^w-th nearest neighbor of x in the subset of data points from the same class and x^{L^b} the L^b-th nearest neighbor of x in the subset of data points that belong to a different class. The functions f^w and f^b in Eqs. (4) and (5) are defined such that f^w(||z − x||) = 0 when z ∉ N^w(x) and f^b(||z − x||) = 0 when z ∉ N^b(x), effectively eliminating the corresponding edges.

Comparing the two approaches to constructing the graph: the former uses the functions f^w and f^b to weigh all edges in a fully connected graph, while the latter directly defines the local structure by keeping only the connections between observations in a limited neighborhood. Both approaches deal with a trade-off between local and global structure, albeit in different ways, using different kernels.

2.2.2. Weighting the edges of the adjacency graph

The functions f^w and f^b in Eqs. (4) and (5) can be defined by means of different kernels.

The trivial kernel (also called "Simple-Minded Kernel", He and Niyogi, 2004). This kernel assigns equal weights to all edges in the adjacency graph:

    f^w(||x_i − x_j||) = 1        (8)
    f^b(||x_i − x_j||) = 1        (9)

When applied to a fully-connected graph, this definition of the functions f^w and f^b yields an adjacency graph similar to the one in classical FDA: all points are connected and all connected pairs are considered equally important (Yan et al., 2007).

Exponentially-Decaying Kernels. Many exponentially-decaying kernels can be defined. The authors in He and Niyogi (2004) proposed the so-called Heat Kernel as follows:

    f^w(||x_i − x_j||) = exp(−||x_i − x_j||^2 / t^w)        (10)
    f^b(||x_i − x_j||) = exp(−||x_i − x_j||^2 / t^b)        (11)

Two parameters t^w, t^b (t^w, t^b > 0) are used to balance the influence of the global and the local structure. Smaller values of these parameters result in larger weights for close pairs, making the graph more "local". When one of these parameters goes to +∞, the corresponding function (f^w or f^b) approximates 1, which means that all data pairs are considered equally important. If both f^w and f^b approximate 1, the resulting kernel approximates the Simple-Minded Kernel.

The authors in Zelnik-Manor and Perona (2004) replaced the class-independent "t" parameters in Eqs. (10) and (11) by class-dependent scaling parameters σ_i and σ_j, arriving at:

    f^w(||x_i − x_j||) = exp(−||x_i − x_j||^2 / (σ^w_i σ^w_j))        (12)
    f^b(||x_i − x_j||) = exp(−||x_i − x_j||^2 / (σ^b_i σ^b_j))        (13)

    σ^w_p = ||x^{k^w}_p − x_p||        (14)
    σ^b_p = ||x^{k^b}_p − x_p||        (15)

where x^{k^w}_p and x^{k^b}_p, p = i or j, denote the k^w-th nearest neighbor of x_p from the same class and the k^b-th nearest neighbor of x_p in any other class. The normalization by the σ's makes the weights of the edges dependent on the density of the neighborhood of x_i. However, Eqs. (12) and (13) do not allow weighting the local and global structure differently, as in the Heat Kernel. To enable a different weighting of local and global structure, we introduce a novel Adaptive Kernel, as a generalization of the Heat Kernel, by introducing two exponents γ^w and γ^b:

    f^w(||x_i − x_j||) = exp(−||x_i − x_j||^2 / (σ^w_i σ^w_j)^{γ^w})        (16)
    f^b(||x_i − x_j||) = exp(−||x_i − x_j||^2 / (σ^b_i σ^b_j)^{γ^b})        (17)

where the parameters γ^w and γ^b play a similar role as t^w and t^b in the Heat Kernel. Both the parameters γ^w, γ^b (in Eqs. (16) and (17)) and the parameters k^w, k^b (in Eqs. (14) and (15)) control the balance of local and global structure. If the values of k or γ increase, the balance is shifted from the local structure towards the global structure of the data, and vice versa. There is no value of k^w and k^b for which σ^w_i σ^w_j and σ^b_i σ^b_j approximate 1 in Eqs. (12) and (13). The reduction to the Heat Kernel with t^w = t^b = 1 can only be achieved by setting both γ's to zero.

2.2.3. An artificial example

The impact of the different kernels is difficult to present in a visual way. Fig. 1 shows these differences in a simplified artificial example for different combinations of the graph construction method (fully connected vs. Nearest Neighbor) and the kernel ("Simple-Minded" kernel vs. exponentially decaying kernel). To that end, a set of 100 points was randomly drawn from a two-dimensional Gaussian distribution. The number i on the horizontal axis in Fig. 1 refers to the i-th nearest neighbor measured from the mean of the set; along the vertical axis, the output of each of the four different kernels is displayed.

Fig. 1. Local weights as a function of between-token distance, according to four different combinations of graph construction methods and kernels. The data set is an artificial set consisting of 100 randomly generated observations from a two-dimensional normal distribution. For an explanation of the four different curves see the text.

The graph construction methods and kernels shown are:

Complete Graph with Simple-Minded Kernel: This combination is depicted by the dotted line (black) in Fig. 1. In this setting all pairs obtain the same weight, regardless of the distance between the members. For the within-class scatter this setting is identical to classical LDA (Fisher, 1936).

Complete Graph with Exponentially-Decaying Kernel: The behavior of the Heat Kernel with t^w = 1.5 in Eq. (10) is shown by the dashed-dotted line (red(2)). It can be observed that Exponentially-Decaying Kernels may result in a relatively heavy tail.

Nearest Neighbor Graph with Simple-Minded Kernel: The dashed line (black) represents the output of the Simple-Minded Kernel after setting L^w in Eq. (6) (determining the size of the within-class adjacency graph) to 40. As a result the weights in the NN graph differ from those in the complete graph, with a discontinuous transition at the 40th sample.

Nearest Neighbor Graph with Exponentially-Decaying Kernel: This combination (solid line, red) emphasizes the importance of the closer neighbors. The weights decrease monotonically from the start, and drop to zero at the 40th sample.

(2) For interpretation of color in Fig. 1, the reader is referred to the web version of this article.

Fig. 2. Illustration of the integrated classifier. For all classifiers, the cascade consists of three different steps. The first step is the PCA step, which is present for all classifiers. The second step is either the conventional LDA, one out of six different dimensionality reduction algorithms (extensions of LDA/FDA), or the baseline in which there is no LDA step. The third (final) step consists of a weighted kNN (WkNN). The parameters on the arrows show the model parameters that must be optimized. In all cases, the parameters k, s of the WkNN in the third step must be optimized as well.
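The edge-weighting kernels of Section 2.2.2 can be sketched as follows. This is a hedged numpy illustration, not the authors' code; in particular, the local scales here are computed over the whole set, a simplification of the per-class definitions in Eqs. (14) and (15):

```python
import numpy as np

def heat_weights(d, t):
    """Heat Kernel of Eqs. (10)-(11): exp(-d^2 / t)."""
    return np.exp(-d ** 2 / t)

def adaptive_weights(d, sigma_i, sigma_j, gamma):
    """Adaptive Kernel of Eqs. (16)-(17): exp(-d^2 / (sigma_i*sigma_j)^gamma)."""
    return np.exp(-d ** 2 / (sigma_i * sigma_j) ** gamma)

def local_scales(X, k):
    """sigma_p as in Eqs. (14)-(15): distance to the k-th nearest neighbor.
    For simplicity the neighbors are taken over the whole set here, whereas
    the paper restricts them to the same-class / other-class subsets."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    return np.sort(d, axis=1)[:, k]          # column 0 is the point itself

X = np.random.randn(100, 2)                  # cf. the artificial set of Fig. 1
sig = local_scales(X, k=7)
d01 = np.linalg.norm(X[0] - X[1])
w_heat = heat_weights(d01, t=1.5)            # Complete Graph + Heat Kernel
w_adap = adaptive_weights(d01, sig[0], sig[1], gamma=1.0)
```

Note that with gamma = 0 the Adaptive Kernel reduces to the Heat Kernel with t = 1, as stated at the end of Section 2.2.2.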

3. The integrated classifier

The classifier used for assigning a phone class to each input vector is shown in Fig. 2. It is composed of three modules, connected in a pipeline. First, as in the classical approach in classification tasks with very high-dimensional features (Fidler et al., 2006; Sha, 2007; Halberstadt, 1998), Principal Component Analysis (PCA) (Hotelling, 1933) is performed on the input vectors for orthogonalization and removal of redundancy. We keep the 150 dimensions that together account for almost all of the variance in the data. In Yang and Yang (2003) it is shown that the PCA transformation does not eliminate information that helps to discriminate between the classes. Recent classification techniques based on Deep Neural Networks (DNN) (Hinton et al., 2012) avoid this PCA step, which might call the importance or necessity of the PCA step into question. In our approach, the most important goal of the PCA step is dimension reduction and avoidance of redundancy. In general, it depends on the data whether a PCA step before an LDA is desirable. If the number of points is comparable to the number of dimensions, or smaller, a single-step LDA will usually overfit. Reducing the

dimensionality with PCA before LDA may prevent overfitting (the PCA step thereby acting as a regularization step) and may therefore increase the stability and performance of the LDA. Importantly, the PCA step also increases the scalability of the classification problem when the embedding dimension gets very large: thanks to the PCA step, the class-separating LDA step can be formulated in a lower and more functional dimensionality. It is useful to realize that the DNN approaches must likewise deal effectively with redundancy and efficiency in feature extraction; in these network approaches, the use of bottleneck features constitutes an alternative way to serve this purpose.

In the second step, the dimension of the PCA output feature vectors is further reduced (to 47, i.e. the number of phone classes minus 1) by means of one of the six dimensionality reduction algorithms (all FDA/LDA extensions: two types of connectivity and three kernels) described in Section 2.2.2 and shown in Fig. 2. The dimensionality reduction by means of LDA is supposed to enhance the separation between the 47 phone classes in the feature space. To provide a baseline, we include a seventh option, which consists of not using any LDA transformation or other dimensionality reduction in the second step.

In the third and final step, the reduced feature vectors are classified using a common back end, a weighted k-Nearest Neighbor classifier. The k-NN classifier is commonly used to evaluate the performance of dimensionality reduction algorithms (e.g. Sugiyama and Roweis, 2007; Yan et al., 2007). The weights in the k-NN classifier are determined by a Heat Kernel, which is used to reduce the impact of points that are not in the direct neighborhood of an unlabeled data point. The label that is assigned to an unlabeled new point x is determined by a weighted majority vote among the labels in the neighborhood of x. If x_1, x_2, ..., x_k denote the k nearest neighbors of the observation x, then the weights of these k neighbors are specified as follows:

    w_i = exp(−||x_i − x||^2 / s),   i = 1, 2, ..., k        (18)

with k denoting the number of nearest neighbors considered and s a scaling parameter that determines the impact of the distance between points on the weight. After evaluating Eq. (18) on all k points in the neighborhood of x, x is assigned the label of the class whose members yield the highest accumulated score (weighted majority score).

4. Data and acoustic processing

4.1. Data: TIMIT

We compare the local and global dimensionality reduction approaches by means of a phone classification task on the TIMIT corpus (Garofolo, 1988). Although TIMIT is over 20 years old, it is still one of the most frequently used speech corpora for the purpose of phone classification and identification, as evidenced by the large number of papers at recent speech conferences that use TIMIT as the primary test bed for phone classification. TIMIT uses 61 different labels that cover all phones of American English. To evaluate the acoustic models, we mapped the 61 phone labels to 48 phones (in the same way as in Lee and Hon (1989)). The data set used to train all the classifiers is the standard NIST training set, which includes 462 speakers, 3696 utterances, and 139,852 phones. We also use the development set exactly as proposed in Halberstadt (1998), which comprises 50 speakers, 400 utterances, and 15,038 phone tokens. This set is referred to as set D. The core test set, containing 24 speakers, 192 utterances, and 7196 phone tokens, will be referred to as set C.
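The weighted k-NN back end of Section 3 (Eq. (18)) can be sketched in a few lines. This is an illustrative numpy sketch with toy data, not the authors' implementation:

```python
import numpy as np

def wknn_classify(x, X_train, y_train, k=25, s=4.5):
    """Weighted k-NN vote of Eq. (18): w_i = exp(-||x_i - x||^2 / s).

    k=25 and s=4.5 are the values selected by the grid search in Section 5.1.
    """
    d = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(d)[:k]                        # k nearest neighbors
    w = np.exp(-d[nn] ** 2 / s)                   # Heat-Kernel weights
    scores = np.bincount(y_train[nn], weights=w)  # weighted majority score
    return int(np.argmax(scores))

# toy two-class data: one cluster at the origin, one at (1, 1)
X_train = np.vstack([np.zeros((10, 2)), np.ones((10, 2))])
y_train = np.array([0] * 10 + [1] * 10)
print(wknn_classify(np.array([0.1, 0.0]), X_train, y_train, k=5))  # 0
```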
Most papers on TIMIT phone classification report performance on set C, while using set D for tuning purposes. As will be explained in more detail in Section 5, we will also report results in which the roles of sets C and D are swapped, so that set C is used as a development set and set D as the independent test set. By doing so, we investigate the sensitivity of the different LDA extensions to a specific data set.

4.2. Data preprocessing

To generate the basic feature vectors, a Short-Time Fourier Transform is performed every 10 ms with a 25 ms Hamming window. The spectra are then converted to 13 MFCCs (c_0, c_1, ..., c_12). To capture information from the articulatory context, 23 frames are concatenated around the center frame of each phone. As a result, each segment is represented by the 11 preceding frames, the center frame itself, and the 11 succeeding frames, which results in 13 × 23 = 299-dimensional feature vectors. We determined the maximum and minimum values of the 13 coefficients across the full training database and then used these values to map the coefficients into the interval [0, 1]. Next, the 299-dimensional features are orthogonalized and part of the redundancy is removed by means of Principal Component Analysis (PCA) (Fidler et al., 2006). The first 150 eigenvectors account for 97% of the variance in the original 299-dimensional vectors. The MFCC frames in the development and test sets were mapped into the (approximate) interval [0, 1] using the normalization coefficients obtained with the training set. Subsequently, the 299-dimensional stacks of normalized frames, centered around the middle of the segments, were projected into a 150-dimensional space by means of the PCA matrix obtained with the training data. The resulting 150-dimensional feature vectors, together with their corresponding labels from TIMIT, were then used to evaluate the eight dimensionality reduction algorithms shown in Fig. 2.

5. Experiments

In this section we present a number of experiments. In Section 5.1 we discuss an explorative grid search that was applied to determine the optimal parameter setting for each classifier; the criterion in the search was the eventual classification performance. In Section 5.2 we present and discuss the classification results obtained with the different dimensionality reduction methods. In Section 5.3 we investigate possible differences in the impact of dimensionality reduction on classification performance in different broad phonetic classes.

5.1. Explorative grid search

As can be seen from Fig. 2, most of the classification approaches used here depend on two or more parameters.
In order to optimize these, we tuned the parameters of the LDA extensions summarized in Fig. 2 in conjunction with the parameters (k, s) of the WkNN classifier. For this parameter optimization, we used a two-step strategy. In the first step, a coarse grid was applied to determine the upper and lower bounds between which the optimal value of each parameter could be expected. In the second step, we performed a search on a finer grid based on the lower and upper bounds found in the first step. The fine grid search was skipped if the results for the coarse grid points did not show significantly different values.

Traditional LDA and Complete Graph with Simple-Minded Kernel: except for k and s in the weighted k-NN, these methods do not require any parameter optimization. For the optimization of k and s, see below.
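The two-step strategy can be sketched generically as follows; `evaluate` is a hypothetical stand-in for the development-set accuracy of the full cascade (here a smooth toy function peaking near the values reported below for the WkNN classifier), and the bracketing of the fine grid around the coarse optimum is our own simplification.

```python
import numpy as np

def evaluate(k, s):
    # Hypothetical stand-in for development-set accuracy; peaks near k=25, s=4.5.
    return -((k - 25) / 20.0) ** 2 - ((s - 4.5) / 2.0) ** 2

def grid_search(ks, ss):
    # Return the (k, s) pair with the highest score on the given grid.
    scores = {(k, s): evaluate(k, s) for k in ks for s in ss}
    return max(scores, key=scores.get)

# Step 1: coarse grid over the full plausible ranges.
k0, s0 = grid_search(range(1, 61, 5), np.arange(1.0, 8.1, 1.0))

# Step 2: fine grid bracketing the coarse optimum with smaller steps.
k1, s1 = grid_search(range(max(1, k0 - 5), k0 + 6),
                     np.arange(max(1.0, s0 - 1.0), s0 + 1.01, 0.25))

print(k1, s1)  # → 25 4.5
```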

H. Huang et al. / Speech Communication 76 (2016) 28–41


5.1.1. Weighted k-NN classifier

For the weighted k-NN classifier, two parameters (k, s) must be optimized. The classification performance was explored on a coarse grid defined by the parameter ranges 1 ≤ k ≤ 60 and 1 ≤ s ≤ 8, using a step size of 1 for both k and s. From this search we concluded that the parameter ranges 15 ≤ k ≤ 40 and 3 ≤ s ≤ 7 were the most promising intervals. With a smaller step size of 0.25 for s, the performance of all LDA variants under study did not vary significantly. Therefore, we chose k = 25 and s = 4.5 for the weighted k-NN classifier in the remainder of this paper. The optimal parameter values obtained from this grid search are summarized in Table 1 for optimization using set D and in Table 2 for optimization using set C.

Complete Graph with Heat Kernel: in order to balance the importance of the local and the global structure, the parameters tw and tb were first coarsely investigated in the range 0 ≤ tw, tb ≤ 5 with a step size of 0.25 for both parameters. The fine grid was defined as 1.5 ≤ tw ≤ 2.5 and 0.8 ≤ tb ≤ 2.2, using a step size of 0.05 for both parameters.

Complete Graph with Adaptive Kernel: the coarse grid search was done for cw and cb in the range 0 ≤ cw, cb ≤ 2 with a step size of 0.05 for both parameters. Next, the fine grid search focused on the intervals 0.9 ≤ cw ≤ 1.5 (using a step size of 0.02) and 0.3 ≤ cb ≤ 1.3 (using a step size of 0.05). As mentioned in Section 2.2.2, the Adaptive Kernel contains two additional parameters kw and kb, which also affect the trade-off between local and global structure. In Huang et al. (2011) we found that the best performance could be obtained by optimizing the c parameters with fixed values of the two k parameters. Therefore, we set kw = 7 (as recommended in Zelnik-Manor and Perona (2004)) and kb = 50, based on the ratio between the total number of tokens and the number of individual classes.

Nearest Neighbor Graph with Simple-Minded Kernel: the parameters Lw and Lb in Eqs. (6) and (7) specify the number of nearest neighbors to be included in the adjacency graph for the within-class and between-class scatter, respectively. To be able to discover the impact of small local neighborhoods, we explored Lw and Lb starting from small natural numbers; it was found that 5 ≤ Lw ≤ 16 and 10 ≤ Lb ≤ 100 (using step sizes of 1 and 5, respectively) were the relevant ranges.

Nearest Neighbor Graph with Heat Kernel: a coarse grid search was defined by 0 ≤ tw, tb ≤ 5 (with a step size of 0.25 for both parameters). No significant differences were obtained, so the fine grid search was skipped.

Nearest Neighbor Graph with Adaptive Kernel: in this case, the coarse grid was defined by 0 ≤ cw, cb ≤ 2 with a step size of 0.05, while kw = 7 and kb = 50 were fixed. Also in this case, the coarse grid did not give rise to a fine search in a restricted interval.

5.2. Performance comparison: phone classification accuracy

In this section, we compare the classification methods shown in Fig. 2. We do so by comparing the classification accuracies obtained with the different methods in combination with the parameter settings specified in Tables 1 and 2. The results obtained by optimizing on set D and testing on the core test set C are shown in the column "Set C" of Table 3; the results obtained by optimizing on set C and testing on set D are in the column "Set D". The top row of that table can be considered a baseline performance, obtained with the 150-dimensional PCA vectors, that is, without any discriminative dimensionality reduction.

From this table, a number of conclusions can be drawn. First, all dimensionality reduction methods yield much higher accuracy than a plain weighted k-NN classification based on the 150-dimensional vectors retained after PCA. This shows that reducing the dimension of the high-dimensional MFCC stacks leads to improved results in all cases. Second, the difference between conventional LDA and Local LDA with the complete graph and Simple-Minded Kernel is negligible. Thus, the difference between the definitions of the between-class scatter in Eqs. (7) and (5) does not seem to have a substantial impact on our data sets. This suggests that the graph-based LDA is primarily determined by the local acoustic structure. Third, the methods using exponentially decaying kernels outperform LDA with the Simple-Minded Kernel, again suggesting that the use of the local structure of the data is beneficial. Fourth, when swapping the roles of set C and set D, that is, testing on set D and tuning on set C, the performance of all three NN Graph methods is significantly better than that of the best Complete Graph method. Although the differences between the Complete Graph and the NN Graph methods on set C are not statistically significant, we see a similar trend: limiting the adjacency graph to a relatively small

Table 1
Eventual optimal parameter settings (based on a fine grid search) for each local method, obtained by optimization on set D.

Algorithm                              tw     tb     cw     cb     Lw   Lb
Complete Graph with Heat Kernel        1.85   1.45   –      –      –    –
Complete Graph with Adaptive Kernel    –      –      0.94   1.22   –    –
NN Graph with Simple-Minded Kernel     –      –      –      –      12   60
NN Graph with Heat Kernel              1.50   2.50   –      –      10   60
NN Graph with Adaptive Kernel          –      –      1.10   1.30   12   55


Table 2
Eventual optimal parameter settings (based on a fine grid search) for each local method, obtained by optimization on set C.

Algorithm                              tw     tb     cw     cb     Lw   Lb
Complete Graph with Heat Kernel        2.15   1.35   –      –      –    –
Complete Graph with Adaptive Kernel    –      –      1.16   1.02   –    –
NN Graph with Simple-Minded Kernel     –      –      –      –      8    75
NN Graph with Heat Kernel              1.25   2.50   –      –      11   55
NN Graph with Adaptive Kernel          –      –      1.40   1.25   12   85

Table 3
Classification accuracies. The column "Set C" shows the accuracy on the core test set C, using set D for development. The column "Set D" shows the accuracy when testing on set D, using set C for development.

Algorithm                                  Set D          Set C
150 dimensions after PCA                   68.92 ± 0.74   68.12 ± 1.08
Traditional LDA                            74.26 ± 0.70   73.66 ± 1.02
Complete Graph with Simple-Minded Kernel   74.03 ± 0.70   73.49 ± 1.02
Complete Graph with Heat Kernel            75.01 ± 0.69   74.82 ± 1.00
Complete Graph with Adaptive Kernel        75.30 ± 0.69   74.97 ± 1.00
NN Graph with Simple-Minded Kernel         76.93 ± 0.67   75.58 ± 0.99
NN Graph with Heat Kernel                  76.99 ± 0.67   75.56 ± 0.99
NN Graph with Adaptive Kernel              76.91 ± 0.67   75.61 ± 0.99

Fig. 3. Surface plot pertaining to set D. The figure shows the performance gain of the NN graph method with the Simple-Minded Kernel over the LDA method as a function of Lw and Lb. Numbers on the z-axis are proportions. For a more detailed description see the text.

neighborhood both for within- and between-class data points is a more effective way of capturing local structure than exponentially decreasing the weights of the distances in the complete graph. Also, the fact that in both test sets the performance with the Simple-Minded Kernel in the NN Graph methods does not differ significantly from the performance with the exponentially decaying kernels corroborates the finding that limiting the graph is more effective than weighting the distances.

5.2.1. The impact of Lw and Lb in the Nearest Neighbor Graph

Table 3 shows that the NN Graph methods provide the best performance. More interesting than the optimal results

themselves may be the insight into how sensitive the performance of the cascaded classification systems is to changes in the trade-off between local and global information. The decisive parameters in these methods are Lw and Lb, which are used to construct the nearest-neighbor sets around each point. Therefore, we investigated the impact of these parameters on the performance in more detail. Fig. 3 shows a surface plot of the performance on set D (the condition in which the advantage of the NN method over classical LDA and the Complete Graph methods was largest) as a function of Lw (on the x-axis) and Lb (on the y-axis). The z-axis (vertical) shows the difference in classification accuracy between the NN method with Simple-Minded Kernel and the conventional LDA


method. To aid interpretation, the figure includes two additional transparent horizontal planes. The lower one is located at z = 0 and represents the performance level of the conventional LDA method. The upper plane is introduced to make the regions with superior performance visually conspicuous. In addition, contour plots are shown on the base plane. Both the contour plots and the regions in Fig. 3 where the performance is significantly better than classical LDA show that the best performance is reached for values of Lw and Lb that are roughly specified by the relation Lb ≈ 5 Lw, with a minimum value of Lw = 7. This relation can be interpreted in an interesting manner: it is not surprising that the between-class neighborhood should be substantially larger than the within-class neighborhood. It can be seen that the performance of the NN Graph method drops below classical LDA when either neighborhood becomes very small (meaning that only very few data points are considered) or when the size of the between-class neighborhood is no longer substantially larger than the within-class neighborhood.

5.2.1.1. Stability of the optimal parameters. Tables 1 and 2 suggest both similarities and differences between the parameter values that yield optimal results when using set D or set C for parameter tuning. As can be seen from the intervals in which no significant performance differences were observed for the parameters tw, tb, cw, cb in the exponentially decaying kernels, the differences between the optimal values of these parameters are probably due to idiosyncratic numerical properties of the data sets. With respect to the values of Lw and Lb in the NN Graph methods, it might seem that there is a real difference between the two sets, especially for the NN Graph with Simple-Minded Kernel, where the best result suggests that Lb


should be almost 10 times as large as Lw for set C. However, as can be seen from Fig. 4, there seem to be two regions in which performance significantly better than classical LDA can be obtained: one region similar to what we saw for set D (i.e., optimal performance if Lb ≈ 5 Lw), and another region at relatively high values of Lb combined with small values of Lw. We suspect that the finding that the absolute maximum happened to fall in the region with high values of Lb is caused by the relatively small size of the test set. To check this, we performed a grid search for the optimal values of Lw and Lb with the Simple-Minded Kernel on the combined test sets. The results (not shown here) indeed clearly confirmed that overall the best performance is achieved for Lb ≈ 5 Lw.

5.3. Performance comparison: broad phonetic class confusion

The overall accuracy is a rather crude way to assess classification methods. Specifically, overall performance does not allow us to draw conclusions about the potential role of manifolds in the data space that might be modeled more accurately by LDA variants that take local structure into account. Therefore, we investigate how a local method performs in distinguishing broad phonetic classes and in separating phones within the individual broad classes. This analysis should show how local methods trade between-broad-class accuracy (a problem in which local manifolds are probably not very important) for within-broad-class accuracy (where local structure might well be important). As a representative of the local methods, the NN graph combined with the Simple-Minded Kernel is chosen, because of its simplicity and its superiority over the other local methods. To enhance comparability with previous results

Fig. 4. Surface plot pertaining to set C. The ﬁgure shows the performance gain of the NN graph method with the Simple-Minded Kernel over the LDA method as a function of Lw and Lb . For a more detailed description see the text.


Table 4
Confusion matrix defined in terms of broad phonetic classes. Seven broad classes have been defined: plosives (PL), strong fricatives (SF), weak fricatives (WF), nasals/flap (NS), semi-vowels (Se-V), short vowels (Sh-V) and long vowels (Lo-V). The accuracies are given as percentages. Each cell contains two accuracies: the first is based on the local method (NN graph and Simple-Minded Kernel), the second (between parentheses) on conventional LDA; the number with ± refers to the 95% confidence interval. The columns refer to the 'real' reference label and the rows to the hypothesized label. In each diagonal cell, the bold number refers to the best classifier in that cell. All silences have been excluded from the analysis.

Hyp\Ref  PL                 SF                 WF                 NS                 Se-V               Sh-V               Lo-V
PL       96.1 (96.2 ± 1.2)  1.6 (1.8)          4.1 (4.3)          0.2 (0.3)          0.6 (0.5)          0.2 (0.2)          0.5 (0.4)
SF       1.8 (1.3)          95.2 (95.2 ± 1.6)  2.4 (4.3)          0.0 (0.0)          0.0 (0.0)          0.0 (0.0)          0.0 (0.0)
WF       1.1 (1.3)          3.1 (2.8)          88.4 (84.2 ± 4.4)  1.4 (2.1)          0.8 (0.7)          0.2 (0.2)          0.5 (1.0)
NS       0.8 (1.8)          0.0 (0.0)          0.7 (1.4)          95.1 (92.6 ± 1.7)  2.3 (2.7)          1.4 (1.9)          1.8 (1.8)
Se-V     0.1 (0.0)          0.0 (0.0)          3.8 (4.7)          2.2 (2.9)          85.2 (81.9 ± 2.4)  2.0 (3.0)          10.8 (11.4)
Sh-V     0.0 (0.1)          0.0 (0.0)          0.3 (0.7)          0.5 (1.7)          4.7 (5.4)          85.8 (83.9 ± 1.9)  17.0 (18.1)
Lo-V     0.1 (0.1)          0.0 (0.2)          0.3 (0.4)          0.7 (0.3)          6.4 (8.3)          11.1 (11.5)        69.3 (67.3 ± 2.9)
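Collapsing phone-level decisions into broad-class confusions like those in Table 4 can be sketched as follows; the phone-to-class mapping shown is a tiny illustrative subset, not the paper's full 48-phone table.

```python
import numpy as np

# Tiny illustrative subset of the phone -> broad-class mapping (assumed here,
# not the paper's full 48-phone table).
broad = {'p': 'PL', 't': 'PL', 's': 'SF', 'z': 'SF', 'm': 'NS', 'n': 'NS'}
classes = ['PL', 'SF', 'NS']
idx = {c: i for i, c in enumerate(classes)}

def broad_confusion(refs, hyps):
    """Rows: hypothesized broad class; columns: reference broad class (as in Table 4)."""
    M = np.zeros((len(classes), len(classes)))
    for r, h in zip(refs, hyps):
        M[idx[broad[h]], idx[broad[r]]] += 1
    # Column-normalize to percentages per reference class.
    return 100.0 * M / M.sum(axis=0, keepdims=True)

refs = ['p', 't', 's', 'z', 'm', 'n', 'p', 's']
hyps = ['t', 'p', 's', 's', 'n', 'm', 's', 'z']   # within-class errors stay on the diagonal
M = broad_confusion(refs, hyps)
print(M[idx['PL'], idx['PL']])  # ≈ 66.67: one of three PL tokens left the class
```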

in the literature, we perform the analysis for the core test set (set C), using the optimal parameter values Lw = 12 and Lb = 60 from Table 1. The classification results are presented in two tables. In Table 4 the 48 phone classes are categorized into seven broad phonetic classes (Halberstadt and Glass, 1997; Reynolds and Antoniou, 2003): plosives (PL), strong fricatives (SF), weak fricatives (WF), nasals (NS), semi-vowels (Se-V), short vowels (Sh-V) and long vowels (Lo-V). Each cell of Table 4 contains the result of the NN Graph method and (in parentheses) the result of traditional LDA. The diagonal also shows the 5% two-sided confidence interval of the binomial distribution. Columns represent the reference (true) label and rows the label assigned by the classifiers. The cells on the diagonal represent the correct classification rates of the broad phonetic classes, while the off-diagonal cells refer to confusions between different broad phone classes. The table shows that for nearly all broad phone classes the NN Graph method outperforms the conventional LDA method, with the plosives as the exception. The improvements for NS, Se-V and Sh-V are significant, while the improvements for WF and Lo-V approach significance. The three classes that show significant improvement have in common that they contain voiced phonemes characterized by substantial dynamic changes within the 230 ms interval covered by the 23-frame blocks. Nasals are characterized by changes in the vocal tract shape as well as changes in the degree of acoustic coupling with the nasal tract. Semi-vowels are likewise characterized by opening and closing movements of the vocal tract, even if no full closure is reached. Short vowels, which have an average duration of no more than 100 ms, will have full transitions from the preceding and into the succeeding consonant within the 230 ms interval.
The absence of a significant improvement for the long vowels, which include the inherently dynamic diphthongs, may be due to the fact that the diphthongs make up less than half of the tokens in that class. If the gain of the NN Graph method were related to a more accurate representation of the dynamics in a 230 ms interval, one would also expect a substantial improvement for the plosives. However, if anything, the performance for the plosives deteriorates (albeit non-significantly). It remains to be seen whether this is because the plosives do not form 'compact' manifolds; it might also be because MFCC coefficients, which are derived from 25 ms windows, do not capture the details of the extremely rapid transitions in the releases. MFCCs may also miss part of the relevant detail in fricatives, where a larger proportion of the information is captured by the residual signal, which is no longer available.

Table 5 contains the results for within-class confusions. The error rate (in percent) within the individual broad phonetic classes obtained with the NN Graph classifier is shown in the upper row; the error rate obtained with classical LDA is shown in the bottom row, together with the p-values obtained from a McNemar test of the significance of the differences between the two LDA variants. There are marginal and non-significant differences in both directions. The only significant difference is found for the strong fricatives. A detailed look at the data showed that the voiced/voiceless confusion between /s/ and /z/ is reduced when using the NN graph: specifically, /z/ is less often classified as /s/.

6. Discussion

The main objective of our study was to investigate to what extent knowledge about the distributions of the acoustic representations of phones, expressed in the form of neighborhood structure or manifolds, might be exploited for speech signal processing. Our goal is to gain understanding, rather than to develop a particular step in an ASR processing cascade with the minimization of error rates as its single aim. To that end, we applied a frequently used, admittedly small-scale phone classification task. In addition, we investigated whether a newly designed extension of a Linear Discriminant Analysis (LDA) variant outperforms the less elaborate LDA variant on which it was based.
It appears that reducing the original 299-dimensional acoustic space to 47 dimensions by means of a cascade of PCA and conventional LDA yields a signiﬁcant improvement over reducing the original acoustic space to 150


Table 5
Phone error rates (in percent) per broad phonetic class. Top row: NN graph with Simple-Minded Kernel; bottom row: conventional LDA. The numbers in parentheses in the bottom row are p-values obtained by a McNemar test.

Method     PL            SF            WF           NS            Se-V         Sh-V          Lo-V
NN graph   18.9          14.4          4.7          17.9          7.7          29.2          7.2
LDA        17.4 (0.22)   17.4 (0.04)   4.4 (0.75)   20.5 (0.15)   9.8 (0.06)   28.8 (0.48)   8.2 (0.13)
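The p-values in Table 5 come from McNemar tests on paired classifier decisions. A generic exact version of that test can be sketched as follows; the decision vectors here are toy data, and the paper does not specify whether an exact or an approximate variant was used.

```python
from math import comb

def mcnemar_p(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-token correctness flags."""
    # Only tokens on which the two classifiers disagree are informative.
    n01 = sum(a and not b for a, b in zip(correct_a, correct_b))
    n10 = sum(b and not a for a, b in zip(correct_a, correct_b))
    n, k = n01 + n10, min(n01, n10)
    if n == 0:
        return 1.0
    # Exact binomial tail under the null hypothesis p = 0.5, doubled (two-sided).
    p = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * p)

# Toy example: classifier A fixes 12 of B's errors, while B fixes only 3 of A's.
a = [True] * 85 + [False] * 3 + [True] * 12
b = [True] * 85 + [True] * 3 + [False] * 12
print(round(mcnemar_p(a, b), 3))  # → 0.035
```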

dimensions by means of PCA alone. In addition, the graph-based LDA algorithms used in this paper were able to obtain a substantial additional gain in performance compared to conventional LDA. The difference between conventional LDA and graph-based LDA with the fully connected graph and Simple-Minded Kernel is negligible, but with the two exponentially decaying kernels (Heat and Adaptive) performance did improve. The best results were obtained with the LDA variants that use only small local neighborhoods. However, there is no difference between the performance with the three kernels used for weighting the neighborhood distances: the Simple-Minded Kernel, the Heat Kernel and the Adaptive Kernel. Thus, it seems that the added complexity of the newly proposed Adaptive Kernel cannot be justified by improved accuracy. However, we can argue that the stability of the improvement of the small-local-neighborhood variant of LDA across different kernels convincingly shows that the exact type of metric used in graph-based LDA is less relevant than the distinction between global and local information.

Fig. 1 provides a clue for explaining this result. When using relatively small neighborhoods, as in this paper, the difference between equal weights and exponentially decaying weights may turn out to be too small to have a significant effect. It remains to be seen whether the Heat Kernel and/or the Adaptive Kernel can improve performance in tasks where the optimal neighborhood size is much larger. Another way of formulating the explanation for the small difference between the kernels in the case of NN graphs is that the locality information is already sufficiently captured by using partial neighborhoods in the NN graph construction. An additional cause for the failure of the exponentially decaying kernels to yield improvements may be the density of the population of the samples.
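The contrast between the kernels can be made concrete with a small sketch. The heat kernel below uses the standard exp(-d²/t) form, and the adaptive kernel uses local scaling in the style of Zelnik-Manor and Perona (2004), with each point's bandwidth set to its distance to its k-th nearest neighbor; these are assumed forms, since the paper's equations are not reproduced in this excerpt.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances

# Heat kernel: a single global bandwidth t for every edge.
t = 2.0
W_heat = np.exp(-D**2 / t)

# Adaptive kernel: per-point bandwidth sigma_i = distance to the k-th nearest
# neighbor (local scaling in the style of Zelnik-Manor and Perona, 2004).
k = 7
sigma = np.sort(D, axis=1)[:, k]
W_adapt = np.exp(-D**2 / np.outer(sigma, sigma))

# Average edge weight over each point's 12 nearest neighbors: inside such a
# small, densely populated neighborhood the decaying weights vary little, so
# they act much like the uniform (Simple-Minded) weighting up to a scale factor.
nn = np.argsort(D, axis=1)[:, 1:13]
rows = np.arange(len(X))[:, None]
print(W_heat[rows, nn].mean(), W_adapt[rows, nn].mean())
```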
The value of the decaying exponentials as a function of the distance to a specific point x decreases smoothly and slowly for the optimal values of the parameters (cf. Tables 1 and 2). Therefore, the edge weights obtained with an exponentially decaying kernel (Heat Kernel or Adaptive Kernel) do not differ much from those based on the Simple-Minded Kernel, especially if the neighborhood is densely populated.

We investigated the performance of the algorithms in two conditions, in which the roles of set C and set D were swapped. The performance gain achieved on set D is larger than that on the TIMIT core test set (set C). This difference is most likely due to differences in the design of these sets. All sentences in the core test set (set C) are different from the

sentences in the training set and the development set (set D) (Halberstadt, 1998). This makes for optimal independence between the test material on the one hand and the training/tuning material on the other. However, there is some overlap between the sentences in the training set and the development set. Therefore, when set D serves as test set, it is not completely independent of the training material (even if it does not overlap with the material in set C used for tuning). Moreover, set D is twice as large as set C, leading to narrower confidence intervals in the statistical analyses. At the same time, it is encouraging to see that the optimal values of the parameters in the dimensionality reduction methods do not differ greatly between the two sets. This strengthens the belief that the local dimensionality reduction methods model structure in the acoustic space that is related to speech production, rather than simply modeling the details of a specific data set better.

An analysis of the phone confusions indicates that preserving the neighborhood structure of a small number of nearest neighbors helps to achieve a higher classification accuracy. This finding is potentially useful as an alternative approach to discriminative acoustic modeling in automatic speech recognition systems. Discriminative training in ASR mostly refers to modeling techniques in which acoustic models obtained by means of EM training are updated to minimize recognition errors. The local LDA approaches presented in this paper show how neighborhood structure in the feature space can be exploited to obtain posterior probabilities of the sub-word units, which might replace the likelihoods obtained with more conventional discriminative acoustic models.

Our attempt to harness the manifold structure in a high-dimensional acoustic space by means of LDA extensions is related to approaches based on Neighborhood Components Analysis (NCA). From the results in Singh-Miller et al.
(2007), it might be inferred that NCA can yield slightly better classification performance than manifold learning by means of graph-based LDA. However, NCA leaves us with a single transformation matrix that is optimized in terms of class separation, and it is difficult to interpret this matrix in terms of the neighborhood structure in the data, which was one of the goals of our research. In addition, our analysis of the confusions between and within broad phonetic classes suggests that a single transformation matrix is not an optimal solution. Our findings are in line with findings about the neighborhood structure in acoustic data (and its phone dependency) as reported in Jansen and Niyogi (2013).
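The graph-based LDA variants discussed throughout can be sketched as follows. Because Eqs. (5)-(7) are not reproduced in this excerpt, the scatter definitions below use a common graph-embedding form (pairwise outer products restricted to each point's within-class and between-class nearest neighbors, with uniform weights as in the Simple-Minded Kernel); this is an illustration of the technique, not the paper's exact formulation.

```python
import numpy as np

def knn_graph_lda(X, y, L_w=12, L_b=60, out_dim=2):
    """Sketch of NN-graph LDA with a Simple-Minded (uniform-weight) kernel."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    order = np.argsort(D, axis=1)                               # neighbors by distance

    S_w = np.zeros((X.shape[1],) * 2)   # within-class scatter over local graph
    S_b = np.zeros_like(S_w)            # between-class scatter over local graph
    for i in range(n):
        same = [j for j in order[i, 1:] if y[j] == y[i]][:L_w]  # within-class NNs
        diff = [j for j in order[i, 1:] if y[j] != y[i]][:L_b]  # between-class NNs
        for j in same:
            d = (X[i] - X[j])[:, None]
            S_w += d @ d.T
        for j in diff:
            d = (X[i] - X[j])[:, None]
            S_b += d @ d.T

    # Maximize between- vs. within-class scatter: generalized eigenproblem.
    evals, evecs = np.linalg.eig(np.linalg.solve(S_w + 1e-6 * np.eye(len(S_w)), S_b))
    idx = np.argsort(evals.real)[::-1][:out_dim]
    return evecs.real[:, idx]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(40, 3)) for c in ([0, 0, 0], [2, 0, 0], [0, 2, 0])])
y = np.repeat([0, 1, 2], 40)
V = knn_graph_lda(X, y)
print(V.shape)  # (3, 2)
```

With exponentially decaying kernels, the uniform `d @ d.T` contributions would simply be multiplied by edge weights such as exp(-||d||²/t) before summation.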


From the very start we intended to use phone classification accuracy as the criterion measure. This explains the focus on LDA-based algorithms for dimensionality reduction. However, there is substantial overlap between the 48 phones in the acoustic space. As a consequence, attempts to discriminate between indiscriminable labels may not be the best way to discover the underlying manifold structure in acoustic representations of speech signals. Therefore, it is worth investigating whether unsupervised methods for discovering lower-dimensional manifolds, such as mixtures of principal component analyzers (Tipping and Bishop, 1999) or factor analyzers (Chen et al., 2010), might in the end provide a more accurate approximation of the manifold structure. The resulting low-dimensional representations might provide more versatile input for a classifier trained to do phone classification.

7. Conclusions

Graph-based extensions of LDA are able to model the local structure of data embedded in a high-dimensional feature space. We showed the impact of several extensions of LDA that capture local structure in the acoustic feature space for the purpose of phone classification; the local structure is captured by using neighborhood graphs in the dimensionality reduction algorithms. We compared two methods for constructing the graphs (fully or partially connected) and three methods for weighting the edges (uniform weights and two ways of making the weights inversely dependent on the distance between observations). The recognition tokens (phones in the TIMIT database) were represented by stacks of 23 consecutive 13-dimensional feature vectors centered around each phone's midpoint as specified by the phone segmentation. We found that local dimensionality reduction yields a small but significant improvement over conventional LDA.
On the one hand, local methods using a nearest-neighbor graph (NN graph) outperformed methods that account for local structure by applying exponentially decaying weights to the edges in a fully connected graph. On the other hand, the performance of the NN graph methods could not be improved by replacing the uniform kernel with an exponentially decaying kernel.

Further analysis of the NN graph approach showed that near-optimal performance can be achieved for several combinations of the two important parameters, the numbers of nearest neighbors for the within-class and between-class scatters. The results show that the number of between-class neighbors should be about five times larger than the number of within-class neighbors.

The improvement obtained by local dimensionality reduction is not equal across broad phonetic classes. This suggests that the local manifold structure depends on the phonetic class, and most probably class-dependent dimensionality reduction methods can further improve phone classification performance.

Acknowledgements

The authors would like to thank Jort F. Gemmeke (then affiliated with Katholieke Universiteit Leuven) and Yang Liu (Yale University) for their helpful suggestions on this paper. The research of Heyun Huang received funding from the European Community's Seventh Framework Programme [FP7] Initial Training Network SCALE, under Grant agreement No. 213850. Louis ten Bosch received funding from the FP7-SME project OPTI-FOX, project reference 262266.

References

Abramson, I.S., 1982. On bandwidth variation in kernel estimates: a square root law. Ann. Statist. 10 (4), 1217–1223.
Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U., 1999. When is nearest neighbor meaningful? In: Proceedings of the International Conference on Database Theory, pp. 217–235.
Burget, L., 2004. Combination of speech features using smoothed heteroscedastic linear discriminant analysis. In: Proceedings of Interspeech, pp. 2549–2552.
Chen, H.-T., Chang, H.-W., Liu, T.-L., 2005. Local discriminant embedding and its variants. In: Proceedings of Computer Vision and Pattern Recognition, pp. 846–853.
Chen, M., Silva, J., Paisley, J., Wang, C., Dunson, D., Carin, L., 2010. Compressive sensing on manifolds using a nonparametric mixture of factor analyzers: algorithm and performance bounds. IEEE Trans. Signal Process. 58 (12), 6140–6155.
De Wachter, M., Matton, M., Demuynck, K., Wambacq, P., Cools, R., Van Compernolle, D., 2007. Template-based continuous speech recognition. IEEE Trans. Audio, Speech, Lang. Process. 15, 1377–1390.
Erdogan, H., 2005. Regularizing linear discriminant analysis for speech recognition. In: Proceedings of Interspeech, pp. 3021–3024.
Fidler, S., Skocaj, D., Leonardis, A., 2006. Combining reconstructive and discriminative subspace methods for robust classification and regression by subsampling. IEEE Trans. Patt. Anal. Mach. Intell. 28, 337–350.
Fisher, R.A., 1936. The use of multiple measurements in taxonomic problems. Ann. Eugen. 7, 179–188.
Frankel, J., Wester, M., King, S., 2007. Articulatory feature recognition using dynamic Bayesian networks. Comp. Speech Lang. 21 (4), 620–640.
Garofolo, J.S., 1988. Getting Started with the DARPA TIMIT CD-ROM: An Acoustic Phonetic Continuous Speech Database. National Institute of Standards and Technology (NIST), Gaithersburg, MD.
Gemmeke, J., Virtanen, T., Hurmalainen, A., 2011. Exemplar-based sparse representations for noise robust automatic speech recognition. IEEE Trans. Audio, Speech, Lang. Process. 99, 2067–2080.
Gish, H., Ng, K., 1996. Parametric trajectory models for speech recognition. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. 466–469.
Goldberger, J., Roweis, S., Hinton, G., Salakhutdinov, R., 2004. Neighbourhood components analysis. In: Proceedings of Neural Information Processing Systems, pp. 13–18.
Gong, Y., 1997. Stochastic trajectory modeling and sentence searching for continuous speech recognition. IEEE Trans. Speech Audio Process. 5, 33–44.
Haeb-Umbach, R., Ney, H., 1992. Linear discriminant analysis for improved large-vocabulary continuous-speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing, San Francisco, CA, pp. 9–12.
Halberstadt, A.K., Glass, J.R., 1997. Heterogeneous acoustic measurement for phonetic classification. In: Proceedings of Eurospeech, pp. 401–404.

Halberstadt, A.K., 1998. Heterogeneous Acoustic Measurements and Multiple Classifiers for Speech Recognition. Ph.D. thesis, MIT.
Han, Y., de Veth, J., Boves, L., 2007. Trajectory clustering for solving the trajectory folding problem in automatic speech recognition. IEEE Trans. Audio, Speech, Lang. Process. 15 (4), 1425–1434.
He, X., Niyogi, P., 2004. Locality preserving projections. In: Proceedings of Neural Information Processing Systems.
Hermansky, H., 2010. History of modulation spectrum in ASR. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. 5458–5461.
Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., Kingsbury, B., 2012. Deep neural networks for acoustic modelling in speech recognition. IEEE Signal Process. Mag., 82–97.
Hotelling, H., 1933. Analysis of a complex of statistical variables into principal components. J. Edu. Psychol. 24, 417–441, 498–520.
Huang, H., Liu, Y., Gemmeke, J., ten Bosch, L., Cranen, B., Boves, L., 2011. Globality-locality consistent discriminant analysis for phone classification. In: Proceedings of Interspeech.
Huang, Y., Yu, D., Liu, C., Gong, Y., 2014. A comparative analytic study on the Gaussian mixture and context-dependent deep neural network hidden Markov models. In: Proceedings of Interspeech.


Reynolds, T., Antoniou, C., 2003. Experiments in speech recognition using a modular MLP architecture for acoustic modelling. Inf. Sci., 39–54.
Roweis, S.T., Saul, L.K., 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290 (5500), 2323–2326. http://dx.doi.org/10.1126/science.290.5500.2323.
Russell, M., 1993. A segmental HMM for speech pattern modelling. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. 499–502.
Sakai, M., Kitaoka, N., Takeda, K., 2009. Feature transformation based on discriminant analysis preserving local structure for speech recognition. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. 3813–3816.
Seide, F., Li, G., Chen, X., Yu, D., 2011. Feature engineering in context-dependent deep neural networks for conversational speech transcription. In: ASRU 2011. IEEE.
