- Email: [email protected]

Journal Pre-proof

Joint Dimensionality Reduction and Metric Learning for Image Set Classification
Wenzhu Yan, Quansen Sun, Huaijiang Sun, Yanmeng Li

PII: S0020-0255(19)31152-1
DOI: https://doi.org/10.1016/j.ins.2019.12.041
Reference: INS 15084

To appear in: Information Sciences

Received date: 14 April 2019
Revised date: 19 December 2019
Accepted date: 21 December 2019

Please cite this article as: Wenzhu Yan, Quansen Sun, Huaijiang Sun, Yanmeng Li, Joint Dimensionality Reduction and Metric Learning for Image Set Classification, Information Sciences (2019), doi: https://doi.org/10.1016/j.ins.2019.12.041

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier Inc.

Joint Dimensionality Reduction and Metric Learning for Image Set Classification

Wenzhu Yan^a, Quansen Sun^{a,*}, Huaijiang Sun^{a,*}, Yanmeng Li^a

^a School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, PR China

Abstract

Compared with the traditional classification task based on a single image, an image set contains more complementary information, which is of great benefit for correctly classifying a query subject. Thus, image set classification has attracted much attention from researchers. The main challenge, however, is how to effectively represent an image set so as to fully exploit its latent discriminative features. Unlike previous works, in which an image set was represented by a single or a hybrid model, in this paper we propose a novel multi-model fusion method spanning the Euclidean space and the Riemannian manifold to jointly accomplish dimensionality reduction and metric learning. To achieve this, we first introduce three distance metric learning models, namely, Euclidean-Euclidean, Riemannian-Riemannian and Euclidean-Riemannian, to better exploit the complementary information of an image set. We then simultaneously learn two dimensionality-reducing mappings and a metric matrix by integrating the two heterogeneous spaces (i.e., the Euclidean space and the Riemannian manifold) into a common induced Mahalanobis space in which within-class data sets are close and between-class data sets are separated. This strategy effectively addresses a severe drawback of existing set based methods, which perform dimensionality reduction without considering the distance metric. Furthermore, to learn a complete Mahalanobis metric, we adopt an L2,1 regularized metric matrix for optimal feature selection and classification. The results of extensive experiments on face recognition, object classification, gesture recognition and handwritten digit classification demonstrate the effectiveness of the proposed method compared with other image set based algorithms.

Keywords: Image set classification, feature learning, kernel, dimensionality reduction, metric learning, heterogeneous space fusion


1. Introduction

Classification is one of the most important research topics in the fields of computer vision and machine learning. Notably, numerous state-of-the-art methods based on single-image classification have been proposed [1–3]. However, with the rapid development of computer and image processing technologies, it has become convenient to acquire many images of a subject in real-world applications, including video surveillance, personal photo albums and camera networks. Thus, learning from image sets has emerged as a crucial new research topic [4–9]. Image set classification provides more information with which to handle the typical appearance variations within images, including variations in illumination, viewpoint changes, and occlusions. Generally, a set based classification task comprises two key parts, feature extraction and image classification, which are illustrated in Fig. 1. Feature extraction is the key problem; its purpose is to extract discriminative features that exploit the rich information within the image sets. Image classification then aims to design effective classifiers/models to classify different query subjects. Based on these two phases, a series of nonparametric set modeling methods based on subspaces [10–12], manifolds [5, 13], affine/convex hulls [4, 14, 15], etc., have been proposed to deal with different visual classification tasks. Another line of research uses parametric statistical models [7, 9, 16, 17] to represent an image set. Specifically, some previous works adopt a single Gaussian model [16] or a Gaussian Mixture Model (GMM) [17] to precisely characterize the variations within the images in a set. The dissimilarity between two sets can then be measured with the Kullback-Leibler (K-L) divergence. Furthermore, some single-image based algorithms [18–20] have been extended to handle the image set classification problem; they eventually adopt a majority voting strategy to classify a query set. Since the performance of a classifier largely relies on the quality of its features, the main challenge in image set classification is constructing effective model representations that are invariant and robust to many real-world variations.

Figure 1: The input/output of the image set classification case and the key problem. Previous works model a set with a single model (AHISD, MMD, DLRC, DCC, CDL, ...) or a hybrid model (LMKML, CERML); our fusion model exploits the latent information across different spaces to classify the label of an image set.

As shown in Fig. 1, the latent information across different spaces can be exploited in our fusion model. This is superior to previous works because the traditional methods generally represent an image set by a single mode or a hybrid mode, and thus do not consider the complementary correlation across different spaces. To be specific, Fig. 2 illustrates distance metrics based on different model representations. It has been shown in other studies [7, 21–24] that different modes represent an image set from different perspectives; specifically, the mean vector and the covariance matrix reflect two different statistical features of an image set, and these features provide complementary information for fully representing the target set. As the mean vector is a representation point in the Euclidean space and the covariance matrix essentially lies on a specific Riemannian manifold, we can naturally define distance metrics based on Euclidean-to-Euclidean and Riemannian-to-Riemannian metric learning in the corresponding spaces. Furthermore, to better exploit the potential correlation between the different representations of an image set, we incorporate the point-to-set distance metric, formulated as the matching of Euclidean points with Riemannian points, into the image set classification problem, which eventually enhances the classification performance. To describe the framework of fusing these two heterogeneous spaces in a unified formulation, we employ the kernel trick [23, 25, 26] to map the multiple models into a common induced space. Moreover, we adopt a data-dependent manifold kernel to fully exploit the geometric structure of the manifold space by using unlabeled data, which is easily available nowadays.

Figure 2: The widely used distance metrics. For image set classification (a), we can adopt distance metrics based on Euclidean point to Euclidean point, affine hull to affine hull, manifold to manifold, and their hybrid model to compute the dissimilarity between two sets. For single image to set based classification (b), we can obtain the dissimilarity by using the point to set distance metric. Part (c) is our fused Point (Set) to Point (Set) distance metric used to handle the image set classification problem.

As described in [12, 14, 15, 27, 28], no matter how the set is modeled, principal component analysis (PCA) is usually required as a pre-processing tool, as it reduces the computational burden and extracts features that are robust to data noise. Some works [29, 30] consider the distance metric jointly with a dimensionality reduction strategy. For single-image based person re-identification in [29], the intrapersonal and extrapersonal variations of subjects are described by multivariate Gaussian distributions; a joint dimension reduction and metric learning method is then proposed by simultaneously learning a subspace and a restricted quadratic discriminant analysis (RQDA) distance function. However, this method does not work well for image set classification when the image sets satisfy only weak distributional assumptions. Harandi et al. [30] addressed feature extraction in a low dimensional representation from a geometric perspective, which incurs a large storage burden and high computational complexity. In this work, our new multi-model fusion method jointly accomplishes dimensionality reduction and metric learning to exploit the different latent discriminative features for the image set classification problem.

*Corresponding author
Email addresses: [email protected] (Wenzhu Yan), [email protected] (Quansen Sun), [email protected] (Huaijiang Sun), [email protected] (Yanmeng Li)
Preprint submitted to Elsevier, December 23, 2019
To sum up, our work provides the following contributions: 1) Three low dimensional distance representations (point to point, point to set and set to set) are described to model image sets. 2) We design an efficient joint learning framework that simultaneously learns two dimensionality-reducing mappings and a metric matrix by integrating the two heterogeneous spaces (i.e., the Euclidean space and the Riemannian manifold) into a common induced Mahalanobis space in which within-class data sets are close and between-class data sets are separated. To make our model optimal for feature selection and classification, we regularize the metric matrix with the L2,1 norm to impose a sparsity penalty. 3) A new fast optimization algorithm is developed to solve the resulting nonconvex problem. Extensive experiments on four visual classification tasks demonstrate the effectiveness of the proposed method.

The rest of this paper is structured as follows. In the next section, we give an overview of set based classification methods. Then, we illustrate the proposed method in detail and introduce the global objective function in Section 3. In Section 4, we justify the effectiveness of our proposed method via extensive experiments. Finally, the conclusion is presented in Section 5.


2. Related works


Numerous works have been proposed to deal with the image set based classification task. In this section, we provide an overall review of the related works. Some works model an image set as a linear subspace and then adopt Canonical Correlation Analysis (CCA) [31] to find principal angles, which can be used to calculate the subspace similarity [10, 11]. Methods such as the Mutual Subspace Method (MSM) [10] and the Orthogonal Subspace Method (OSM) [11] have shown promising results. However, for sets with many images and an extensive range of variations, these methods cannot effectively exploit all the information comprised in the images. Based on CCA, Discriminative Canonical Correlations (DCC) [12] calculates the subspace similarity by incorporating discriminative information and thus obtains better results. Some methods use the affine hull or convex hull to represent an image set [4]. For example, Sparse Approximated Nearest Points (SANP) adopts the affine hull model to interpret unseen appearance variations and forces the nearest data points to be close by adding sparsity constraints [14]. Regularized Nearest Points (RNP) [15] represents an image set by a regularized affine hull, which leads to lower model complexity than SANP. To find representative prototypes that better measure the set to set distance, Wang et al. [32] jointly learned the prototypes from the corresponding affine hull and a linear discriminative projection to handle the image set classification problem. Moreover, based on the concept of Linear Regression Classification (LRC) [33] for image reconstruction, a series of methods have been proposed to extend LRC to the set based classification task [6, 34], including Dual Linear Regression Classification (DLRC) [6] and Pairwise Linear Regression Classification (PLRC) [34].
However, for these methods based on the linear regression mechanism, the dimension of the feature vectors should be much larger than the number of images in the combined new sets when calculating the between-set dissimilarity via the distance between the virtual images reconstructed from the original data sets. To improve the classification performance, SJSRC [8] adopts a set-level joint sparse representation model that classifies a query subject by the minimal reconstruction residual. Methods modeling image sets as local linear manifold components can effectively capture the variation information [5, 13, 35, 36]. Manifold-Manifold Distance (MMD) [5] adopts the manifold to manifold distance to measure the set dissimilarity. Manifold Discriminant Analysis (MDA) [13] extends MMD to further exploit the latent discriminative information in a projected low dimensional space. Moreover, from a geometric perspective, Huang et al. [28] represented an image set as a point on a Grassmann manifold and then performed a dimensionality reduction that embeds the original Grassmann manifold into a lower-dimensional manifold space where discriminative features can be naturally exploited. For metric learning, Log-Euclidean Metric Learning (LEML) [37] adopts Symmetric Positive Definite (SPD) matrices to represent the image sets and operates directly on the logarithms of these SPD matrices to measure the distance between sets. Localized Multi-Kernel Metric Learning (LMKML) [22] adopts the different order statistics of an image set to learn a distance metric. However, as these statistics are combined in series directly, the redundancies in LMKML may result in great computational complexity. Huang et al. [23] proposed a metric learning framework to handle the Video-to-Still (Still-to-Video) problem by exploiting the mutual information across the Euclidean and Riemannian manifold spaces; for the Video-to-Video (set to set) problem, however, their framework only adopts a hybrid metric learning method, which lacks the ability to fuse the latent complementary information across the two heterogeneous spaces. Additionally, some statistical models [7, 9, 16, 17] have been proposed to effectively model image sets, including the single Gaussian model [16] and the GMM [9, 17, 31]. The between-set dissimilarity is eventually measured with the KL-divergence. However, these methods typically suffer when the query image set has weak statistical correlations with the training sets, which leads to larger fluctuations in performance.
By modeling an image set with its covariance matrix, Covariance Discriminative Learning (CDL) [38] conducts kernel discriminative analysis to address complex data distributions. Recently, deep learning has shown its potential to tackle the image set based classification task [39, 40]. Hayat et al. [39] adopted a multilayer neural network to obtain the feature representation of an image set and used the minimal representation residual to classify a query set. To explore the discriminative ability of deep networks, Lu et al. [40] mapped the original data manifold into a feature space to enhance the classification performance. Although deep learning based methods have achieved relatively good performance, they require a large number of training image sets and substantial computational resources.


3. The proposed approach


3.1. Problem formulation



In this paper, we focus on exploiting multi-model fusion by jointly learning a lower-dimensional representation of the data sets and a Mahalanobis distance to effectively deal with the image set classification problem. We first introduce Definition 1.

Definition 1. The classical Mahalanobis distance (MD) between two points $x_i$ and $x_j$ is
$$d(x_i, x_j) = \sqrt{(x_i - x_j)^T M (x_i - x_j)}, \qquad (1)$$
where $M$ is the Mahalanobis metric matrix, which is positive definite.

According to Definition 1, when we map the multiple models of a data set into the common space, the pairwise multi-modal distance metric can be described as in Definition 2.

Definition 2. Suppose that $x_i$ and $r_j$ are two modal representations of an image set; then, the multi-modal Mahalanobis distance is
$$d(x_i, r_j) = \sqrt{(f_x(x_i) - f_r(r_j))^T M_{xr} (f_x(x_i) - f_r(r_j))}, \qquad (2)$$
where the two mapping functions $f_x$ and $f_r$ are used to learn the distance metric across the different modalities, and the positive definite matrix $M_{xr}$ is the Mahalanobis matrix.

Given data sets $S$, the Euclidean model of $S$ is $X = [x_1, x_2, \ldots, x_n]$, $x_i \in \mathbb{R}^d$, with labels $l_i \in \{1, 2, \ldots, C\}$, where $C$ is the number of classes. The manifold formulation of $S$ is $R = [r_1, r_2, \ldots, r_n]$, $r_i \in \Omega$, which shares the labels of the Euclidean data. We jointly fuse these two data models of $S$ to fully exploit the structure of the data sets, as they provide complementary information to each other. Our new multi-model fusion method aims to jointly accomplish dimensionality reduction (DR) and metric learning (ML) to exploit the latent discriminative features for image set classification. The flowchart of the proposed method is shown in Fig. 3.
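To make Definition 1 concrete, the following sketch (our own illustration, not the paper's code; all names are hypothetical) evaluates the Mahalanobis distance of Eq. (1) with a positive definite metric matrix:

```python
import numpy as np

def mahalanobis(xi, xj, M):
    """Mahalanobis distance d(xi, xj) = sqrt((xi - xj)^T M (xi - xj)), Eq. (1)."""
    diff = xi - xj
    return float(np.sqrt(diff @ M @ diff))

# Any matrix of the form A^T A + eps*I is positive definite, hence a valid metric matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
M = A.T @ A + 1e-6 * np.eye(3)

xi = np.array([1.0, 0.0, 2.0])
xj = np.array([0.0, 1.0, 2.0])
d = mahalanobis(xi, xj, M)

# With M = I the distance reduces to the ordinary Euclidean distance.
d_euc = mahalanobis(xi, xj, np.eye(3))
```

Choosing M = I recovers the Euclidean metric, which is why the Mahalanobis distance can be read as a learned generalization of it.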

Figure 3: Joint dimensionality reduction and metric learning for image set classification. (The flowchart shows the two models mapped into the RKHS, the graph constructed to capture the data structure, and the joint learning of the projections P_x, P_r and the metric matrix M for DR and ML.)

As shown in Fig. 3, the image sets are represented by different models: Model 1 is represented in the Euclidean space and Model 2 is modeled on the Riemannian manifold. Unlike previous works, whose schemes operate directly on the target data in the original space, we adopt the kernel technique to map the multiple models of the data sets into a Hilbert space to obtain nonlinearly separable high-dimensional information. Specifically, the commonly used Radial Basis Function (RBF) kernel is applied in the Euclidean space, and a data-dependent kernel is adopted in the Riemannian manifold space; the latter uses the unlabeled testing data, organized in a graph, to better exploit the geometric structure of the nonlinear manifold, as this semi-supervised learning strategy strongly penalizes weak statistical correlations between the training and testing manifold representations. Then, we jointly learn two projection matrices performing dimensionality reduction ($P_x$, $P_r$) and a Mahalanobis metric matrix ($M$) by integrating the two heterogeneous Euclidean and Riemannian spaces into a common space. This unified learning framework performs feature extraction directly in a low dimensional subspace, which differs from methods based on PCA pre-processing. The learned distance metric makes within-class sets more compact and between-class samples better separated. Notably, to obtain the basis elements most useful for feature selection, we impose $L_{2,1}$ regularization on the metric matrix to create a sparsity penalty that yields effective feature interpretability. The $L_{2,1}$-norm, which has been used in various fields, is defined as [41]
$$\|M\|_{2,1} = \sum_{i=1}^{d} \sqrt{\sum_{j=1}^{w} m_{ij}^2} = \sum_{i=1}^{d} \|m_i\|_2, \qquad (3)$$
where $M \in \mathbb{R}^{d \times w}$ and $m_i$ is the $i$-th row vector of $M$. In a word, our proposed method integrates dimensionality reduction and sparse feature extraction into a unified framework.

3.2. Joint dimensionality reduction and metric learning (JDRML)

Once we obtain the two heterogeneous models of the target image sets from the Euclidean and Riemannian manifold spaces, we aim to transform them into a common space in which the pairwise model distance metric can be properly described. To achieve this, we adopt the kernel method to obtain high dimensional representations of the two models, introduced as follows. For the Euclidean model, given two points $x_i$ and $x_j$, we adopt the commonly used RBF kernel
$$k(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2). \qquad (4)$$
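The row-sparsity behavior of the $L_{2,1}$-norm of Eq. (3) is easy to check numerically; the sketch below is our own illustration, not the paper's code:

```python
import numpy as np

def l21_norm(M):
    """||M||_{2,1} = sum_i ||m_i||_2, the sum of the l2 norms of the rows, Eq. (3)."""
    return float(np.sum(np.linalg.norm(M, axis=1)))

# Zero rows contribute nothing to the norm, which is why the penalty drives
# whole rows of the metric matrix to zero and thus selects features.
M = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [0.0, 5.0]])
val = l21_norm(M)  # row norms: 5 + 0 + 5 = 10
```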

For the manifold model, given two Riemannian manifold representations $r_i$ and $r_j$, we adopt a manifold based kernel that encodes the intrinsic structure of the data sets:
$$k(r_i, r_j) = \mathrm{tr}(\log(r_i)\log(r_j)). \qquad (5)$$
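The Log-Euclidean kernel of Eq. (5) can be evaluated with an off-the-shelf matrix logarithm. The sketch below is our own illustration: the helper names are hypothetical, and the covariance representation follows the covariance-as-SPD-point modeling described earlier:

```python
import numpy as np
from scipy.linalg import logm

def spd_log_kernel(ri, rj):
    """Log-Euclidean kernel k(ri, rj) = tr(log(ri) log(rj)) of Eq. (5)."""
    return float(np.trace(logm(ri) @ logm(rj)).real)

def covariance_rep(images, eps=1e-6):
    """Represent an image set (n_samples x d) by its regularized covariance
    matrix, i.e., the SPD point used as the Riemannian model of the set."""
    c = np.cov(images, rowvar=False)
    return c + eps * np.eye(c.shape[0])

rng = np.random.default_rng(1)
ra = covariance_rep(rng.standard_normal((50, 4)))
rb = covariance_rep(rng.standard_normal((60, 4)) * 2.0)

k_ab = spd_log_kernel(ra, rb)
k_ba = spd_log_kernel(rb, ra)         # trace is invariant to the product order
k_ii = spd_log_kernel(np.eye(4), ra)  # log(I) = 0, so this kernel value vanishes
```

The small `eps` ridge keeps the covariance strictly positive definite so that the matrix logarithm is well defined even for sets with fewer images than feature dimensions.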

Furthermore, to fully specify the data manifold structure, we use a semi-supervised setting that has access to the available unlabeled data: we construct a fully trusted graph using the manifold distance metric and employ a kernel deformation strategy [25] to derive a new data-dependent kernel as follows:
$$\tilde{k}(r_i, r_j) = k(r_i, r_j) - (\mathbf{k}_{r_i})^T (I + LK)^{-1} L \mathbf{k}_{r_j}, \qquad (6)$$
where $L$ is the graph Laplacian matrix, $I$ is the identity matrix, $k$ is the kernel function in the original reproducing kernel Hilbert space (RKHS), $K$ is the original kernel matrix, $\mathbf{k}_{r_i} = [k(r_i, r_1), \ldots, k(r_i, r_n)]^T$ and $\mathbf{k}_{r_j} = [k(r_j, r_1), \ldots, k(r_j, r_n)]^T$.

On the basis of the above kernels, we obtain implicit nonlinear transformations that respectively map the Euclidean space $\mathbb{R}^d$ and the Riemannian manifold $\Omega$ into two RKHSs. Thereafter, we learn two projection matrices $P_x$ and $P_r$ to obtain lower-dimensional representations that preserve the energy of each model as much as possible, and simultaneously pursue discriminant-function based metric learning, i.e., learning the Mahalanobis matrix $M$ to reflect the class similarity. The Mahalanobis matrix for each task is constrained to be positive semi-definite. Before fusing the different models into a unified framework, we first introduce three Mahalanobis distances on the lower dimensional representations, based on Euclidean-Riemannian, Euclidean-Euclidean, and Riemannian-Riemannian metric learning. The Mahalanobis distance between the Euclidean point $x_i$ and the manifold point $r_j$ can be described as
$$d_1(x_i, r_j) = (P_x^T \phi(x_i) - P_r^T \phi(r_j))^T M (P_x^T \phi(x_i) - P_r^T \phi(r_j)). \qquad (7)$$
Considering that $P_x$ can be rewritten as a linear combination of all training sets in the kernel space of the Euclidean model, i.e., $P_x = \Phi(X) W_x$, and similarly $P_r = \Phi(R) W_r$ for the Riemannian manifold representation, we have
$$P_x^T \phi(x_i) = W_x^T \Phi(X)^T \phi(x_i) = W_x^T K_{x_i}, \qquad (8)$$
$$P_r^T \phi(r_j) = W_r^T \Phi(R)^T \phi(r_j) = W_r^T K_{r_j}. \qquad (9)$$
Thus, Eq. (7) can be rewritten as
$$d_1(x_i, r_j) = (W_x^T K_{x_i} - W_r^T K_{r_j})^T M (W_x^T K_{x_i} - W_r^T K_{r_j}). \qquad (10)$$
Similarly, the Mahalanobis distance between the Euclidean points $x_i$ and $x_j$ in the low-dimensional space can be expressed as
$$d_2(x_i, x_j) = (W_x^T K_{x_i} - W_x^T K_{x_j})^T M (W_x^T K_{x_i} - W_x^T K_{x_j}), \qquad (11)$$
and the Mahalanobis distance between the manifold points $r_i$ and $r_j$ in the low-dimensional space can be written as
$$d_3(r_i, r_j) = (W_r^T K_{r_i} - W_r^T K_{r_j})^T M (W_r^T K_{r_i} - W_r^T K_{r_j}). \qquad (12)$$
The two projection matrices $W_x$ and $W_r$ are learned to obtain lower-dimensional representations that preserve the energy of each model as much as possible. We constrain them as
$$H = \|K_x - W_x^T W_x K_x\|_F^2 + \|K_r - W_r^T W_r K_r\|_F^2. \qquad (13)$$
To learn a latent space whose Mahalanobis distance definitely reflects the class similarity, we adopt the large margin learning criterion, which keeps each input close to its neighbors with the same label and far from inputs with different labels. We express the relation between the Euclidean model and the manifold model by a linear inequality constraint $J_1$, considering the effect of the similarity and dissimilarity constraints. Similarly, $J_2$ and $J_3$ represent the distances for the Euclidean and the manifold models, respectively:
$$J_1 = \sum_{i,j=1,\; l_{x_i} \ne l_{r_j}}^{n} d_1(x_i, r_j) - \sum_{i,j=1,\; l_{x_i} = l_{r_j},\, i \ne j}^{n} d_1(x_i, r_j) \ge 1 - \xi_1, \qquad (14)$$
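Applied to every pair at once, the deformation of Eq. (6) takes the matrix form $\tilde{K} = K - K(I + LK)^{-1}LK$. The sketch below is our own illustration, not the paper's code; the chain graph is a hypothetical stand-in for the fully trusted graph built from the manifold distance metric:

```python
import numpy as np

def deform_kernel(K, L):
    """Data-dependent deformation of Eq. (6) in matrix form:
    K_tilde = K - K (I + L K)^{-1} L K."""
    n = K.shape[0]
    return K - K @ np.linalg.solve(np.eye(n) + L @ K, L @ K)

# Hypothetical chain graph over 4 manifold points (labeled and unlabeled alike).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W          # unnormalized graph Laplacian

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
K = A @ A.T + 1e-3 * np.eye(4)          # a valid positive definite base kernel

K_tilde = deform_kernel(K, L)           # remains a symmetric PSD kernel matrix
```

Because $L$ and $K$ are positive semi-definite, $I + LK$ is invertible and the deformed matrix stays a valid kernel, so it can be plugged into the learning framework in place of $K_r$.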


$$J_2 = \sum_{i,j=1,\; l_{x_i} \ne l_{x_j}}^{n} d_2(x_i, x_j) - \sum_{i,j=1,\; l_{x_i} = l_{x_j},\, i \ne j}^{n} d_2(x_i, x_j) \ge 1 - \xi_2, \qquad (15)$$
$$J_3 = \sum_{i,j=1,\; l_{r_i} \ne l_{r_j}}^{n} d_3(r_i, r_j) - \sum_{i,j=1,\; l_{r_i} = l_{r_j},\, i \ne j}^{n} d_3(r_i, r_j) \ge 1 - \xi_3, \qquad (16)$$

where $\xi_i \ge 0$, $i = 1, \ldots, m$, $m = 3$, are the slack variables. Inspired by the graph constraint, and in order to effectively represent our model, we define the following three matrices:
$$D(i,j) = \begin{cases} 1 & l_{x_i} = l_{r_j},\; i \ne j \\ -1 & l_{x_i} \ne l_{r_j} \\ 0 & \text{else}, \end{cases} \qquad (17)$$
$$D_x(i,j) = \begin{cases} 1 & l_{x_i} = l_{x_j},\; k_1(i,j) \\ -1 & l_{x_i} \ne l_{x_j},\; k_2(i,j) \\ 0 & \text{else}, \end{cases} \qquad (18)$$
$$D_r(i,j) = \begin{cases} 1 & l_{r_i} = l_{r_j},\; k_1(i,j) \\ -1 & l_{r_i} \ne l_{r_j},\; k_2(i,j) \\ 0 & \text{else}, \end{cases} \qquad (19)$$
where $k_1(i,j)$ represents the nearest neighbors belonging to the same class and $k_2(i,j)$ represents those belonging to a different class. Then, we rewrite Eqs. (14)-(16) in matrix formulation as
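A direct implementation of the relation matrix $D$ of Eq. (17) might look as follows; this is our own sketch (the $k_1$/$k_2$ neighbor gating of Eqs. (18)-(19) is omitted for brevity):

```python
import numpy as np

def cross_space_D(labels_x, labels_r):
    """Relation matrix D of Eq. (17): +1 for same-label pairs with i != j,
    -1 for different-label pairs, 0 on same-label diagonal entries."""
    n = len(labels_x)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if labels_x[i] == labels_r[j]:
                D[i, j] = 1.0 if i != j else 0.0
            else:
                D[i, j] = -1.0
    return D

labels = [0, 0, 1, 1]
D = cross_space_D(labels, labels)

# The diagonal degree matrix A'_x used in Eq. (20): A'_x(i,i) = sum_j D(i,j).
A_prime_x = np.diag(D.sum(axis=1))
```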

$$J_1 = -\mathrm{tr}(G_1 M), \quad G_1 = W_x^T K_x A'_x K_x^T W_x + W_r^T K_r A'_r K_r^T W_r - 2 W_x^T K_x D K_r^T W_r, \qquad (20)$$
$$J_2 = -\mathrm{tr}(G_2 M), \quad G_2 = 2 W_x^T K_x A_x K_x^T W_x - 2 W_x^T K_x D_x K_x^T W_x = 2 W_x^T K_x L_x K_x^T W_x, \qquad (21)$$
$$J_3 = -\mathrm{tr}(G_3 M), \quad G_3 = 2 W_r^T K_r A_r K_r^T W_r - 2 W_r^T K_r D_r K_r^T W_r = 2 W_r^T K_r L_r K_r^T W_r, \qquad (22)$$
where $A'_x$, $A'_r$, $A_x$, $A_r$ are diagonal matrices with $A'_x(i,i) = \sum_{j=1}^{n} D(i,j)$, $A'_r(j,j) = \sum_{i=1}^{n} D(i,j)$, $A_x(i,i) = \sum_{j=1}^{n} D_x(i,j)$ and $A_r(i,i) = \sum_{j=1}^{n} D_r(i,j)$. Thus, we define the $L_{2,1}$-norm regularized objective function as follows:
$$\min_{M, W_x, W_r} J = \|M\|_{2,1} + \lambda_1(\|K_x - W_x^T W_x K_x\|_F^2 + \|K_r - W_r^T W_r K_r\|_F^2) + \sum_{i=1}^{m} C_i \xi_i \quad \text{s.t.} \; -\mathrm{tr}(G_i M) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \ldots, m, \; m = 3, \qquad (23)$$
where the $L_{2,1}$-norm is adopted to obtain optimal sparse feature extraction and $\lambda_1$ balances the regularization of the projection matrices $W_x$ and $W_r$. The $\xi_i \ge 0$ $(i = 1, 2, 3)$ are slack variables used to penalize large distances for the three distance metrics based on Euclidean-Riemannian, Euclidean-Euclidean, and Riemannian-Riemannian metric learning described in Eqs. (14)-(16), with corresponding balancing parameters $C_i$.


3.3. Optimization


The optimization problem in Eq. (23) is complicated to solve because it is non-convex. We propose a new fast and simple algorithm that modifies the inequality constraints into equality constraints, which is proved to be an effective strategy for solving the quadratic programming problem in [42]. Thus, we rewrite Eq. (23) as
$$\min_{M, W_x, W_r} J = \|M\|_{2,1} + \lambda_1(\|K_x - W_x^T W_x K_x\|_F^2 + \|K_r - W_r^T W_r K_r\|_F^2) + \sum_{i=1}^{m} C_i \xi_i \quad \text{s.t.} \; -\mathrm{tr}(G_i M) = 1 - \xi_i, \; i = 1, \ldots, m, \; m = 3. \qquad (24)$$
Then, after some modifications, we have
$$\min_{M, W_x, W_r} J = \|M\|_{2,1} - \lambda_1 \mathrm{tr}(W_x^T K_x K_x^T W_x + W_r^T K_r K_r^T W_r) + \mathrm{tr}((C_1 G_1 + C_2 G_2 + C_3 G_3) M). \qquad (25)$$

As the optimization problem in Eq. (25) is not jointly convex in $W_x$, $W_r$, and $M$, we adopt an alternating optimization that solves for one variable with the others fixed. Each sub-problem has a closed-form solution. The optimization steps are shown in Algorithm 1.

Step 1: Learn $M$ with $W_x$ and $W_r$ fixed. Eq. (25) can be rewritten as
$$\min_{M} J = \|M\|_{2,1} + \mathrm{tr}((C_1 G_1 + C_2 G_2 + C_3 G_3) M); \qquad (26)$$
then, we have
$$\min_{M} J = \mathrm{tr}(M^T U M) + \mathrm{tr}((C_1 G_1 + C_2 G_2 + C_3 G_3) M). \qquad (27)$$

As features in different domains usually have sparse correspondences, the matrix $U$ is constrained to be sparse. Then, we have
$$\frac{\partial J}{\partial M} = UM + (C_1 G_1 + C_2 G_2 + C_3 G_3)^T = 0. \qquad (28)$$
Considering that the Mahalanobis matrix $M$ needs to be positive semi-definite, we project $M$ onto the semi-definite cone after every iterative step. $M$ can then be obtained by the procedure in Algorithm 2. We introduce the following Lemma 1 to show that the iterative optimization procedure in Algorithm 2 is convergent.

Lemma 1. Suppose that $f_t^{va}$ is the value of the objective function in Eq. (26) at the $t$-th iteration; after the $(t+1)$-th iteration, we have $f_{t+1}^{va} \le f_t^{va}$.

Proof. We give the proof of Lemma 1 in the Appendix.

Step 2: Learn $W_x$ with $M$ and $W_r$ fixed. Eq. (25) can be rewritten as
$$\min_{W_x} J = \mathrm{tr}(-\lambda_1 W_x^T K_x K_x^T W_x + (C_1 W_x^T K_x A'_x K_x^T W_x - 2 C_1 W_x^T K_x D K_r^T W_r + 2 C_2 W_x^T K_x L_x K_x^T W_x) M); \qquad (29)$$
then,
$$\frac{\partial J}{\partial W_x} = -\lambda_1 K_x K_x^T W_x + (C_1 K_x A'_x K_x^T + 2 C_2 K_x L_x K_x^T) W_x M - 2 C_1 K_x D K_r^T W_r M = 0, \qquad (30)$$
and we have
$$W_x M - \lambda_1 (C_1 K_x A'_x K_x^T + 2 C_2 K_x L_x K_x^T)^{-1} K_x K_x^T W_x = 2 C_1 (C_1 K_x A'_x K_x^T + 2 C_2 K_x L_x K_x^T)^{-1} K_x D K_r^T W_r M. \qquad (31)$$
Eq. (31) is typically a Sylvester equation.

Step 3: Learn $W_r$ with $M$ and $W_x$ fixed. Eq. (25) can be rewritten as
$$\min_{W_r} J = \mathrm{tr}(-\lambda_1 W_r^T K_r K_r^T W_r + (C_1 W_r^T K_r A'_r K_r^T W_r - 2 C_1 W_r^T K_r D K_x^T W_x + 2 C_3 W_r^T K_r L_r K_r^T W_r) M); \qquad (32)$$
then,
$$\frac{\partial J}{\partial W_r} = -\lambda_1 K_r K_r^T W_r + (C_1 K_r A'_r K_r^T + 2 C_3 K_r L_r K_r^T) W_r M - 2 C_1 K_r D K_x^T W_x M = 0, \qquad (33)$$
and we have
$$W_r M - \lambda_1 (C_1 K_r A'_r K_r^T + 2 C_3 K_r L_r K_r^T)^{-1} K_r K_r^T W_r = 2 C_1 (C_1 K_r A'_r K_r^T + 2 C_3 K_r L_r K_r^T)^{-1} K_r D K_x^T W_x M. \qquad (34)$$
Eq. (34) is typically a Sylvester equation.
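Since Eqs. (31) and (34) have the Sylvester form $W M - A W = B$, they can be handed to a standard solver. The sketch below is only an illustration of the solve step with random stand-ins for $A$, $B$ and $M$, not the paper's code:

```python
import numpy as np
from scipy.linalg import solve_sylvester

def update_Wx(A, B, M):
    """Solve W M - A W = B for W.  solve_sylvester(a, b, q) solves
    a X + X b = q, so we pass a = -A and b = M."""
    return solve_sylvester(-A, M, B)

rng = np.random.default_rng(3)
n, r = 6, 3
A = rng.standard_normal((n, n))        # stand-in for lam1 * inv(...) Kx Kx^T
B = rng.standard_normal((n, r))        # stand-in for the right-hand side of Eq. (31)
M = np.diag([1.0, 2.0, 3.0])           # stand-in for the current metric matrix

Wx = update_Wx(A, B, M)
residual = Wx @ M - A @ Wx - B         # ~0 when the solve succeeds
```

The solve is unique whenever $A$ and $M$ share no eigenvalues, which holds generically for the data-dependent matrices involved here.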

Algorithm 1 JDRML
Input: K_x, K_r, D, D_x, D_r, the tradeoff parameters λ1, C1, C2, C3, iteration number t1.
Output: M, W_x, W_r.
Initialize M, W_x, W_r.
repeat
  Step 1: Fix W_x, W_r and solve M via Eq. (28).
  Step 2: Fix M, W_r and solve W_x via Eq. (31).
  Step 3: Fix M, W_x and solve W_r via Eq. (34).
until t1 is reached.

Algorithm 2 The procedure to obtain M
Initialize iteration number t2.
repeat
  Update M_{t+1} via
    Step 1: M = −U^{-1}(C1 G1 + C2 G2 + C3 G3)^T.
    Step 2: Project M onto the semi-definite cone by computing the eigen-decomposition of M.
  Update U_{ii} = 1/(2‖m_i‖_2)
until converged.
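Our reading of Algorithm 2 can be sketched as follows. Note that the reweighting $U_{ii} = 1/(2\|m_i\|_2)$ is the standard $L_{2,1}$ identity; the garbled print makes the exact constant an assumption on our part, and all names are hypothetical:

```python
import numpy as np

def solve_M(G, n_iter=50, eps=1e-8):
    """Sketch of Algorithm 2: alternate M = -U^{-1} G^T (Eq. (28)),
    project M onto the PSD cone, then reweight U_ii = 1 / (2 ||m_i||_2).
    G stands for C1*G1 + C2*G2 + C3*G3."""
    d = G.shape[0]
    U = np.eye(d)
    M = np.zeros((d, d))
    for _ in range(n_iter):
        M = -np.linalg.solve(U, G.T)                 # M = -U^{-1} G^T
        # project onto the semi-definite cone: clip negative eigenvalues
        w_eig, V = np.linalg.eigh((M + M.T) / 2.0)
        M = (V * np.clip(w_eig, 0.0, None)) @ V.T
        row_norms = np.linalg.norm(M, axis=1)
        U = np.diag(1.0 / (2.0 * np.maximum(row_norms, eps)))
    return M

rng = np.random.default_rng(4)
A0 = rng.standard_normal((4, 4))
G = -(A0 @ A0.T)       # chosen so -G^T is PSD, giving a well-behaved toy problem
M = solve_M(G)
w = np.linalg.eigvalsh(M)
```

The `eps` floor guards against division by zero when a row of M collapses to exactly zero, which is precisely the row-sparse solution the $L_{2,1}$ penalty encourages.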


4. Experimental results

In this section, we present experimental results that evaluate the proposed method against state-of-the-art image set classification methods on four visual classification tasks: face recognition [43–45], object classification [46], gesture recognition [47] and digit classification [48].

4.1. Datasets and parameter settings

For the face recognition task, the challenging YouTube Celebrities (YTC) dataset [43], collected from YouTube, has been widely used to evaluate face recognition in previous works [4–6, 12, 28, 37, 38]. It contains more than 1000 video clips of 47 subjects. The face images in the YTC dataset have large variations in pose, illumination and expression, as well as low resolution. The Extended Yale Face Database B (EYaleB) [44] consists of 16,128 images of 28 classes, with nine face image sets per class. Some face examples from the YTC and EYaleB datasets are shown in Fig. 4(a) and (b), respectively. Moreover, we use an up-to-date version of the COX dataset [45] to evaluate video based face recognition under typical applications such as video surveillance. This dataset contains 1000 different subjects captured with rich variations; each subject has three videos.

Figure 4: Some face examples from (a) YTC and (b) EYaleB, respectively

For the object classification task, we use the benchmark ETH-80 object dataset [46]. It contains 80 object sets from 8 categories: apples, cars, cows, cups, dogs, horses, pears and tomatoes. Each category contains 10 subcategory sets, and the number of images in each set is approximately 41. Some object examples from this dataset are shown in Fig. 5.

Figure 5: Some object images from the ETH-80 dataset

We use the Cambridge Gesture dataset [47] to evaluate the gesture recognition task. This dataset contains nine hand gesture classes. Each class includes 100 image sequences, divided into five illumination conditions and 10 motions performed by each of two subjects. Some examples of these gestures are given in Fig. 6.

Figure 6: Some gesture images from the hand Gesture dataset

For the handwritten digit recognition task, the MNIST dataset [48] contains a total of 70,000 image samples divided into 10 classes, the digits 0 to 9. All the black-and-white digit images are resized to 20×20. Some exemplar images are shown in Fig. 7.

Figure 7: Some exemplar images from the MNIST dataset
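The set-based MNIST protocol used later (Section 4.2) partitions the single-image dataset into per-class image sets. Assuming the sets are formed by randomly chunking each class, this can be sketched as follows (`make_image_sets` is an illustrative helper, not code from the paper):

```python
import numpy as np

def make_image_sets(images, labels, sets_per_class, rng=None):
    """Partition single images into per-class image sets, mirroring the
    set-based protocol used for MNIST (each digit class is split into a
    fixed number of roughly equal-sized subsets).  The helper name and
    signature are illustrative, not from the paper."""
    rng = np.random.default_rng(rng)
    image_sets, set_labels = [], []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)                      # random assignment to subsets
        for chunk in np.array_split(idx, sets_per_class):
            image_sets.append(images[chunk])
            set_labels.append(c)
    return image_sets, np.array(set_labels)

# Toy data: 3 "digit" classes with 200 samples each, 10 sets per class.
X = np.zeros((600, 400))                      # 600 flattened 20x20 images
y = np.repeat(np.arange(3), 200)
sets_, set_y = make_image_sets(X, y, sets_per_class=10, rng=0)
```

With the real MNIST sizes (10 classes, 30 subsets each), the same chunking yields the 300 sets of roughly 200 frames described in Section 4.2.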

To evaluate the classification performance of our proposed method, we compare against the following set-based methods: Discriminant Canonical Correlation Analysis (DCC) [12], Manifold-to-Manifold Distance (MMD) [5], Manifold Discriminant Analysis (MDA) [13], Affine Hull based Image Set Distance (AHISD) [4], Convex Hull based Image Set Distance (CHISD) [4], Sparse Approximated Nearest Points (SANP) [14], Regularized Nearest Points (RNP) [15], Prototype Discriminative Learning (PDL) [32], Covariance Discriminant Learning (CDL) [38], Projection Metric Learning (PML) [28], Discriminant Analysis on the Riemannian manifold of Gaussian distributions (DARG) [7] and Cross Euclidean-to-Riemannian Metric Learning (CERML) [23]. For a fair comparison, the parameters of all methods are tuned empirically. Specifically, we adopt PCA to retain 90% of the energy for DCC [12]. For AHISD, CHISD, SANP, and RNP, all the parameters are set


according to [4], [14], and [15], respectively. For MMD and MDA, we select the number of local linear patches from [5-20]; the other parameters of MDA follow [13]. For CDL, we use kernel LDA, which requires no other parameter configuration [38]. For PML, 95% of the total energy is preserved by PCA [28]. For DARG, since the model based on the Mahalanobis and Log-Euclidean distances achieves the best recognition accuracy in [7], we use this distance metric for all the visual classification tasks in our experiments. The video-to-video classification case of CERML is adopted in these experiments [23]. For the experimental settings of the proposed JDRML¹, λ1 is selected from [0.1-0.3] in steps of 0.05, C1 is selected from [0.2, 0.5, 0.8], and C2 and C3 are tuned from [0.0001, 0.005, 0.01, 0.015, 0.02], respectively. The total number of iterations is set to 15. Other key parameters of these methods are described in the following experiments.

4.2. Experimental results and analysis

For the face recognition task on the YTC and EYaleB datasets, we adopt the Viola-Jones algorithm [49] to extract the face images, which are resized to 30×30 and 20×20, respectively. We follow the experimental configurations provided in [5, 8, 14, 32, 47]: three image sets of each individual are randomly selected for training and the rest for testing. For the COX dataset, each video frame is resized to 32×40 and histogram equalization is applied to reduce lighting effects. We randomly select 300 subjects for training, and the first two videos of the remaining 700 subjects are used for the video-to-video evaluation. The results for the face recognition task are shown in Table 1, from which we can see that our proposed method achieves the best accuracy on all three face recognition datasets.
Specifically, on the YTC and COX datasets, all methods yield relatively low identification rates, as these two datasets include faces under a wide range of variations. Notably, the accuracy of the proposed JDRML reaches 78.1% and 93.44% on YTC and COX, respectively, which indicates that fusing different model representations of the target set into a unified framework helps to fully exploit the mutually complementary information, and that extracting effective sparse features naturally enhances the classification performance. Besides, the convex/affine hull based models (AHISD, CHISD, SANP and RNP) show competitive classification performance compared with the multiple linear subspace methods (MMD and MDA) on the YTC and EYaleB datasets. We can also see that the manifold based models (CDL, PML, DARG and CERML) yield higher recognition rates than the other methods on the COX dataset, because the nonlinear Riemannian or Grassmann manifold structure captures the intrinsic geometric information. Furthermore, we present the Receiver Operating Characteristic (ROC) curves of the different methods on YTC and EYaleB in Fig. 8(a) and (b), respectively. The proposed method clearly outperforms the other methods, producing the highest true positive rate at every false positive rate. To evaluate the classification performance on the ETH-80 dataset, we randomly select half of the 10 object sets per category for training and the rest for testing, and adopt five-fold cross-validation. The comparison between our method and the competing methods is shown in Table 1. As can be seen, methods assuming that image sets lie on a Riemannian manifold (CDL, PML, DARG and CERML) better exploit the latent structure information and thus show relatively good performance. Our method achieves a high classification rate of 94%.
For the hand gesture recognition task, each of the 100 videos per gesture class is divided into five illumination sets (Set1-Set5), and all video frames are resized to 20×20. Following the experimental protocol in [47], we select Set5 for training and the remaining sets (Set1-Set4) for testing. From the results in Table 1, the proposed method outperforms the other methods, while the classification performance of the convex/affine hull based models (AHISD, CHISD, SANP and RNP) degrades significantly, as they are easily deteriorated by outliers. Unlike previous experimental settings on the MNIST database, which are oriented to single-image classification, in this experiment we adopt a set-based strategy for the handwritten digit classification task, giving a more comprehensive insight into image set classification. Specifically, all images are divided into 300 sets: each digit class includes 30 subsets, where each subset contains about 200 frames. Twenty image sets of each class are randomly selected for training and the others for testing. As shown in Table 1, most of the existing set-based methods achieve 100% on this dataset; set-based models thus greatly improve this case of handwritten digit classification.
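All the "mean ± std" numbers reported in Table 1 come from repeating a random train/test split of the sets. A generic harness for that protocol can be sketched as follows; `classify` is a stand-in for any of the compared set-based classifiers:

```python
import numpy as np

def evaluate_protocol(set_labels, n_train_per_class, classify,
                      n_trials=10, seed=0):
    """Repeat the random-split protocol used above: per trial, pick
    n_train_per_class sets of each class for training, test on the rest,
    and report mean and std accuracy over the trials.  `classify` must map
    (train_idx, test_idx) to predicted labels for the test sets; it is a
    placeholder for any of the compared set-based classifiers."""
    rng = np.random.default_rng(seed)
    set_labels = np.asarray(set_labels)
    accs = []
    for _ in range(n_trials):
        train_idx = np.concatenate([
            rng.choice(np.flatnonzero(set_labels == c),
                       n_train_per_class, replace=False)
            for c in np.unique(set_labels)])
        test_idx = np.setdiff1d(np.arange(len(set_labels)), train_idx)
        pred = classify(train_idx, test_idx)
        accs.append(float(np.mean(pred == set_labels[test_idx])))
    return float(np.mean(accs)), float(np.std(accs))

# Sanity check with an oracle classifier: accuracy must be 1.0 +/- 0.0.
labels = np.repeat(np.arange(5), 9)           # e.g. 9 sets per class
mean_acc, std_acc = evaluate_protocol(labels, 3, lambda tr, te: labels[te])
```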


¹ https://github.com/zhuyeqingma/myJDRML


Table 1: Performance of all methods on different datasets (%)

Method  Year  YTC       EYaleB    COX    Gesture  ETH80     MNIST
DCC     2007  68.5±7.4  86.4±8.7  62.53  64.7     85.3±6.9  99.2±4.87
MMD     2008  57.3±7.9  71.9±7.1  38.29  58.1     81.2±6.5  93.8±3.8
MDA     2009  52.3±8.2  56.7±7.4  65.83  21.4     65.4±7.3  84.5±1.2
AHISD   2010  72.5±8.8  82.0±6.5  53.03  18.1     71.0±8.7  72.0±2.1
CHISD   2010  70.9±8.4  80.0±7.2  56.90  18.3     68.0±7.2  99.7±5.1
SANP    2012  71.4±6.5  82.0±6.1  57.82  22.4     67.0±6.3  99.8±3.2
RNP     2013  71.5±6.4  83.4±5.8  58.07  35.6     67.0±6.4  100±0.0
CDL     2012  74.1±8.2  89.1±3.4  78.43  73.4     88.3±6.2  100±0.0
PML     2015  72.7±7.6  84.7±4.5  71.27  83.2     86.0±6.5  100±0.0
DARG    2017  76.4±8.1  88.7±6.5  83.71  31.2     84.4±7.2  100±0.0
PDL     2017  71.7±8.6  89.5±5.5  65.8   21.1     73.0±6.2  99.7±2.7
CERML   2018  76.6±7.8  90.1±5.2  90.31  83.7     85.0±2.5  99.8±6.7
JDRML   -     78.1±7.5  93.6±3.6  93.44  84.6     94.0±2.2  100±0.0

Figure 8: ROC curves on (a) YTC and (b) EYaleB, respectively (TPR vs. FPR for DCC, AHISD, CHISD, MMD, MDA, SANP, RNP, CDL, CERML and JDRML)

Table 2: Classification performance (%) of JDRML and JDRML-DK on different datasets

Method    YTC       EYaleB    COX    Gesture  ETH80     MNIST
JDRML     78.1±7.5  93.6±3.6  93.44  84.6     94.0±2.2  100
JDRML-DK  79.8±7.9  95.2±4.3  93.56  85.2     95.7±3.4  100

Table 3: The classification performance and runtime of the set-based methods on the Gesture dataset (accuracy, with runtime in parentheses)

          N = 270                                     N = 450
Method    p=10          p=50          p=80          p=10          p=50          p=80
PCA+DCC   0.72(21.3s)   0.80(51.2s)   0.84(83.6s)   0.74(28.8s)   0.85(77.6s)   0.90(134.9s)
PCA+MMD   0.15(0.12s)   0.14(0.12s)   0.18(0.14s)   0.18(0.18s)   0.16(0.17s)   0.28(0.21s)
PCA+RNP   0.41(0.011s)  0.42(0.02s)   0.45(0.05s)   0.41(0.09s)   0.43(0.12s)   0.52(0.17s)
DARG      0.58(101.4s)  0.60(110.9s)  0.62(118.8s)  0.53(82.5s)   0.64(120.2s)  0.67(132.5s)
CERML     0.35(22.5s)   0.43(58.8s)   0.47(91.1s)   0.46(22.1s)   0.61(50.2s)   0.56(87.8s)
JDRML     -             0.82(59.7s)   -             -             0.88(98.1s)   -


To fully explore the manifold structure, we use the unlabeled testing data sets to construct the data-dependent kernel, setting the number of neighbors in Eq. (6) to 6. The comparison between the proposed JDRML and JDRML with the data-dependent manifold kernel (JDRML-DK) is shown in Table 2. JDRML-DK shows the best classification performance, as the semi-supervised learning strategy that exploits the unlabeled testing samples strongly penalizes weak statistical correlations between the training and testing manifold points.
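Eq. (6) is not reproduced in this excerpt, but one common way to build such a data-dependent kernel is to modulate a base kernel by a k-nearest-neighbour graph computed over all labelled and unlabelled points. A sketch under that assumption (the blending rule and `alpha` are ours, not the paper's):

```python
import numpy as np

def knn_adjacency(K, k=6):
    """Symmetric k-nearest-neighbour adjacency built from a base kernel
    matrix K (a larger kernel value means a closer neighbour)."""
    n = K.shape[0]
    A = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(K[i])[::-1]        # strongest neighbours first
        nbrs = [j for j in order if j != i][:k]
        A[i, nbrs] = 1.0
    return np.maximum(A, A.T)                 # symmetrise the graph

def data_dependent_kernel(K, k=6, alpha=0.5):
    """Illustrative data-dependent kernel: blend the base kernel with a
    neighbourhood-masked copy so that weakly connected pairs (e.g. between
    training and unlabeled testing points) are down-weighted.  This is an
    assumed construction, not the paper's Eq. (6)."""
    A = knn_adjacency(K, k)
    return (1.0 - alpha) * K + alpha * (K * A)

# Toy base kernel (Gaussian) over random points, labeled and unlabeled mixed.
rng = np.random.default_rng(0)
P = rng.standard_normal((10, 2))
D2 = ((P[:, None, :] - P[None, :, :]) ** 2).sum(-1)
Kbase = np.exp(-D2)
Kd = data_dependent_kernel(Kbase, k=3)
```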


4.3. Comparison of different features and runtime


In this section, we analyze the classification performance of the set-based methods for different feature dimensions. For DARG, CERML and JDRML, the image sets are embedded into an RKHS by a non-linear transformation, capturing the non-linearly separable high-dimensional information of the original set data. Unlike DARG and CERML, the proposed JDRML learns the low-dimensional representation together with metric learning in a unified framework. We conduct experiments to evaluate the role of dimensionality reduction for kernelized data sets and to analyze the classification performance of the PCA-based subspace model (PCA+DCC), the convex hull model (PCA+RNP) and the local linear subspace manifold method (PCA+MMD). In the experiments, we set the number of training image sets to 270 and 450, and the feature dimension p is reduced to the three values shown in Table 3. The recognition accuracies and computational times on the Gesture dataset are reported in Table 3. We can see that the PCA-based set classification methods run faster at lower feature dimensions, while their recognition performance is clearly affected by the selected dimension. Our proposed joint learning method outperforms most of the other set-based classification methods with comparable runtimes.
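The PCA step used by the PCA+DCC / PCA+MMD / PCA+RNP baselines reduces each frame's feature vector to p dimensions before the set model is built. A minimal SVD-based sketch:

```python
import numpy as np

def pca_reduce(X, p):
    """Reduce the rows of X (n_frames x dim) to p principal dimensions,
    as done before applying DCC / MMD / RNP in the baselines above
    (p = 10, 50 or 80 in Table 3)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Right singular vectors of the centred data are the principal axes.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:p].T                               # dim x p projection basis
    return Xc @ W, W, mean

rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 40))        # 100 frames, 40-dim features
Z, W, mu = pca_reduce(frames, p=10)
```

The returned basis W and mean must be reused to project test frames so that training and testing sets live in the same p-dimensional space.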


4.4. Parameter analysis


4.4.1. Convergence analysis
Theoretically, our objective function in Eq. (25) is convex in each variable when the others are fixed. We use the YTC and ETH-80 data sets as examples to illustrate the optimization process of our method, plotting the objective function value against the number of iterations in Fig. 9 under five-fold cross-validation. From Fig. 9, we can see that the objective value becomes stable after several iterations and that our method is, to some extent, insensitive to the number of iterations.

4.4.2. Performance analysis with different parameter settings
In this section, we evaluate the parameter sensitivity of the proposed JDRML on three datasets: YTC, ETH80 and Gesture. The parameter λ1 is selected from [0.1-0.3] in steps of 0.05 and is used to balance the contribution of the

Figure 9: Convergence curves of JDRML on (a) YTC and (b) ETH80 datasets (objective function value vs. the number of iterations)

regularization terms on the projection matrices Wx and Wr. We show the recognition accuracies for different values of λ1 on the three datasets in Fig. 10, from which we can see that the recognition accuracies on the different datasets are relatively stable across λ1.

Figure 10: Recognition accuracy of JDRML on (a) YTC, (b) ETH80 and (c) Gesture datasets with different λ1


We then conduct experiments to evaluate the remaining three regularization parameters C1, C2 and C3, which balance the contributions of the three parts (G1, G2 and G3) in Eq. (25). C1 is selected from [0.2, 0.5, 0.8]. With C1 fixed, we obtain the recognition performance for different values of C2 and C3, tuned from [0.0001, 0.005, 0.01, 0.015, 0.02], on the three datasets in Figs. 11-13, where the color bar indicates recognition performance. From Figs. 11-13, cross-validating C2 and C3 on the three datasets lets us select the optimal parameter settings for the best classification performance. For the three different visual classification tasks in our experiments, we set C1 to 0.5, as it shows relatively better classification performance in Figs. 11(b)-13(b).
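The tuning procedure described above amounts to a plain grid search over the reported parameter ranges. A generic sketch, where `evaluate` stands in for one cross-validated run of JDRML:

```python
import itertools

def grid_search(evaluate):
    """Sweep the parameter grids reported above (lambda1 in steps of 0.05,
    C1 in {0.2, 0.5, 0.8}, C2 and C3 on a shared 5-point grid) and keep the
    best-scoring combination.  `evaluate` stands in for a cross-validated
    run of JDRML and must return a scalar accuracy."""
    lambda1_grid = [0.10, 0.15, 0.20, 0.25, 0.30]
    c1_grid = [0.2, 0.5, 0.8]
    c23_grid = [0.0001, 0.005, 0.01, 0.015, 0.02]
    best_score, best_params = float("-inf"), None
    for lam, c1, c2, c3 in itertools.product(lambda1_grid, c1_grid,
                                             c23_grid, c23_grid):
        score = evaluate(lam, c1, c2, c3)
        if score > best_score:
            best_score, best_params = score, (lam, c1, c2, c3)
    return best_params, best_score

# Toy objective peaking at lambda1 = 0.2, C1 = 0.5 for demonstration.
params, score = grid_search(
    lambda lam, c1, c2, c3: -abs(lam - 0.2) - abs(c1 - 0.5))
```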


5. Conclusion



In this study, in order to effectively address the image set classification problem, we focus on a multi-model fusion representation, which provides an effective strategy for fully extracting the latent discriminative features. To this end, we first employ the kernel trick to map the different representations of the image sets (in the Euclidean space and on the Riemannian manifold) into a reproducing kernel Hilbert space, after which three Mahalanobis distance metric learning models are given. We then jointly learn two projection matrices and a metric matrix by integrating the two heterogeneous spaces, the Euclidean space and the Riemannian manifold, into a common induced

Figure 11: Recognition performance for different values of C2 and C3 (with C1 = 0.2, 0.5 and 0.8) on the YTC dataset

Figure 12: Recognition performance for different values of C2 and C3 (with C1 = 0.2, 0.5 and 0.8) on the ETH80 dataset

Figure 13: Recognition performance for different values of C2 and C3 (with C1 = 0.2, 0.5 and 0.8) on the Gesture dataset


space in which the energy of each model is preserved as much as possible and the class similarity is reflected. Moreover, we adopt the L2,1 norm to achieve sparse feature learning. Finally, we conduct extensive experiments on different visual classification tasks to evaluate the classification performance of JDRML. The experimental results clearly indicate that the proposed joint learning method outperforms the other state-of-the-art set-based classification methods and needs fewer iterations to converge. Although our method has achieved promising results, some aspects still deserve future study. First, the proposed model learns only a single metric matrix, which may not be powerful enough to exploit the specific information of each modal representation; a multi-metric learning method could be adopted to learn multiple model-specific metric matrices in the resulting space. Second, since a valid kernel parameter is generally difficult to select, we are motivated to learn an optimal kernel from a set of base kernels using multiple kernel learning techniques.


Acknowledgment


This work was supported by the National Natural Science Foundation of China (Projects No. 61673220 and No. 61772272).


References


[1] M. J. Lyons, J. Budynek, S. Akamatsu, Automatic classification of single facial images, IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (12) (1999) 1357–1362.
[2] M. Korytkowski, L. Rutkowski, R. Scherer, Fast image classification by boosting fuzzy classifiers, Information Sciences 327 (2016) 175–182.
[3] C. Zhang, J. Cheng, Y. Zhang, J. Liu, C. Liang, J. Pang, Q. Huang, Q. Tian, Image classification using boosted local features with random orientation and location selection, Information Sciences 310 (2015) 118–129.
[4] H. Cevikalp, B. Triggs, Face recognition based on image sets, in: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2010, pp. 2567–2573.
[5] R. Wang, S. Shan, X. Chen, W. Gao, Manifold-manifold distance with application to face recognition based on image set, in: 2008 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2008, pp. 1–8.
[6] L. Chen, Dual linear regression based classification for face cluster recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2673–2680.
[7] W. Wang, R. Wang, Z. Huang, S. Shan, X. Chen, Discriminant analysis on Riemannian manifold of Gaussian distributions for face recognition with image sets, IEEE Transactions on Image Processing 27 (1) (2018) 151–163.
[8] P. Zheng, Z.-Q. Zhao, J. Gao, X. Wu, A set-level joint sparse representation for image set classification, Information Sciences 448 (2018) 75–90.
[9] M. Harandi, M. Salzmann, M. Baktashmotlagh, Beyond Gauss: Image-set matching on the Riemannian manifold of PDFs, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4112–4120.
[10] O. Yamaguchi, K. Fukui, K.-i. Maeda, Face recognition using temporal image sequence, in: Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, IEEE, 1998, pp. 318–323.
[11] E. Oja, Subspace Methods of Pattern Recognition, Pattern Recognition and Image Processing Series, Vol. 6, Research Studies Press, 1983.
[12] T.-K. Kim, J. Kittler, R. Cipolla, Discriminative learning and recognition of image set classes using canonical correlations, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (6) (2007) 1005–1018.
[13] R. Wang, X. Chen, Manifold discriminant analysis, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2009, pp. 429–436.
[14] Y. Hu, A. S. Mian, R. Owens, Sparse approximated nearest points for image set classification, in: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2011, pp. 121–128.
[15] M. Yang, P. Zhu, L. Van Gool, L. Zhang, Face recognition based on regularized nearest points between image sets, in: 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), IEEE, 2013, pp. 1–7.
[16] G. Shakhnarovich, J. W. Fisher, T. Darrell, Face recognition from long-term observations, in: European Conference on Computer Vision, Springer, 2002, pp. 851–865.
[17] O. Arandjelovic, G. Shakhnarovich, J. Fisher, R. Cipolla, T. Darrell, Face recognition with image sets using manifold density divergence, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, IEEE, 2005, pp. 581–588.
[18] M. Zhang, R. He, D. Cao, Z. Sun, T. Tan, Simultaneous feature and sample reduction for image-set classification, in: AAAI, Vol. 16, 2016, pp. 1401–1407.
[19] S. A. Shah, U. Nadeem, M. Bennamoun, F. Sohel, R. Togneri, Efficient image set classification using linear regression based image reconstruction, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 99–108.
[20] S. A. A. Shah, M. Bennamoun, F. Boussaid, Iterative deep learning for image set based face and object recognition, Neurocomputing 174 (2016) 866–874.
[21] Z. Huang, R. Wang, S. Shan, X. Chen, Face recognition on large-scale video in the wild with hybrid Euclidean-and-Riemannian metric learning, Pattern Recognition 48 (10) (2015) 3113–3124.
[22] J. Lu, G. Wang, P. Moulin, Localized multifeature metric learning for image-set-based face recognition, IEEE Transactions on Circuits and Systems for Video Technology 26 (3) (2016) 529–540.


[23] Z. Huang, R. Wang, S. Shan, L. Van Gool, X. Chen, Cross Euclidean-to-Riemannian metric learning with application to face recognition from video, IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (12) (2018) 2827–2840.
[24] X. Gao, Q. Sun, H. Xu, D. Wei, J. Gao, Multi-model fusion metric learning for image set classification, Knowledge-Based Systems 164 (2019) 253–264.
[25] Y. Wu, Y. Jia, P. Li, J. Zhang, J. Yuan, Manifold kernel sparse representation of symmetric positive-definite matrices and its applications, IEEE Transactions on Image Processing 24 (11) (2015) 3729–3741.
[26] G. Feng, H. Li, J. Dong, J. Zhang, Face recognition based on Volterra kernels direct discriminant analysis and effective feature classification, Information Sciences 441 (2018) 187–197.
[27] P. Zheng, Z.-Q. Zhao, J. Gao, X. Wu, Image set classification based on cooperative sparse representation, Pattern Recognition 63 (2017) 206–217.
[28] Z. Huang, R. Wang, S. Shan, X. Chen, Projection metric learning on Grassmann manifold with application to video based face recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 140–149.
[29] S. Liao, Y. Hu, S. Z. Li, Joint dimension reduction and metric learning for person re-identification, arXiv preprint arXiv:1406.4216 (2014).
[30] M. Harandi, M. Salzmann, R. Hartley, Joint dimensionality reduction and metric learning: A geometric take, in: Proceedings of the 34th International Conference on Machine Learning, Vol. 70, JMLR.org, 2017, pp. 1404–1413.
[31] H. Hotelling, Relations between two sets of variates, in: Breakthroughs in Statistics, Springer, 1992, pp. 162–190.
[32] W. Wang, R. Wang, S. Shan, X. Chen, Prototype discriminative learning for face image set classification, in: Asian Conference on Computer Vision, Springer, 2016, pp. 344–360.
[33] I. Naseem, R. Togneri, M. Bennamoun, Linear regression for face recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (11) (2010) 2106–2112.
[34] Q. Feng, Y. Zhou, R. Lan, Pairwise linear regression classification for image set retrieval, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4865–4872.
[35] S. Chen, C. Sanderson, M. T. Harandi, B. C. Lovell, Improved image set classification via joint sparse approximated nearest subspaces, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 452–459.
[36] H. Hu, Sparse discriminative multimanifold Grassmannian analysis for face recognition with image sets, IEEE Transactions on Circuits and Systems for Video Technology 25 (10) (2015) 1599–1611.
[37] Z. Huang, R. Wang, S. Shan, X. Li, X. Chen, Log-Euclidean metric learning on symmetric positive definite manifold with application to image set classification, in: International Conference on Machine Learning, 2015, pp. 720–729.
[38] R. Wang, H. Guo, L. S. Davis, Q. Dai, Covariance discriminative learning: A natural and efficient approach to image set classification, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2012, pp. 2496–2503.
[39] M. Hayat, M. Bennamoun, S. An, Deep reconstruction models for image set classification, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (4) (2015) 713–727.
[40] J. Lu, G. Wang, W. Deng, P. Moulin, J. Zhou, Multi-manifold deep metric learning for image set classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1137–1145.
[41] F. Nie, H. Huang, X. Cai, C. H. Ding, Efficient and robust feature selection via joint L2,1-norms minimization, in: Advances in Neural Information Processing Systems, 2010, pp. 1813–1821.
[42] M. A. Kumar, M. Gopal, Least squares twin support vector machines for pattern classification, Expert Systems with Applications 36 (4) (2009) 7535–7543.
[43] M. Kim, S. Kumar, V. Pavlovic, H. Rowley, Face tracking and recognition with visual constraints in real-world videos, in: 2008 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2008, pp. 1–8.
[44] A. S. Georghiades, P. N. Belhumeur, D. J. Kriegman, From few to many: Illumination cone models for face recognition under variable lighting and pose, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (6) (2001) 643–660.
[45] Z. Huang, S. Shan, R. Wang, H. Zhang, S. Lao, A. Kuerban, X. Chen, A benchmark and comparative study of video-based face recognition on COX face database, IEEE Transactions on Image Processing 24 (12) (2015) 5967–5981.
[46] B. Leibe, B. Schiele, Analyzing appearance and contour based methods for object categorization, in: 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2, IEEE, 2003, pp. II–409.
[47] T.-K. Kim, R. Cipolla, Canonical correlation analysis of video volume tensors for action categorization and detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (8) (2009) 1415–1428.
[48] L. Deng, The MNIST database of handwritten digit images for machine learning research [best of the web], IEEE Signal Processing Magazine 29 (6) (2012) 141–142.
[49] P. Viola, M. J. Jones, Robust real-time face detection, International Journal of Computer Vision 57 (2) (2004) 137–154.


Appendix


Proof of Lemma 1: In order to prove the convergence of the optimization problem in Eq. (26), we first introduce Lemma 2:

Lemma 2. For any two non-zero vectors $u$ and $v$, the following inequality holds:
$$\|u\|_2 - \frac{\|u\|_2^2}{2\|v\|_2} \le \|v\|_2 - \frac{\|v\|_2^2}{2\|v\|_2}. \tag{35}$$

The detailed proof of Lemma 2 is similar to that in [41].

Suppose that $M_{t+1}$ is obtained in Algorithm 2 by solving
$$M_{t+1} = \arg\min_{M}\; \operatorname{tr}(M^T U M) + \operatorname{tr}\big((C_1 G_1 + C_2 G_2 + C_3 G_3) M\big), \tag{36}$$
where $U$ is the diagonal matrix with $U_{ii} = \frac{1}{2\|m_t^i\|_2}$ and $m_t^i$ denotes the $i$-th row of $M_t$. Then, by the optimality of $M_{t+1}$,
$$J(M_{t+1}) \le J(M_t), \tag{37}$$
where $J(M)$ denotes the objective of Eq. (36), i.e.,
$$\operatorname{tr}(M_{t+1}^T U M_{t+1}) + \operatorname{tr}\big((C_1 G_1 + C_2 G_2 + C_3 G_3) M_{t+1}\big) \le \operatorname{tr}(M_t^T U M_t) + \operatorname{tr}\big((C_1 G_1 + C_2 G_2 + C_3 G_3) M_t\big). \tag{38}$$

Substituting the definition of $U$, we can obtain
$$\sum_k \frac{\|m_{t+1}^k\|_2^2}{2\|m_t^k\|_2} + \operatorname{tr}\big((C_1 G_1 + C_2 G_2 + C_3 G_3) M_{t+1}\big) \le \sum_k \frac{\|m_t^k\|_2^2}{2\|m_t^k\|_2} + \operatorname{tr}\big((C_1 G_1 + C_2 G_2 + C_3 G_3) M_t\big). \tag{39}$$

According to Lemma 2, for each $k$ we have
$$\|m_{t+1}^k\|_2 - \frac{\|m_{t+1}^k\|_2^2}{2\|m_t^k\|_2} \le \|m_t^k\|_2 - \frac{\|m_t^k\|_2^2}{2\|m_t^k\|_2}. \tag{40}$$

Summing Eq. (40) over $k$, the following inequality holds:
$$\sum_k \|m_{t+1}^k\|_2 - \sum_k \frac{\|m_{t+1}^k\|_2^2}{2\|m_t^k\|_2} \le \sum_k \|m_t^k\|_2 - \sum_k \frac{\|m_t^k\|_2^2}{2\|m_t^k\|_2}. \tag{41}$$

Combining Eq. (39) and Eq. (41), we have
$$\sum_k \|m_{t+1}^k\|_2 + \operatorname{tr}\big((C_1 G_1 + C_2 G_2 + C_3 G_3) M_{t+1}\big) \le \sum_k \|m_t^k\|_2 + \operatorname{tr}\big((C_1 G_1 + C_2 G_2 + C_3 G_3) M_t\big), \tag{42}$$
and, based on the definition of the $L_{2,1}$ norm in Eq. (3), we obtain
$$\|M_{t+1}\|_{2,1} + \operatorname{tr}\big((C_1 G_1 + C_2 G_2 + C_3 G_3) M_{t+1}\big) \le \|M_t\|_{2,1} + \operatorname{tr}\big((C_1 G_1 + C_2 G_2 + C_3 G_3) M_t\big). \tag{43}$$

Therefore, the objective value of Eq. (26) is non-increasing over the iterations, and the convergence of Eq. (26) is proved.
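Lemma 2, the key step of the proof, is easy to sanity-check numerically: the gap between the right and left sides equals $(\|v\|_2 - \|u\|_2)^2 / (2\|v\|_2) \ge 0$. A quick check on random vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
holds = True
for _ in range(1000):
    u = rng.standard_normal(5)
    v = rng.standard_normal(5)
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    # Lemma 2:  ||u|| - ||u||^2 / (2||v||)  <=  ||v|| - ||v||^2 / (2||v||)
    lhs = nu - nu ** 2 / (2.0 * nv)
    rhs = nv - nv ** 2 / (2.0 * nv)
    # The gap rhs - lhs equals (nv - nu)^2 / (2*nv), which is non-negative.
    holds = holds and (lhs <= rhs + 1e-12)
```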

Declaration of Interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled "Joint Dimensionality Reduction and Metric Learning for Image Set Classification".


Wenzhu Yan: Conceptualization, Methodology, Writing - Original Draft. Quansen Sun: Supervision. Huaijiang Sun: Writing - Review & Editing, Supervision. Yanmeng Li: Validation, Formal analysis, Writing - Original Draft.
