Multi-task least squares twin support vector machine for classification

Neurocomputing xxx (xxxx) xxx

Contents lists available at ScienceDirect

Neurocomputing journal homepage: www.elsevier.com/locate/neucom

Benshan Mei a, Yitian Xu b,∗

a College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China
b College of Science, China Agricultural University, Beijing 100083, China

Article info

Article history: Received 27 July 2018; Revised 19 September 2018; Accepted 9 December 2018; Available online xxx. Communicated by Prof. Yudong Zhang.

Keywords: Pattern recognition; Multi-task learning; Relation learning; Least squares twin support vector machine

Abstract

With the rapid development of machine learning, pattern recognition plays an important role in many fields. However, traditional pattern recognition mainly focuses on single-task learning (STL), while multi-task learning (MTL) has largely been ignored. Compared with STL, MTL can improve the performance of learning methods through the information shared among all tasks. Inspired by the recently proposed direct multi-task twin support vector machine (DMTSVM) and the least squares twin support vector machine (LSTWSVM), we put forward a novel multi-task least squares twin support vector machine (MTLS-TWSVM). Instead of the two dual quadratic programming problems (QPPs) solved in DMTSVM, our algorithm only needs to solve two smaller systems of linear equations. This yields simple solutions whose computation can be effectively accelerated, so the proposed model can be applied to large-scale datasets. In addition, it can handle linearly inseparable samples by using the kernel trick. Experiments on three popular multi-task datasets show the effectiveness of the proposed method, and experimental results on two popular image datasets further demonstrate its validity. © 2019 Elsevier B.V. All rights reserved.

1. Introduction

Multi-task learning (MTL), also known as inductive transfer or inductive bias learning [1], has drawn increasing attention in natural language processing, computer vision and bioinformatics [2–4]. Detailed surveys of multi-task learning have recently been published in [5,6]. Compared with traditional single-task learning, which treats each task independently and ignores the relationships among tasks, multi-task learning focuses on improving overall performance by discovering relationships among different tasks. Multi-task learning assumes that related tasks share similar structure and information, which may be useful for improving the overall performance of all tasks [7]. This is especially helpful when only a small number of samples is available in each single task, since small tasks may benefit from the other tasks in multi-task learning. Multi-task learning aims at improving the overall performance of all tasks through the information shared among them [5]. Inspired by this idea, a variety of multi-task learning methods have been proposed based on different single-task learners, such as multi-task logistic regression [8], multi-task linear discriminant analysis (MT-LDA) [9], multi-task Bayesian methods [10], and multi-task Gaussian processes (MTGP) [11]. The well-known boosting method has also been introduced into multi-task learning [12]. Naturally, as a general machine learning idea, multi-task learning can also be applied to deep learning and reinforcement learning [5]. Generally speaking, the key point in multi-task learning is the modelling of features and task relations. Many forms of multi-task learning have been proposed based on different assumptions, including mean-regularized multi-task learning, multi-task feature learning, multi-task relation learning and others. Some researchers discussed feature learning in the multi-task setting [13,14], while others attempted to learn multi-task relationships [7,15] or a low-rank representation that captures the task relations [16,17]. Combining feature selection and relationship learning, a multi-task joint sparse and low-rank representation model was put forward in [18]. Inspired by the task-clustering idea, clustered multi-task learning was proposed to utilize the cluster information among all tasks [19]. After two decades of development, many other novel forms of multi-task learning have appeared, such as calibrated multi-task learning (CMTL) [20], federated multi-task learning (FMTL) [21], asynchronous multi-task learning (AMTL) [22] and interactive multi-task relationship learning (IMTRL) [23]. Most recently proposed multi-task learning algorithms can be found in the recent surveys [5,6]. However, most of these algorithms solve specific multi-task learning problems rather than more general ones, which restricts their application. In addition, most algorithms are based on comprehensive

∗ Corresponding author. E-mail address: [email protected] (Y. Xu).

https://doi.org/10.1016/j.neucom.2018.12.079 0925-2312/© 2019 Elsevier B.V. All rights reserved.

Please cite this article as: B. Mei and Y. Xu, Multi-task least squares twin support vector machine for classification, Neurocomputing, https://doi.org/10.1016/j.neucom.2018.12.079


mathematical theory and cannot be easily implemented; their complexity prevents many practitioners from understanding them. Thus, we need algorithms that are easier to implement. Due to the success of the support vector machine (SVM) [24] in single-task learning, some researchers have focused on multi-task SVMs [25–27]. The first attempt at a multi-task support vector machine is regularized multi-task learning (RMTL) [28], which assumes that all tasks share a mean hyperplane. Inspired by RMTL, the multi-task one-class SVM (MTOC-SVM) has been discussed in many works [26,27,29]. A multi-task learning framework for one-class classification, derived from the one-class ν-SVM (OC-ν-SVM), is proposed in [27]. Based on the least squares support vector machine (LSSVM) [30], researchers proposed the multi-task least squares support vector machine (MTLS-SVM) [31], which solves a group of linear equations. The multi-task proximal support vector machine (MTPSVM), built on the proximal support vector machine (PSVM) [33], is proposed in [32]; compared with other multi-task SVMs, it has a simple form and a lower computational cost. Recently, taking advantage of multi-task learning, the multi-task asymmetric least squares support vector machine (MTL-aLS-SVM) was proposed in [34], which uses different kernel functions for different tasks. Finally, to deal with multi-class problems, a multi-task multi-class SVM is discussed in [35]. Notably, most multi-task SVMs belong to the mean-regularized multi-task learning family, in which all tasks share a common average classification hyperplane. In contrast, few other forms of multi-task learning have been introduced into multi-task SVMs. Inspired by clustered multi-task learning, clustered multi-task support vector regression (Clustered-MT-SVR) was proposed for facial age estimation [36]. Graph-regularized multi-task learning (GB-MTL) has also been introduced into the multi-task support vector machine [37].
More recently, a feature selection and shared information discovery (FSSI) model has been introduced into multi-task support vector machines [38]; it learns the shared features and the relations among all tasks simultaneously. However, there have been few attempts to introduce multi-task learning into the twin support vector machine (TWSVM). TWSVM was first proposed in [39]; its main idea is to use two nonparallel hyperplanes to separate the positive and negative samples. Single-task TWSVM variants have been discussed in many works [40–43], such as the least squares twin bounded support vector machine (LS-TBSVM) [44], the least squares recursive projection twin support vector machine (LSRP-TWSVM) [45,46], and the nonparallel support vector machine (NPSVM) [47]. Inspired by multi-task learning, the direct multi-task twin support vector machine (DMTSVM) [48] was proposed, which also employs the mean-regularized method. DMTSVM supposes that all tasks share two mean hyperplanes, which distinguishes it from ordinary multi-task SVMs. Recently, to deal with outlier samples in each task, the multi-task centroid twin support vector machine (MCTSVM) was put forward in [49]. Inspired by these models and by the efficiency of the least squares twin support vector machine (LSTWSVM), we propose a multi-task LSTWSVM in this paper. The contributions of our method are as follows:

(i) We propose a novel MTLS-TWSVM based on LSTWSVM, which inherits the merits of multi-task learning and improves the classification performance.
(ii) The proposed model only needs to solve a pair of small linear systems, so it runs faster than DMTSVM and MCTSVM.
(iii) The calculation of our proposed multi-task model can be effectively decomposed into a series of small matrix inversions, which is especially important for parallel or distributed computing when there are a large number of small tasks.

The rest of this paper is organized as follows. After a brief review of LSTWSVM and the primal form of DMTSVM in Section 2, our proposed model is introduced in Section 3. The derivation of our multi-task model is presented in Section 3.1, the model is extended to the nonlinear situation in Section 3.2, and an accelerating approach is provided in Section 3.3. Section 4 reports numerical experiments on three benchmark datasets and two popular image recognition datasets. Finally, conclusions and future work are presented in Section 5.

2. Related work

We first give an overview of the primal LSTWSVM, and then introduce DMTSVM. These algorithms provide the foundation of our proposed method.

2.1. Least squares twin support vector machine

The least squares twin support vector machine (LSTWSVM) was proposed in [50]. It is a simple and fast algorithm that generates a binary classifier from two nonparallel hyperplanes. Suppose we have a dataset D of n samples in an m-dimensional real space R^m, represented by an n × m matrix X. Each sample x_i has a binary label y_i ∈ {−1, 1}, so the dataset can be written as D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}. We introduce the notation A = [X_p, e], B = [X_n, e], where X_p contains the positive samples and X_n the negative ones. Let u = [w_1; b_1] and v = [w_2; b_2]; then LSTWSVM can be described as

$$\min_{u,p}\ \frac{1}{2}\|Au\|_2^2+\frac{c_1}{2}p^{\top}p \quad \text{s.t.}\quad -Bu+p=e_2, \tag{1}$$

and

$$\min_{v,q}\ \frac{1}{2}\|Bv\|_2^2+\frac{c_2}{2}q^{\top}q \quad \text{s.t.}\quad Av+q=e_1, \tag{2}$$

where p and q are slack variables, e_1 and e_2 are one vectors of appropriate dimensions, and c_1 and c_2 are trade-off parameters. Finally, the label of a new sample x is determined by the minimum of |x^T w_i + b_i|, i ∈ {1, 2}.

2.2. Multi-task twin support vector machine

The multi-task twin support vector machine proposed in [48], which directly introduces the RMTL idea into TWSVM, is also called the direct multi-task twin support vector machine (DMTSVM). It combines the idea of the twin support vector machine with regularized multi-task learning. Suppose the positive samples of the t-th task are represented by X_pt and the negative samples by X_nt, while X_p collects all positive samples and X_n all negative ones. We introduce the notation

A_t = [X_pt, e_t],  A = [X_p, e],  B_t = [X_nt, e_t],  B = [X_n, e],

where e_t and e are one vectors of appropriate dimensions. Suppose all tasks share two mean hyperplanes u = [w_1; b_1] and v = [w_2; b_2]. The positive hyperplane of the t-th task can be represented by [w_1t; b_1t] = (u + u_t), while the negative hyperplane of the t-th task is [w_2t; b_2t] = (v + v_t). Here, u_t and v_t represent the bias between task t and the common mean vectors u and v, respectively. The optimization problems of DMTSVM can then be written as

$$\min_{u,u_t,p_t}\ \frac{1}{2}\|Au\|_2^2+\frac{1}{2}\sum_{t=1}^{T}\rho_t\|A_t u_t\|_2^2+c_1\sum_{t=1}^{T}e_t^{\top}p_t$$
$$\text{s.t.}\quad \forall t:\ -B_t(u+u_t)+p_t\ge e_t,\quad p_t\ge 0, \tag{3}$$


and

$$\min_{v,v_t,q_t}\ \frac{1}{2}\|Bv\|_2^2+\frac{1}{2}\sum_{t=1}^{T}\lambda_t\|B_t v_t\|_2^2+c_2\sum_{t=1}^{T}e_t^{\top}q_t$$
$$\text{s.t.}\quad \forall t:\ A_t(v+v_t)+q_t\ge e_t,\quad q_t\ge 0, \tag{4}$$

where c_1, c_2, ρ_t and λ_t are non-negative trade-off parameters, e_{1t} and e_{2t} are one vectors of appropriate dimensions, and p, p_t, q and q_t are slack variables. If ρ_t → 0 and λ_t → 0 simultaneously, all tasks are learned independently, while ρ_t → ∞ and λ_t → ∞ force the task models to be identical. The decision function for a new sample x belonging to the t-th task is

$$f(x)=\arg\min_{r=1,2}\ |x^{\top}w_{rt}+b_{rt}|. \tag{5}$$

3. Multi-task least squares twin support vector machine

In this paper, we assume all tasks share a common parameter to measure the relationships among tasks, which is different from the settings in [48,49]. We now introduce mean-regularized multi-task learning into LSTWSVM.

3.1. Linear multi-task least squares twin support vector machine

Suppose all positive and negative samples are represented by X_p and X_n, respectively, and the positive and negative samples of the t-th task by X_pt and X_nt. The definitions of A, A_t, B and B_t, as well as of u, u_t, v and v_t, are the same as in Section 2.2. Our proposed model can then be written as

$$\min_{u,u_t,p_t}\ \frac{1}{2}\|Au\|_2^2+\frac{\rho}{2T}\sum_{t=1}^{T}\|A_t u_t\|_2^2+\frac{c_1}{2}\sum_{t=1}^{T}p_t^{\top}p_t$$
$$\text{s.t.}\quad \forall t:\ -B_t(u+u_t)+p_t=e_{2t}, \tag{6}$$

and

$$\min_{v,v_t,q_t}\ \frac{1}{2}\|Bv\|_2^2+\frac{\lambda}{2T}\sum_{t=1}^{T}\|B_t v_t\|_2^2+\frac{c_2}{2}\sum_{t=1}^{T}q_t^{\top}q_t$$
$$\text{s.t.}\quad \forall t:\ A_t(v+v_t)+q_t=e_{1t}, \tag{7}$$

where c_1, c_2, ρ and λ are positive trade-off parameters. Different from DMTSVM, our model replaces the inequality constraints on p_t and q_t with a least squares form. Moreover, every task plays an equal role in our algorithm, represented by the common parameters ρ and λ, which also differs from DMTSVM. Larger ρ and λ force u_t and v_t to be small, so the T learned models become similar; the smaller ρ and λ are, the larger the differences among the tasks may become. Besides, the task number T enters our model, which is consistent with RMTL. Although these changes are simple, the model now solves only two equality-constrained problems, which lead to two systems of linear equations and can be solved easily.

We now introduce Lagrange multipliers into the above formulations. The Lagrangian function of problem (6) is

$$L_1=\frac{1}{2}\|Au\|_2^2+\frac{\rho}{2T}\sum_{t=1}^{T}\|A_t u_t\|_2^2+\frac{c_1}{2}\sum_{t=1}^{T}p_t^{\top}p_t-\sum_{t=1}^{T}\alpha_t^{\top}\left[-B_t(u+u_t)+p_t-e_{2t}\right], \tag{8}$$

where α_t is the Lagrange multiplier. Differentiating (8) with respect to u, u_t and p_t yields the following Karush–Kuhn–Tucker (KKT) conditions:

$$\frac{\partial L}{\partial u}=A^{\top}Au+B^{\top}\alpha=0, \tag{9}$$

$$\frac{\partial L}{\partial u_t}=\frac{\rho}{T}A_t^{\top}A_t u_t+B_t^{\top}\alpha_t=0, \tag{10}$$

$$\frac{\partial L}{\partial p_t}=c_1 p_t-\alpha_t=0. \tag{11}$$

From the equations above, we obtain

$$u=-(A^{\top}A)^{-1}B^{\top}\alpha, \tag{12}$$

$$u_t=-\frac{T}{\rho}(A_t^{\top}A_t)^{-1}B_t^{\top}\alpha_t, \tag{13}$$

$$p_t=\frac{\alpha_t}{c_1}. \tag{14}$$

Substituting u, u_t and p_t into the equality constraint of (6), we get

$$B_t\left((A^{\top}A)^{-1}B^{\top}\alpha+\frac{T}{\rho}(A_t^{\top}A_t)^{-1}B_t^{\top}\alpha_t\right)+\frac{\alpha_t}{c_1}=e_{2t}, \tag{15}$$

where t ∈ {1, 2, ..., T} and α = [α_1; α_2; ...; α_T]. Here, we define

$$Q=B(A^{\top}A)^{-1}B^{\top}, \tag{16}$$

$$P_t=B_t(A_t^{\top}A_t)^{-1}B_t^{\top}, \tag{17}$$

$$P=\mathrm{blkdiag}(P_1,P_2,\ldots,P_T), \tag{18}$$

and let I be the identity matrix of the same dimension as Q. Then (15) can be transformed into

$$\alpha=\left(Q+\frac{T}{\rho}P+\frac{1}{c_1}I\right)^{-1}e_2. \tag{19}$$

Once the dual variable α is solved, the classifier parameters u and u_t of the t-th task are determined. Similarly, the Lagrangian function of (7) can be written as

$$L_2=\frac{1}{2}\|Bv\|_2^2+\frac{\lambda}{2T}\sum_{t=1}^{T}\|B_t v_t\|_2^2+\frac{c_2}{2}\sum_{t=1}^{T}q_t^{\top}q_t-\sum_{t=1}^{T}\gamma_t^{\top}\left[A_t(v+v_t)+q_t-e_{1t}\right], \tag{20}$$

where γ_t is the Lagrange multiplier. After differentiating (20) with respect to v, v_t and q_t and substituting into the equality constraint of (7), we obtain

$$A_t\left((B^{\top}B)^{-1}A^{\top}\gamma+\frac{T}{\lambda}(B_t^{\top}B_t)^{-1}A_t^{\top}\gamma_t\right)+\frac{\gamma_t}{c_2}=e_{1t}, \tag{21}$$

where γ = [γ_1; γ_2; ...; γ_T]. Let

$$R=A(B^{\top}B)^{-1}A^{\top}, \tag{22}$$

$$S_t=A_t(B_t^{\top}B_t)^{-1}A_t^{\top}, \tag{23}$$

$$S=\mathrm{blkdiag}(S_1,S_2,\ldots,S_T); \tag{24}$$

then the dual variable γ can be obtained by

$$\gamma=\left(R+\frac{T}{\lambda}S+\frac{1}{c_2}I\right)^{-1}e_1. \tag{25}$$

The negative hyperplane can then be obtained. Finally, the label of a new sample x in the t-th task is determined by

$$f(x)=\arg\min_{r=1,2}\ |x^{\top}w_{rt}+b_{rt}|. \tag{26}$$
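To make the closed-form training procedure concrete, the following NumPy sketch solves the two dual linear systems (19) and (25) and recovers the per-task hyperplanes via (12), (13) and (26). It is a minimal illustration under our own assumptions, not the authors' code; in particular, the small ridge `eps` added before each matrix inverse is an implementation choice for numerical stability and is not part of the paper.

```python
import numpy as np

def block_diag(blocks):
    """Assemble blkdiag(M_1, ..., M_T) as a dense matrix."""
    n = sum(b.shape[0] for b in blocks)
    out, i = np.zeros((n, n)), 0
    for b in blocks:
        out[i:i + b.shape[0], i:i + b.shape[0]] = b
        i += b.shape[0]
    return out

def mtls_twsvm_train(tasks, c1=1.0, c2=1.0, rho=1.0, lam=1.0, eps=1e-8):
    """Linear MTLS-TWSVM: `tasks` is a list of (Xp_t, Xn_t) sample arrays.

    Returns, for every task t, the augmented hyperplane pair
    (u + u_t, v + v_t).  `eps` is a ridge term (our own assumption).
    """
    T = len(tasks)
    A_list = [np.hstack([Xp, np.ones((len(Xp), 1))]) for Xp, _ in tasks]
    B_list = [np.hstack([Xn, np.ones((len(Xn), 1))]) for _, Xn in tasks]
    A, B = np.vstack(A_list), np.vstack(B_list)

    inv = lambda M: np.linalg.inv(M + eps * np.eye(M.shape[0]))
    AtA_inv, BtB_inv = inv(A.T @ A), inv(B.T @ B)

    # alpha = (Q + (T/rho) P + I/c1)^{-1} e2, Eq. (19)
    Q = B @ AtA_inv @ B.T
    P = block_diag([Bt @ inv(At.T @ At) @ Bt.T for At, Bt in zip(A_list, B_list)])
    alpha = np.linalg.solve(Q + (T / rho) * P + np.eye(len(B)) / c1, np.ones(len(B)))

    # gamma = (R + (T/lam) S + I/c2)^{-1} e1, Eq. (25)
    R = A @ BtB_inv @ A.T
    S = block_diag([At @ inv(Bt.T @ Bt) @ At.T for At, Bt in zip(A_list, B_list)])
    gamma = np.linalg.solve(R + (T / lam) * S + np.eye(len(A)) / c2, np.ones(len(A)))

    u = -AtA_inv @ B.T @ alpha        # Eq. (12)
    v = BtB_inv @ A.T @ gamma
    planes, ia, ig = [], 0, 0
    for At, Bt in zip(A_list, B_list):
        a_t, ia = alpha[ia:ia + len(Bt)], ia + len(Bt)
        g_t, ig = gamma[ig:ig + len(At)], ig + len(At)
        u_t = -(T / rho) * inv(At.T @ At) @ Bt.T @ a_t   # Eq. (13)
        v_t = (T / lam) * inv(Bt.T @ Bt) @ At.T @ g_t
        planes.append((u + u_t, v + v_t))
    return planes

def mtls_twsvm_predict(planes, t, X):
    """Assign the class of the nearer hyperplane of task t, Eq. (26)."""
    w1, w2 = planes[t]
    Xa = np.hstack([X, np.ones((len(X), 1))])
    return np.where(np.abs(Xa @ w1) <= np.abs(Xa @ w2), 1, -1)
```

Note that, as claimed above, training reduces to two dense linear solves plus a handful of small matrix inversions; no QP solver is involved.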


Table 1. Statistics of three datasets.

| Name | #Attributes | #Instances | #Classes | #Tasks |
|---|---|---|---|---|
| Isolet | 617 | 9674 | 26 | 5 |
| Monk | 7 | 432 | 2 | 3 |
| Landmine | 9 | 9674 | 2 | 29 |

3.2. Nonlinear multi-task least squares twin support vector machine

A linear classifier may not be suitable for training samples that are linearly inseparable. The kernel trick can be used to cope with such problems. We introduce a kernel function and define

E = [K(A, Zᵀ), e],  E_t = [K(A_t, Zᵀ), e_t],  F = [K(B, Zᵀ), e],  F_t = [K(B_t, Zᵀ), e_t].

Here, K(·,·) stands for a specific kernel function, and Z represents the training samples from all tasks, that is, Z = [A_1ᵀ, B_1ᵀ, A_2ᵀ, B_2ᵀ, ..., A_Tᵀ, B_Tᵀ]ᵀ. The primal problems of the nonlinear model are

$$\min_{u,u_t,p_t}\ \frac{1}{2}\|Eu\|_2^2+\frac{\rho}{2T}\sum_{t=1}^{T}\|E_t u_t\|_2^2+\frac{c_1}{2}\sum_{t=1}^{T}p_t^{\top}p_t$$
$$\text{s.t.}\quad \forall t:\ -F_t(u+u_t)+p_t=e_{2t}, \tag{27}$$

and

$$\min_{v,v_t,q_t}\ \frac{1}{2}\|Fv\|_2^2+\frac{\lambda}{2T}\sum_{t=1}^{T}\|F_t v_t\|_2^2+\frac{c_2}{2}\sum_{t=1}^{T}q_t^{\top}q_t$$
$$\text{s.t.}\quad \forall t:\ E_t(v+v_t)+q_t=e_{1t}, \tag{28}$$

where c_1 and c_2 are non-negative trade-off parameters, p_t and q_t are slack variables, and e_{1t} and e_{2t} are one vectors. The decision function of the t-th task is

$$f(x)=\arg\min_{r=1,2}\ |K(x,Z^{\top})w_{rt}+b_{rt}|. \tag{29}$$
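As a concrete illustration of this kernel construction, the sketch below builds the augmented matrices E_t = [K(X_pt, Zᵀ), e_t] and F_t = [K(X_nt, Zᵀ), e_t] with a Gaussian kernel (the kernel employed later in the experiments). It is a minimal sketch under our own naming, not the authors' code.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """K[i, j] = exp(-||x_i - z_j||^2 / (2 * sigma^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def augmented_kernel_blocks(tasks, sigma=1.0):
    """Build E_t and F_t for every task; Z stacks the samples of all tasks."""
    Z = np.vstack([np.vstack([Xp, Xn]) for Xp, Xn in tasks])
    E, F = [], []
    for Xp, Xn in tasks:
        E.append(np.hstack([gaussian_kernel(Xp, Z, sigma), np.ones((len(Xp), 1))]))
        F.append(np.hstack([gaussian_kernel(Xn, Z, sigma), np.ones((len(Xn), 1))]))
    return Z, E, F
```

With E = vstack(E_t) and F = vstack(F_t), the linear derivation of Section 3.1 carries over to problems (27) and (28), with E and F playing the roles of A and B.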

3.3. Calculating optimizations

Many large matrix inversions are needed in the above equations, and the time consumed can be considerable when there are many samples. By analysing these equations, we find that the calculation of the Lagrange multipliers α and γ has a nice structure and can be accelerated by the Sherman–Morrison–Woodbury (SMW) formula, so the two systems above can be optimized directly. According to [50], the general formula is

$$(D+UCV)^{-1}=D^{-1}-D^{-1}U\left(C^{-1}+VD^{-1}U\right)^{-1}VD^{-1}.$$

For α, we introduce the notation

$$D=\frac{T}{\rho}P+\frac{1}{c_1}I,\quad U=B,\quad C=(A^{\top}A)^{-1},\quad V=B^{\top}. \tag{30}$$

The calculation of α can then be reformulated as

$$\alpha=\left(D^{-1}-D^{-1}B(A^{\top}A+B^{\top}D^{-1}B)^{-1}B^{\top}D^{-1}\right)e_2. \tag{31}$$

Similarly, for the dual variable γ we define

$$D=\frac{T}{\lambda}S+\frac{1}{c_2}I,\quad U=A,\quad C=(B^{\top}B)^{-1},\quad V=A^{\top}, \tag{32}$$

and the calculation of γ can be written as

$$\gamma=\left(D^{-1}-D^{-1}A(B^{\top}B+A^{\top}D^{-1}A)^{-1}A^{\top}D^{-1}\right)e_1. \tag{33}$$

Both matrices D in (31) and (33) are block-diagonal and mostly zero, so D^{-1} can be efficiently computed by block-wise inversion, and the relevant computations can be accelerated with sparse matrices. Analysing each diagonal block D_t of D further, we notice that it can be sped up by reusing the SMW formula. The inversion of each block of D in (31) is

$$D_t^{-1}=c_1\left(I_t-c_1 B_t\left(\frac{\rho}{T}A_t^{\top}A_t+c_1 B_t^{\top}B_t\right)^{-1}B_t^{\top}\right), \tag{34}$$

and the inversion of each block of D in (33) can likewise be accelerated by

$$D_t^{-1}=c_2\left(I_t-c_2 A_t\left(\frac{\lambda}{T}B_t^{\top}B_t+c_2 A_t^{\top}A_t\right)^{-1}A_t^{\top}\right). \tag{35}$$
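The per-block shortcut in Eq. (34) is easy to check numerically. The sketch below computes D_t^{-1} through the SMW identity, so that only an (m+1) × (m+1) system is solved instead of inverting an n_t × n_t matrix; the random stand-ins for A_t and B_t used when exercising it are our own test assumption, not data from the paper.

```python
import numpy as np

def dt_inverse_smw(At, Bt, rho, c1, T):
    """Inverse of D_t = (T/rho) B_t (A_t^T A_t)^{-1} B_t^T + I/c1 via SMW, Eq. (34).

    Only the small (m+1) x (m+1) matrix `inner` is factorized; the cost is
    independent of the number of negative samples n_t in the task.
    """
    inner = (rho / T) * (At.T @ At) + c1 * (Bt.T @ Bt)   # (m+1) x (m+1)
    return c1 * (np.eye(len(Bt)) - c1 * Bt @ np.linalg.solve(inner, Bt.T))
```

The same pattern with the roles of A_t and B_t swapped, and (λ, c_2) in place of (ρ, c_1), gives Eq. (35).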

According to the definition of D, the number of diagonal blocks in D equals the number of tasks, so when there are many small tasks the effect of this acceleration is significant compared with the plain computation. Moreover, if the calculation is organized carefully, no more than 4T + 4 matrix inversions are necessary, where T is the number of tasks, and each matrix has dimension (m + 1) × (m + 1), where m is the number of features. This can be explained as follows. First, 2T inversions are needed for the blocks of D in α and γ. For u_t and v_t, we need T inversions of A_tᵀA_t and T inversions of B_tᵀB_t, respectively. To solve u and v, the inversions of AᵀA and BᵀB are required, as are the inversions of AᵀA + BᵀD⁻¹B and BᵀB + AᵀD⁻¹A. Thus, only 4T + 4 matrix inversions are necessary. In addition, each inversion is carried out on a small matrix whose dimension depends only on the number of features, not on the number of samples in each task or overall. With many tasks, this yields a considerable improvement in computational efficiency.

4. Numerical experiments

In this section, we present comparative experimental results on five traditional STL methods and five MTL methods. The STL algorithms are SVM, PSVM, LS-SVM, TWSVM and LSTWSVM, while the MTL methods are MTPSVM, MTLS-SVM, MTL-aLS-SVM, DMTSVM and MCTSVM. The experiments are first conducted on three benchmark datasets. To further evaluate our method, we also conduct extensive experiments on the popular Caltech image datasets; since multi-task image classification is more complex, such a comparison is worthwhile. For each algorithm, all parameters, such as λ, γ and ρ, are tuned by a grid-search strategy; unless otherwise specified, the parameters are selected from the set {2^i | i = −3, −2, ..., 8}. The parameter μ in MTL-aLS-SVM is selected from {0.83, 0.9, 0.97}. The Gaussian kernel function is employed in all experiments. Multi-task average classification accuracy and training time are our evaluation indicators, and we use three-fold cross-validation on the datasets to obtain average performance. All experiments are performed in Matlab R2018b on Windows 8.1 on a PC with an Intel(R) Core(TM) i3-6100 CPU (3.90 GHz) and 12.00 GB of RAM.

4.1. Benchmark datasets

We select three popular multi-task datasets to test these algorithms: the Isolet, Monk and Landmine datasets, which have been widely used to evaluate multi-task learning methods [31,32,48]. The first two come from the UCI Machine Learning Repository. A summary is provided in Table 1. The details are as follows.

Isolet. This widely used spoken-letter recognition dataset is generated as follows: 150 subjects speak each letter of the


Table 2. Performance comparison over all benchmark datasets (accuracy in %, time in ms).

| Type | Method | Landmine Accuracy | Landmine Time (ms) | Isolet Accuracy | Isolet Time (ms) | Monk Accuracy | Monk Time (ms) |
|---|---|---|---|---|---|---|---|
| STL | SVM | 76.19 ± 5.37 | 53.01 | 87.17 ± 10.35 | 17.68 | 94.75 ± 5.58 | 49.57 |
| STL | PSVM | 71.67 ± 7.09 | 2.08 | 98.50 ± 0.91 | 2.12 | 78.63 ± 4.20 | 9.28 |
| STL | LS-SVM | 75.25 ± 6.49 | 2.34 | 98.50 ± 0.91 | 2.34 | 89.51 ± 7.40 | 11.21 |
| STL | TWSVM | 78.35 ± 4.88 | 98.96 | 98.50 ± 1.09 | 34.29 | 97.69 ± 3.24 | 64.58 |
| STL | LSTWSVM | 78.44 ± 5.42 | 4.63 | 98.67 ± 1.12 | 4.09 | 92.98 ± 9.76 | 22.83 |
| MTL | MTPSVM | 76.73 ± 5.61 | 79.96 | 99.50 ± 0.46 | 9.78 | 89.20 ± 7.75 | 42.15 |
| MTL | MTLS-SVM | 76.82 ± 5.73 | 46.86 | 99.50 ± 0.46 | 11.68 | 89.20 ± 7.75 | 58.45 |
| MTL | MTL-aLS-SVM | 79.14 ± 4.82 | 766.81 | 99.83 ± 0.37 | 109.50 | 93.29 ± 10.63 | 923.45 |
| MTL | DMTSVM | 79.32 ± 5.61 | 646.66 | 99.67 ± 0.46 | 68.49 | 98.23 ± 2.88 | 352.06 |
| MTL | MCTSVM | 80.00 ± 6.24 | 731.93 | 99.67 ± 0.46 | 83.50 | 98.30 ± 2.94 | 377.21 |
| Ours | MTLS-TWSVM | 79.41 ± 5.83 | 647.83 | 99.83 ± 0.37 | 73.68 | 92.90 ± 9.89 | 330.08 |

alphabet twice. Thus, there are 52 training samples from each speaker. All speakers are grouped into five sets, and each set contains 30 speakers. Thus, we have five groups of data, which can be seen as five tasks with a highly relativity. They are distinct from each other in the way of speaking. But the alphabets in each task have similarity in accent. In this paper, we select two alphabets for our experiments, each task is to distinguish this pair alphabets. To improve the computational efficiency, we reduce the dimensionality from 617 to 242 with PCA by capturing 97% of the information. Monk This dataset comes from the first international comparison of learning algorithms, in which contains three Monk’s problems corresponding to three tasks. Besides, the domains of all tasks are the same. Thus, these tasks can be seen as related. We use the whole dataset to evaluate these eleven algorithms in this paper. Then, we show further performance evaluation on this dataset. Landmine 1 This dataset contains 29 binary classification tasks. Each task is represented by a 9-dimensional feature vector extracted from radar images that capture a single region of landmine fields. The first 15 tasks are corresponding to regions that are highly foliated, while the last 14 tasks are from regions that are bare earth or deserted. The goal is to detect land-mines in specific regions. Since tasks extracted from the same region can be seen as a set of associated tasks. Therefore, we select all tasks from foliated regions in our experiments. Because of the highly unbalance of dataset, i.e., the negative samples are far more than the positive samples. We remove some negative samples to reduce the imbalance. We first conducted comparative experiments on three benchmark datasets. Table 2 shows the performance comparison between our method and other STL or MTL methods. 
It can be noticed that the multi-task average classification accuracy of the MTL methods outperform the STL methods on three datasets in most cases. But all these MTL methods take more training time when compared to those STL methods. It is because all tasks are trained simultaneously in MTL methods, while STL methods treat this task independently. Thus those MTL methods may benefit more from the underlying information among all tasks, but with more time consumed. We also find both MTPSVM and MTLS-SVM run faster than other MTL methods significantly, when taking MTL methods into consideration only. However, their average accuracy is lower than the other MTL algorithms clearly. In addition, although MTLaLS-SVM has a comparable classification accuracy with multi-task TWSVMs, its training time is the highest among all methods. But the classification performance on the Monk dataset shows inconsistency. Since there are more samples in this dataset, the classification accuracy of STL methods will be increased. Some of STL methods may perform even better than MTL methods. The perfor1

Monk

Accuracy

Landmine: http://people.ee.duke.edu/∼lcarin/LandmineData.zip.

Fig. 1. Accuracy comparison between our method with other STL methods on Monk dataset with varying task size.

mance of MTL methods, including ours, will be degraded. This can be interpreted as follows. Since there are only three tasks in Monk dataset, and each task contains more samples than Isolet and Landmine datasets, our model may be misled by the shared information in these three tasks. In addition, we can not confirm that these three tasks are related. Learning unrelated tasks may have negative impact on these multi-task models, but single task methods will not be affected with those unrelated tasks. To better evaluate the performance of our algorithm, we conducted experiments on Monk dataset with varying number of samples. The number of samples in each task ranges from 40 to 180. Fig. 1 describes the accuracy comparison between our model and STL models. The solid line represents the result of our method. From it, we find our algorithm get better performance when the task size is relatively small. As the task size increases, they achieve almost the same performance, which can be interpreted as small task size provides less information for single task learning methods, while the multi-task method can exploit more information by simultaneously learning those tasks. The accuracy comparison between our method and other MTL methods is depicted in Fig. 2. We can notice that our MTLSTWSVM outperforms traditional multi-task SVMs in most cases. Besides, our methods get a comparable performance when compared to other multi-task TWSVMs. In addition, our algorithm also performs better than DMTSVM when there are relatively small samples in each task. This figure demonstrates that our algorithm

Please cite this article as: B. Mei and Y. Xu, Multi-task least squares twin support vector machine for classification, Neurocomputing, https://doi.org/10.1016/j.neucom.2018.12.079

JID: NEUCOM 6

ARTICLE IN PRESS

[m5G;February 11, 2019;15:4]

B. Mei and Y. Xu / Neurocomputing xxx (xxxx) xxx

Fig. 4. Sample images in Caltech datasets for multi-task learning.

4.2. Image recognition datasets

Fig. 2. Accuracy comparison among MTL methods on Monk dataset with varying task size.

Fig. 3. Training time comparison among MTL methods on Monk dataset with varying task size.

can better utilize the information provided by all tasks, while the multi-task SVMs can not. In this paragraph, we only compare our model with MTL methods. Fig. 3 shows the average training time comparison among multi-task learning methods. We find our algorithm is faster than MTL-aLS-SVM significantly. Meanwhile, our method also runs faster than other two multi-task TWSVMs. Besides, the average training time of MTL-aLS-SVM increases rapidly, when compared to other MTL methods. In addition, with the number of samples in each task increasing, other MTL methods take more traning time, when compared to MTPSVM and MTLS-SVM. In contrast to the other multi-task models, the curve of MTPSVM and MTLS-SVM are more steady. We could draw a conclusion from Figs. 2 and 3. Among all MTL methods, our algorithm has a comparable classification performance when compared with the other multi-task TWSVMs, but with a relatively low computational cost. Although both MTLSSVM and MTPSVM run faster than the other MTL methods. But they both show poor classification accuracy. In addition, the average training time of MTL-aLS-SVM is the highest among all MTL methods.

In this section, we conduct extensive experiments on popular Caltech image dataset, including the Caltech 101 [51] and the Caltech 256 datasets [52]. Caltech 101 dataset contains 102 categories, about 40 to 800 images per category, such as air-planes, butterfly, flamingo and leopards. There are about 50 images in majority of categories. The size of each image is about 300 × 200 pixels. In addition, there is a group of annotations for each image [53]. Each group of annotations contains two pieces of information: the general bounding box in which the object is located, and a detailed human-specified outline enclosing the object. In addition, the Caltech 256 dataset has 256 categories of images and a clutter class. The clutter category can be seen as noises or backgrounds. There are at least 80 images per category, but no image annotation is provided. We show some sample images in Fig. 4. Every column of samples comes from the same superclass, but each row in the same column belongs to different subclasses. From it, we notice that samples belonging to the same superclass share similar features. For example, the balls in different categories vary in colours but have similarity in shapes. Besides, all bikes have two wheels, and all flowers have similar textures. In addition, all mammals have a head, two eyes and four or two legs, sometimes with a hairy tail. As a complementary, the birds have similar texture in feathers as well as two slim legs. Thus, recognizing those objects in different subcategories but in a common superclass can be seen as a group of related tasks. Therefore, we can improve the classification performance of multi-task learning methods by training these related tasks simultaneously. Due to the lack of samples in majority class in the Caltech 101 dataset, we select five superclass of images to form five multi-task datasets. Each dataset contains samples from relevant subclasses, and we use no more than 50 samples from each subcategory. 
In other words, there are no more than 100 samples in each binary classification task. As for the Caltech 256 dataset, more categories and images are available, so we select ten superclasses of images to form ten multi-task datasets. The samples of each dataset are drawn from the corresponding subclasses, and we use no more than 80 samples from each subclass. For image feature extraction, the popular dense-SIFT algorithm is used to extract features from the images. We then quantize those features into 1000 visual words with the bag-of-words model. Next, we reduce the feature vectors to an appropriate dimension with PCA, retaining 97% of the variance. Thus, we can use the kernel trick to improve the classification accuracy. Finally, the goal of each task is to distinguish objects in a subcategory from the clutter category. The experimental results on the Caltech 101 dataset are shown in Tables 3 and 4. The comparison of average training time is similar to that in the previous section. However, our algorithm shows relatively poor performance on this group of datasets.
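The feature-extraction pipeline described above (dense-SIFT descriptors → bag-of-visual-words histograms → PCA retaining 97% of the variance) can be sketched as follows. This is a minimal NumPy-only illustration, not the paper's implementation: random descriptors stand in for real dense-SIFT output, a tiny 16-word codebook replaces the 1000-word one, and a random sample of descriptors crudely substitutes for k-means cluster centers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for dense-SIFT output: each image yields many 128-D descriptors.
# (The paper quantizes into 1000 visual words; we use a tiny codebook here.)
n_images, descs_per_image, dim, n_words = 40, 200, 128, 16
images = [rng.normal(size=(descs_per_image, dim)) for _ in range(n_images)]

# Codebook: visual words learned from the pooled descriptors (here a random
# sample of descriptors, a crude substitute for k-means clustering).
all_descs = np.vstack(images)
codebook = all_descs[rng.choice(len(all_descs), n_words, replace=False)]

def bow_histogram(descs, codebook):
    """Assign each descriptor to its nearest visual word; return a normalized histogram."""
    d2 = ((descs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

X = np.array([bow_histogram(d, codebook) for d in images])

# PCA via SVD on centered data, keeping enough components for 97% variance.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
var_ratio = (S ** 2) / (S ** 2).sum()
k = int(np.searchsorted(np.cumsum(var_ratio), 0.97)) + 1
X_reduced = Xc @ Vt[:k].T
```

In the actual experiments each row of `X_reduced` would become the feature vector of one image, fed to the classifiers after an optional kernel mapping.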

Please cite this article as: B. Mei and Y. Xu, Multi-task least squares twin support vector machine for classification, Neurocomputing, https://doi.org/10.1016/j.neucom.2018.12.079


Table 3
Accuracy (%) comparison among multi-task methods on the Caltech 101 image dataset and the MAPs over all categories.

Category     #Tasks  MTPSVM        MTLS-SVM      MTL-aLS-SVM   DMTSVM        MCTSVM        MTLS-TWSVM
Birds        5       74.44 ± 3.41  73.83 ± 3.08  75.26 ± 2.94  74.47 ± 2.26  75.28 ± 1.56  74.44 ± 2.49
Insects      4       77.86 ± 7.24  77.36 ± 5.88  78.97 ± 6.11  75.96 ± 6.55  77.32 ± 6.96  75.96 ± 5.83
Flowers      3       77.67 ± 6.65  79.60 ± 3.57  80.28 ± 4.10  77.23 ± 3.24  79.71 ± 2.04  78.73 ± 4.53
Mammals      10      84.09 ± 6.53  83.85 ± 6.27  85.04 ± 5.19  82.76 ± 4.18  85.17 ± 5.08  83.03 ± 3.64
Instruments  6       71.99 ± 7.32  71.96 ± 9.40  76.98 ± 6.49  77.28 ± 5.82  77.77 ± 5.01  77.09 ± 5.75
MAP          28      77.21 ± 4.55  77.32 ± 4.71  79.31 ± 3.73  77.54 ± 3.14  79.05 ± 3.77  77.85 ± 3.29

Table 4
Training time (ms) comparison among multi-task methods on the Caltech 101 image dataset.

Category     #Tasks  MTPSVM  MTLS-SVM  MTL-aLS-SVM  DMTSVM  MCTSVM  MTLS-TWSVM
Birds        5       13.98   6.76      76.56        46.57   55.01   41.45
Insects      4       7.44    5.81      40.49        28.70   34.84   23.57
Flowers      3       3.18    3.64      20.09        15.74   18.85   12.27
Mammals      10      68.98   36.37     502.43       386.22  454.60  389.61
Instruments  6       22.34   14.73     129.68       88.99   111.20  88.31

Table 5
Accuracy (%) comparison among multi-task methods on the Caltech 256 image dataset and the MAPs over all categories.

Category     #Tasks  MTPSVM        MTLS-SVM      MTL-aLS-SVM   DMTSVM        MCTSVM        MTLS-TWSVM
Aircrafts    4       81.87 ± 2.47  81.57 ± 2.31  81.72 ± 2.18  78.91 ± 1.65  80.31 ± 3.13  79.84 ± 4.01
Balls        5       79.74 ± 5.01  78.02 ± 2.99  78.50 ± 5.51  77.25 ± 2.61  79.01 ± 4.28  77.25 ± 3.02
Bikes        6       78.77 ± 4.69  76.37 ± 4.92  84.28 ± 3.00  84.07 ± 3.64  81.97 ± 2.29  84.38 ± 3.86
Birds        9       77.35 ± 2.61  77.00 ± 2.79  84.44 ± 2.43  84.09 ± 2.71  80.97 ± 1.88  84.09 ± 2.73
Boats        4       75.93 ± 4.17  75.93 ± 4.50  79.21 ± 1.17  80.14 ± 2.09  79.22 ± 1.05  79.04 ± 1.50
Flowers      3       86.03 ± 1.58  84.59 ± 2.17  89.18 ± 2.19  88.98 ± 1.28  87.29 ± 0.94  88.97 ± 2.93
Instruments  5       80.77 ± 4.75  80.03 ± 4.51  83.14 ± 4.37  82.15 ± 4.39  80.02 ± 5.58  82.15 ± 3.70
Mammals      10      75.96 ± 4.03  74.96 ± 4.24  82.82 ± 4.23  81.14 ± 3.32  79.37 ± 2.87  81.45 ± 2.92
Plants       4       76.88 ± 5.04  76.41 ± 5.23  82.36 ± 2.87  81.11 ± 3.21  80.79 ± 3.57  81.88 ± 5.30
Vehicles     9       80.40 ± 4.88  80.33 ± 3.94  83.75 ± 3.60  82.63 ± 4.12  81.18 ± 3.71  82.84 ± 4.09
MAP          59      79.37 ± 3.13  78.52 ± 3.03  82.94 ± 2.97  82.05 ± 3.25  81.01 ± 2.40  82.19 ± 3.25

Table 6
Training time (ms) comparison among multi-task methods on the Caltech 256 image dataset.

Category     #Tasks  MTPSVM  MTLS-SVM  MTL-aLS-SVM  DMTSVM   MCTSVM   MTLS-TWSVM
Aircrafts    4       31.15   16.97     186.29       92.55    120.07   91.72
Balls        5       41.17   24.74     335.23       183.06   219.42   176.87
Bikes        6       41.25   43.37     486.43       286.77   334.65   269.58
Birds        9       130.33  79.38     1479.64      1004.37  1143.03  1006.43
Boats        4       43.52   15.82     172.50       85.32    119.10   84.42
Flowers      3       10.20   7.06      80.25        38.56    53.00    34.39
Instruments  5       25.43   25.30     316.43       190.89   213.21   182.36
Mammals      10      120.64  104.69    2120.97      1489.74  1633.91  1519.86
Plants       4       54.34   16.49     178.26       89.73    114.11   94.69
Vehicles     9       143.96  82.84     1613.70      1078.49  1212.05  1097.41

MTL-aLS-SVM achieves the highest accuracy on four of the five categories. It should also be pointed out that our model obtains comparable performance among the three multi-task TWSVMs; it even outperforms DMTSVM on multi-task average accuracy with a lower computational cost. Finally, in line with the previous conclusion, we should emphasize that the lack of samples in each group of tasks may negatively affect the performance of our algorithm.

Tables 5 and 6 illustrate our experimental results on the Caltech 256 dataset. It can be noticed that MTL-aLS-SVM obtains accuracy improvements on six of the ten categories; however, its computational cost is the highest among these six algorithms. Besides, both our MTLS-TWSVM and DMTSVM achieve comparable performance. Although MTPSVM and MTLS-SVM run faster than any other methods in our experiments, they show poor performance on eight of the ten categories, which is consistent with the previous conclusions on the benchmark datasets. In addition, because more samples are available for each task in this group of experiments, our algorithm achieves performance comparable to MTL-aLS-SVM, and even better than the other algorithms on most categories. On the Caltech 101 dataset, by contrast, only a small number of samples per category is available for each task, so the poorer results there are unsurprising. Finally, the average training time of our method is lower than that of the other multi-task TWSVMs.

5. Conclusion and future work

In this paper, we propose a novel multi-task least squares twin support vector machine (MTLS-TWSVM). In comparison with DMTSVM, our method only needs to solve a pair of systems of linear equations instead of a pair of QPPs. This overcomes the shortcoming of DMTSVM and leads to high computational efficiency. Compared with five single-task learning algorithms and five multi-task learning algorithms, our algorithm achieves better experimental results on three traditional multi-task datasets. In addition, the experimental results on two image recognition datasets also demonstrate the effectiveness of our algorithm. Research on the robustness of multi-task SVMs and TWSVMs will be our future work.
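The computational claim above, that a linear-equation solve replaces a QPP, can be illustrated with the single-task least squares twin SVM of Kumar and Gopal [50], which is the building block of our method. The sketch below is only that single-task core on synthetic 2-D data; the multi-task coupling terms of MTLS-TWSVM are omitted, and all data, sizes and parameter values are chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two synthetic point clouds in 2-D, one per class.
A = rng.normal(loc=(-2.0, 0.0), scale=0.5, size=(60, 2))  # class +1
B = rng.normal(loc=(+2.0, 0.0), scale=0.5, size=(60, 2))  # class -1
c1 = c2 = 1.0  # trade-off parameters

# Augment with a bias column: E = [A e], F = [B e].
E = np.hstack([A, np.ones((len(A), 1))])
F = np.hstack([B, np.ones((len(B), 1))])

# Each nonparallel hyperplane z = [w; b] has a closed form: a single
# linear system instead of a dual quadratic programming problem.
z1 = -np.linalg.solve(F.T @ F + (1.0 / c1) * E.T @ E, F.T @ np.ones(len(B)))
z2 = np.linalg.solve(E.T @ E + (1.0 / c2) * F.T @ F, E.T @ np.ones(len(A)))

def predict(X, z1, z2):
    """Assign each point to the class whose hyperplane is nearer."""
    Xa = np.hstack([X, np.ones((len(X), 1))])
    d1 = np.abs(Xa @ z1) / np.linalg.norm(z1[:2])
    d2 = np.abs(Xa @ z2) / np.linalg.norm(z2[:2])
    return np.where(d1 <= d2, 1, -1)

acc = (predict(np.vstack([A, B]), z1, z2)
       == np.r_[np.ones(60), -np.ones(60)]).mean()
```

Because each solve involves only a (d+1) × (d+1) system, the cost grows with the feature dimension rather than with the number of dual variables, which is the source of the speed-up reported in the experiments.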



Acknowledgments

The authors gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation. This work was supported in part by the Beijing Natural Science Foundation (No. 4172035) and the National Natural Science Foundation of China (No. 11671010).

References

[1] R. Caruana, Multitask learning, Mach. Learn. 28 (1) (1997) 41–75.
[2] J.M. Leiva-Murillo, L. Gomez-Chova, G. Camps-Valls, Multitask remote sensing data classification, IEEE Trans. Geosci. Remote Sens. 51 (1) (2013) 151–161.
[3] S. Zhao, H. Yao, S. Zhao, X. Jiang, X. Jiang, Multi-modal microblog classification via multi-task learning, Multimed. Tools Appl. 75 (15) (2016) 8921–8938.
[4] S. Sun, Multitask learning for EEG-based biometrics, in: Proceedings of the 19th International Conference on Pattern Recognition, 2008, pp. 1–4.
[5] Y. Zhang, Q. Yang, A survey on multi-task learning, arXiv:1707.08114, 2017.
[6] Y. Zhang, Q. Yang, An overview of multi-task learning, Nat. Sci. Rev. 5 (1) (2018) 30–43.
[7] Z. Xu, K. Kersting, Multi-task learning with task relations, in: Proceedings of the IEEE 11th International Conference on Data Mining, 2011, pp. 884–893.
[8] K.H. Fiebig, V. Jayaram, J. Peters, M. Grosse-Wentrup, Multi-task logistic regression in brain-computer interfaces, in: Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2016, pp. 2307–2312.
[9] Y. Yan, E. Ricci, R. Subramanian, G. Liu, N. Sebe, Multitask linear discriminant analysis for view invariant action recognition, IEEE Trans. Image Process. 23 (12) (2014) 5599–5611.
[10] B. Bakker, T. Heskes, Task clustering and gating for Bayesian multitask learning, J. Mach. Learn. Res. 4 (2003) 83–99.
[11] K. Yu, V. Tresp, A. Schwaighofer, Learning Gaussian processes from multiple tasks, in: Proceedings of the 22nd International Conference on Machine Learning, 2005, pp. 1012–1019.
[12] O. Chapelle, P.K. Shivaswamy, S. Vadrevu, K.Q. Weinberger, Y. Zhang, B.L. Tseng, Boosted multi-task learning, Mach. Learn. 85 (2011) 149–173.
[13] A. Argyriou, T. Evgeniou, M. Pontil, Convex multi-task feature learning, Mach. Learn. 73 (3) (2008) 243–272.
[14] P. Gong, J. Ye, C. Zhang, Multi-stage multi-task feature learning, Neural Inf. Process. Syst. 14 (2012) 1988–1996.
[15] Y. Li, X. Tian, T. Liu, D. Tao, On better exploring and exploiting task relationships in multitask learning: joint model and feature learning, IEEE Trans. Neural Netw. 29 (5) (2018) 1975–1985.
[16] J. Chen, J. Liu, J. Ye, Learning incoherent sparse and low-rank patterns from multiple tasks, ACM Trans. Knowl. Discov. Data 5 (4) (2012) 22.
[17] C. Su, F. Yang, S. Zhang, Q. Tian, L.S. Davis, W. Gao, Multi-task learning with low rank attribute embedding for multi-camera person re-identification, IEEE Trans. Pattern Anal. Mach. Intell. 40 (5) (2018) 1167–1181.
[18] K. Qi, W. Liu, C. Yang, Q. Guan, H. Wu, Multi-task joint sparse and low-rank representation for the scene classification of high-resolution remote sensing image, Remote Sens. 9 (1) (2016) 10.
[19] J. Zhou, J. Chen, J. Ye, Clustered multi-task learning via alternating structure optimization, Adv. Neural Inf. Process. Syst. 2011 (2011) 702–710.
[20] F. Nie, Z. Hu, X. Li, Calibrated multi-task learning, in: Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2018, pp. 2012–2021.
[21] V. Smith, M. Sanjabi, C. Chiang, A.S. Talwalkar, Federated multi-task learning, Neural Inf. Process. Syst. (2017) 4424–4434.
[22] I.M. Baytas, M. Yan, A.K. Jain, J. Zhou, Asynchronous multi-task learning, in: Proceedings of the International Conference on Data Mining, 2016, pp. 11–20.
[23] K. Lin, J. Zhou, Interactive multi-task relationship learning, in: Proceedings of the IEEE 16th International Conference on Data Mining (ICDM), 2016, pp. 241–250.
[24] C.J. Burges, A tutorial on support vector machines for pattern recognition, Data Min. Knowl. Discov. 2 (2) (1998) 121–167.
[25] H.T. Shiao, V.S. Cherkassky, Implementation and comparison of SVM-based multi-task learning methods, in: Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2012, pp. 1–7.
[26] Y. Xue, P. Beauseroy, Multi-task learning for one-class SVM with additional new features, in: Proceedings of the 23rd International Conference on Pattern Recognition (ICPR), 2016, pp. 1571–1576.
[27] H. Yang, I. King, M.R. Lyu, Multi-task learning for one-class classification, in: Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2010, pp. 1–8.
[28] T. Evgeniou, M. Pontil, Regularized multi-task learning, in: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 109–117.
[29] X. He, G. Mourot, D. Maquin, J. Ragot, P. Beauseroy, A. Smolarz, E. Grall-Maës, Multi-task learning with one-class SVM, Neurocomputing 133 (2014) 416–426.
[30] J.A. Suykens, J. Vandewalle, Least squares support vector machine classifiers, Neural Process. Lett. 9 (3) (1999) 293–300.

[31] S. Xu, X. An, X. Qiao, L. Zhu, Multi-task least-squares support vector machines, Multimed. Tools Appl. 71 (2) (2014) 699–715.
[32] Y. Li, X. Tian, M. Song, D. Tao, Multi-task proximal support vector machine, Pattern Recogn. 48 (10) (2015) 3249–3257.
[33] G. Fung, O.L. Mangasarian, Proximal support vector machine classifiers, in: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001, pp. 77–86.
[34] L. Lu, Q. Lin, H. Pei, P. Zhong, The aLS-SVM based multi-task learning classifiers, Appl. Intell. 48 (8) (2018) 2393–2407.
[35] Y. Ji, S. Sun, Multitask multiclass support vector machines: model and experiments, Pattern Recogn. 46 (3) (2013) 914–924.
[36] P.X. Gao, Facial age estimation using clustered multi-task support vector regression machine, in: Proceedings of the 21st International Conference on Pattern Recognition (ICPR), 2012, pp. 541–544.
[37] C. Widmer, M. Kloft, N. Görnitz, G. Rätsch, Efficient training of graph-regularized multitask SVMs, in: Proceedings of the ECML PKDD 12th European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I, 2012, pp. 633–647.
[38] S. Wang, X. Chang, X. Li, Q.Z. Sheng, W. Chen, Multi-task support vector machines for feature selection with shared knowledge discovery, Signal Process. 120 (2016) 746–753.
[39] Jayadeva, R. Khemchandani, S. Chandra, Twin support vector machines for pattern classification, IEEE Trans. Pattern Anal. Mach. Intell. 29 (5) (2007) 905–910.
[40] Y. Xu, L. Wang, A weighted twin support vector regression, Knowl. Based Syst. 33 (2012) 92–101.
[41] X. Pan, Y. Luo, Y. Xu, K-nearest neighbor based structural twin support vector machine, Knowl. Based Syst. 88 (2015) 34–44.
[42] Y. Xu, Z. Yang, X. Pan, A novel twin support-vector machine with pinball loss, IEEE Trans. Neural Netw. Learn. Syst. 28 (2) (2017) 359–370.
[43] H. Yan, Q. Ye, T. Zhang, D. Yu, Y. Xu, L1-norm GEPSVM classifier based on an effective iterative algorithm for classification, Neural Process. Lett. 48 (1) (2018) 273–298.
[44] H. Yan, Q. Ye, T. Zhang, D. Yu, X. Yuan, Y. Xu, L. Fu, Least squares twin bounded support vector machines based on L1-norm distance metric for classification, Pattern Recogn. 74 (2018) 434–447.
[45] Y. Shao, N. Deng, Z. Yang, Least squares recursive projection twin support vector machine for classification, Pattern Recogn. 45 (6) (2012) 2299–2307.
[46] R. Yan, Q. Ye, L. Zhang, N. Ye, X. Shu, A feature selection method for projection twin support vector machine, Neural Process. Lett. 47 (1) (2018) 21–38.
[47] Y. Tian, Z. Qi, X. Ju, Y. Shi, X. Liu, Nonparallel support vector machines for pattern classification, IEEE Trans. Syst. Man Cybern. 44 (7) (2014) 1067–1079.
[48] X. Xie, S. Sun, Multitask twin support vector machines, in: Proceedings of the 19th International Conference on Neural Information Processing (ICONIP) - Volume Part II, 2012, pp. 341–348.
[49] X. Xie, S. Sun, Multitask centroid twin support vector machines, Neurocomputing 149 (2015) 1085–1091.
[50] M.A. Kumar, M. Gopal, Least squares twin support vector machines for pattern classification, Expert Syst. Appl. 36 (4) (2009) 7535–7543.
[51] F. Li, R. Fergus, P. Perona, One-shot learning of object categories, IEEE Trans. Pattern Anal. Mach. Intell. 28 (4) (2006) 594–611.
[52] G. Griffin, A. Holub, P. Perona, The Caltech 256, Caltech Technical Report, 2006.
[53] F. Li, R. Fergus, P. Perona, Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories, Comput. Vis. Image Underst. 106 (1) (2007) 59–70.

Benshan Mei was born in China in 1995. He received the B.E. degree from the College of Computer Science, Faculty of Information Technology, Beijing University of Technology, Beijing, China, in 2017. He has been pursuing the master's degree in the College of Information and Electrical Engineering, China Agricultural University, Beijing, China, since 2017. His current research interests include machine learning, multi-task learning and transfer learning.

Yitian Xu received the Ph.D. degree from the College of Science, China Agricultural University, Beijing, China, in 2007. He is currently a Professor in the College of Science, China Agricultural University. He has authored about 50 papers. His current research interests include machine learning and data mining. Prof. Xu's research has appeared in IEEE Transactions on Neural Networks and Learning Systems, IEEE Transactions on Cybernetics, Information Sciences, Pattern Recognition, Knowledge-Based Systems, Neurocomputing, and so on.
