Efficient sparse least squares support vector machines for pattern classification

Contents lists available at ScienceDirect

Computers and Mathematics with Applications journal homepage: www.elsevier.com/locate/camwa

Yingjie Tian ∗, Xuchan Ju, Zhiquan Qi, Yong Shi
Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences, Beijing, China

Keywords: Least squares support vector machine; Sparseness; Loss function; Classification; Regression

Abstract. We propose a novel least squares support vector machine, named the ε-least squares support vector machine (ε-LSSVM), for binary classification. By introducing the ε-insensitive loss function in place of the quadratic loss function in the LSSVM, the ε-LSSVM has several advantages over the plain LSSVM. (1) It has sparseness, which is controlled by the parameter ε. (2) By weighting different sparseness parameters ε for each class, the unbalanced problem can be solved successfully; furthermore, a useful choice of the parameter ε is proposed. (3) It is actually a kind of ε-support vector regression (ε-SVR), the only difference being that it treats the binary classification problem as a special kind of regression problem. (4) It can therefore be implemented efficiently by the sequential minimal optimization (SMO) method for large scale problems. Experimental results on several benchmark datasets show the effectiveness of our method in sparseness, balance performance and classification accuracy, and thereby confirm the above conclusions. © 2013 Elsevier Ltd. All rights reserved.

1. Introduction

Support vector machines (SVMs), which were introduced by Vapnik and his co-workers in the early 1990s [1–3], are computationally powerful tools for supervised learning [4,5] and have outperformed most other methods in a wide variety of applications [6–11]. Least squares support vector machines (LSSVMs) [12,13] need only solve a linear system instead of the quadratic programming problem (QPP) of standard SVMs, and extensive empirical comparisons [14] show that LSSVMs obtain good performance on various classification and regression problems. LSSVMs have been studied extensively [15–18]. Unfortunately, there are two drawbacks in the plain LSSVM. (1) Unlike the standard SVM, which employs a soft-margin loss function for classification and an ε-insensitive loss function for regression, the LSSVM loses sparseness by using a quadratic loss function. (2) Another obvious limitation is that although a linear system is in principle solvable [19], solving it is in practice intractable for a large dataset by classical techniques, since the computational complexity is usually of order O(l³) (l is the size of the training set); this severely limits the utility of LSSVMs in large scale applications. Many papers in the literature have considered these two issues. As for fast algorithms for LSSVMs, Suykens et al. [20] presented an iterative algorithm based on the conjugate gradient method, and Chu et al. [21] improved the conjugate gradient algorithm by solving one reduced linear system. Keerthi and Shevade [22] extended the well-known sequential minimal optimization (SMO) [23] algorithm of SVMs to the solution of LSSVMs.
For problems with very large numbers of data points but small numbers of features, Chua [24] proposed a method that involves working with (and storing) matrices that are at most of size l × n (l is the size of the training set, n is the number of features), extending the possible range of application of LSSVMs. However, the solutions resulting from the above methods are still not sparse.



Corresponding author. Tel.: +86 10 82680997. E-mail addresses: [email protected] (Y. Tian), [email protected] (X. Ju), [email protected] (Z. Qi), [email protected] (Y. Shi).

0898-1221/$ – see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.camwa.2013.06.028


As for sparse algorithms for LSSVMs, a range of methods is available; they can be roughly grouped into two major classes: pruning and fixed size. In the first class, a simple approach to introducing sparseness is based on the sorted support value spectrum (SVS): the network is pruned by gradually removing points from the training set [25,26]. A more sophisticated mechanism weights the support values and then selects the data point whose omission introduces the smallest error; this pruning method is claimed to outperform the standard scheme [27]. Hoegaerts et al. [28] suggested an improved selection of the pruning point based on a derived criterion. Zeng and Chen [29] proposed an SMO-based pruning method: the SMO method is introduced into the pruning process, and instead of determining the pruning points by errors, the data points that introduce minimum changes to a dual objective function are omitted. Li et al. [30] selected the reduced classification training set based on yi f(xi) (f(x) is the decision function and y the label) instead of the support value. The second class mainly considers fixed-size LSSVMs for quickly finding a sparse approximate solution of LSSVMs, in which a reduced set of candidate support vectors is used in the primal space [13] or in kernel space [31–34]. However, there are still shortcomings in the existing sparse LSSVMs. The first class imposes sparseness by gradually omitting the least important data from the training set and re-estimating the LSSVM, which is time consuming. The second class assumes that the weight vector w can be represented as a weighted sum of a limited number (far fewer than the size of the training set) of basis vectors, which is a rough approximation and not theoretically guaranteed. In this paper, we propose a novel LSSVM, termed ε-LSSVM, for binary classification.
The ε-LSSVM introduces the ε-insensitive loss function in place of the quadratic loss function in the LSSVM. (1) It has sparseness, which is controlled by the parameter ε. (2) By weighting different sparseness parameters for each class, the unbalanced problem can be solved successfully; furthermore, we also propose a useful choice of the parameter ε. (3) It is actually a kind of ε-support vector regression (ε-SVR) [3–5], the only difference being that it treats the binary classification problem as a special kind of regression problem. (4) Consequently, it can be implemented efficiently by SMO for large scale problems.

The paper is organized as follows. Section 2 briefly reviews the standard C-support vector machine for classification (C-SVC) and LSSVMs. Section 3 proposes our ε-LSSVM, and a weighted ε-LSSVM is given in Section 4. Section 5 deals with experimental results. Section 6 contains concluding remarks.

2. Background

In this section, we give a brief outline of C-SVC and LSSVMs.

2.1. C-SVC

Consider the binary classification problem with the training set

$$T = \{(x_1, y_1), \ldots, (x_l, y_l)\} \in (\mathbb{R}^n \times \mathcal{Y})^l, \tag{1}$$

where x_i ∈ R^n, y_i ∈ Y = {1, −1}, i = 1, …, l. Standard C-SVC formulates the problem as a convex quadratic programming problem (QPP)

$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}\xi_i,$$
$$\text{s.t.}\quad y_i((w \cdot x_i) + b) \geqslant 1 - \xi_i,\quad \xi_i \geqslant 0,\quad i = 1, \ldots, l, \tag{2}$$

where ξ = (ξ_1, …, ξ_l)^⊤ and C > 0 is a penalty parameter. For this primal problem, C-SVC solves its Lagrangian dual problem

$$\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{l}\alpha_i,$$
$$\text{s.t.}\quad \sum_{i=1}^{l} y_i\alpha_i = 0,\quad 0 \leqslant \alpha_i \leqslant C,\quad i = 1, \ldots, l, \tag{3}$$

where K(x, x′) is the kernel function; this dual is also a convex QPP, from whose solution the decision function is constructed.

2.2. LSSVM

For the given training set (1), the primal problem of the standard LSSVM is

$$\min_{w,b,\eta}\ \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_{i=1}^{l}\eta_i^2,$$
$$\text{s.t.}\quad y_i((w \cdot x_i) + b) = 1 - \eta_i,\quad i = 1, \ldots, l. \tag{4}$$


Fig. 1. Geometric interpretation of LSSVM: positive points represented by ‘‘+’’s, negative points represented by ‘‘∗’’s, positive proximal line (w · x) + b = 1 (down left line), negative proximal line (w · x) + b = −1 (top right line), separating line (w · x) + b = 0 (middle line).

The geometric interpretation of the above problem with x ∈ R² is shown in Fig. 1: minimizing \(\frac{1}{2}\|w\|^2\) realizes the maximum margin between the positive proximal straight line and the negative proximal straight line

$$(w \cdot x) + b = 1 \quad \text{and} \quad (w \cdot x) + b = -1, \tag{5}$$

while minimizing \(\sum_{i=1}^{l}\eta_i^2\) makes the straight lines (5) proximal to all positive inputs and all negative inputs, respectively. Its dual problem is also a convex QPP

$$\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j\left(K(x_i, x_j) + \frac{\delta_{ij}}{C}\right) - \sum_{i=1}^{l}\alpha_i,$$
$$\text{s.t.}\quad \sum_{i=1}^{l}\alpha_i y_i = 0, \tag{6}$$

where K(x, x′) is the kernel function and

$$\delta_{ij} = \begin{cases} 1, & i = j; \\ 0, & i \neq j. \end{cases} \tag{7}$$

For the choice of the kernel function K(x, x′), one has several possibilities: K(x, x′) = (x · x′) (linear kernel); K(x, x′) = ((x · x′) + 1)^d (polynomial kernel of degree d); K(x, x′) = exp(−‖x − x′‖²/σ²) (RBF kernel); K(x, x′) = tanh(κ(x · x′) + θ) (sigmoid kernel), etc. The solution of the above problem is given by the following set of linear equations

$$\begin{pmatrix} 0 & -Y^\top \\ Y & \Omega + C^{-1}I \end{pmatrix}\begin{pmatrix} b \\ \alpha \end{pmatrix} = \begin{pmatrix} 0 \\ e \end{pmatrix}, \tag{8}$$

where Y = (y_1, …, y_l)^⊤, Ω = (Ω_{ij})_{l×l} = (y_i y_j K(x_i, x_j))_{l×l}, I is the identity matrix and e = (1, …, 1)^⊤ ∈ R^l; therefore the decision function is

$$f(x) = \operatorname{sgn}(g(x)) = \operatorname{sgn}\left(\sum_{i=1}^{l}\alpha_i y_i K(x_i, x) + b\right). \tag{9}$$

The support values α_i are proportional to the errors at the data points, since

$$\alpha_i = C\eta_i,\quad i = 1, \ldots, l. \tag{10}$$
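As a concrete illustration, the linear system (8) can be solved directly with NumPy. The sketch below is our own illustration, not the authors' code; the helper names and the RBF bandwidth handling are our choices. It trains an LSSVM and lets one check the relation α_i = Cη_i of (10) numerically.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """RBF kernel matrix: K(a, b) = exp(-||a - b||^2 / sigma^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma ** 2)

def lssvm_train(X, y, C=10.0, sigma=1.0):
    """Solve the LSSVM linear system (8).

    The first block row enforces sum_i y_i alpha_i = 0; the remaining
    rows read y_i * b + sum_j (Omega + I/C)_ij alpha_j = 1.
    """
    l = len(y)
    Omega = np.outer(y, y) * rbf_kernel(X, X, sigma)
    A = np.zeros((l + 1, l + 1))
    A[0, 1:] = -y                       # block row (0, -Y^T)
    A[1:, 0] = y                        # block row (Y, Omega + C^{-1} I)
    A[1:, 1:] = Omega + np.eye(l) / C
    rhs = np.concatenate(([0.0], np.ones(l)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]              # b, alpha

def lssvm_decision(X_train, y_train, alpha, b, X, sigma=1.0):
    """g(x) = sum_i alpha_i y_i K(x_i, x) + b, as in (9)."""
    return rbf_kernel(X, X_train, sigma) @ (alpha * y_train) + b
```

Note that every α_i is nonzero in general, which is precisely the lack of sparseness discussed here.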

Clearly, points located close to the two hyperplanes (w · x) + b = ±1 have the smallest support values; one should rather speak of the support value spectrum in the least squares case than of support vectors as in standard C-SVC.

3. ε-LSSVM

As the points located close to the two hyperplanes (w · x) + b = ±1 have the smallest support values, they contribute less to the decision function (9). Following the idea of the ε-insensitive loss function for the regression problem, the following




Fig. 2. Geometric interpretation of ε -LSSVM: positive proximal line (w · x) + b = 1 (down left thick line), negative proximal line (w · x) + b = −1 (top right thick line), positive ε -bounded lines (w · x) + b = 1 ± ε (down left dotted lines), negative ε -bounded lines (w · x) + b = −1 ± ε (top right dotted lines), separating line (w · x) + b = 0 (middle line).

optimization problem is constructed:

$$\min_{w,b,\xi^{(*)}}\ \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_{i=1}^{l}(\xi_i^2 + \xi_i^{*2}),$$
$$\text{s.t.}\quad -1 - \varepsilon - \xi_i^* \leqslant (w \cdot x_i) + b \leqslant -1 + \varepsilon + \xi_i,\quad \text{for } y_i = -1,$$
$$\phantom{\text{s.t.}\quad} 1 - \varepsilon - \xi_i^* \leqslant (w \cdot x_i) + b \leqslant 1 + \varepsilon + \xi_i,\quad \text{for } y_i = 1,$$
$$\phantom{\text{s.t.}\quad} \xi_i, \xi_i^* \geqslant 0,\quad i = 1, \ldots, l, \tag{11}$$

where ε ⩾ 0 is a prior parameter. Now we discuss the primal problem (11) geometrically in R² (see Fig. 2). On the one hand, we hope that the positive class lies as much as possible in the ε-band between the bounded hyperplanes (w · x) + b = 1 + ε and (w · x) + b = 1 − ε, and that the negative class lies as much as possible in the ε-band between the hyperplanes (w · x) + b = −1 + ε and (w · x) + b = −1 − ε; here the errors ξ_i + ξ_i^*, i = 1, …, l, are measured by the ε-insensitive loss function. On the other hand, we still hope to maximize the margin between the two proximal hyperplanes (w · x) + b = 1 and (w · x) + b = −1. Based on these two considerations, problem (11) is established and the structural risk minimization principle is implemented naturally. For problem (11), the constraint ξ_i, ξ_i^* ⩾ 0, i = 1, …, l, is redundant: a negative value of ξ_i or ξ_i^* cannot appear in a solution of the problem with this constraint removed, since the corresponding feasible solution with ξ_i = 0 or ξ_i^* = 0 gives a lower value of the objective function. Hence, problem (11) is equivalent to the following problem

$$\min_{w,b,\xi,\xi^*}\ \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_{i=1}^{l}(\xi_i^2 + \xi_i^{*2}),$$
$$\text{s.t.}\quad (w \cdot x_i) + b - y_i \leqslant \varepsilon + \xi_i,\quad i = 1, \ldots, l,$$
$$\phantom{\text{s.t.}\quad} y_i - (w \cdot x_i) - b \leqslant \varepsilon + \xi_i^*,\quad i = 1, \ldots, l. \tag{12}$$

Interestingly but not surprisingly, we can see that problem (12) is in fact the ε-support vector regression machine with L2-loss (L2-SVR [35]) applied to the training set (1); here it takes y_i as ±1 for positive and negative inputs, respectively. Now we map the training set T by a mapping Φ(x) into a Hilbert space H. In order to get the solution of problem (12) in H, we need to derive its dual problem. Introducing the Lagrangian

$$L(w, b, \xi, \xi^*, \alpha, \alpha^*) = \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_{i=1}^{l}(\xi_i^2 + \xi_i^{*2}) + \sum_{i=1}^{l}\alpha_i((w \cdot \Phi(x_i)) + b - y_i - \varepsilon - \xi_i) + \sum_{i=1}^{l}\alpha_i^*(y_i - (w \cdot \Phi(x_i)) - b - \varepsilon - \xi_i^*), \tag{13}$$


where α, α* are the Lagrange multiplier vectors, the dual problem is obtained:

$$\min_{\alpha^{(*)}}\ \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}(\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)K(x_i, x_j) + \frac{1}{2C}\sum_{i=1}^{l}(\alpha_i^2 + \alpha_i^{*2}) + \varepsilon\sum_{i=1}^{l}(\alpha_i^* + \alpha_i) - \sum_{i=1}^{l}y_i(\alpha_i^* - \alpha_i),$$
$$\text{s.t.}\quad \sum_{i=1}^{l}(\alpha_i - \alpha_i^*) = 0,\quad \alpha_i, \alpha_i^* \geqslant 0,\quad i = 1, \ldots, l, \tag{14}$$

where K(x, x′) = (Φ(x) · Φ(x′)) is the kernel function. For this dual problem, we have the following conclusions.

Theorem 3.1. If ᾱ, ᾱ* is a solution of problem (14), then ᾱ_i ᾱ_i^* = 0 for i = 1, …, l.

Proof. If ᾱ_i > 0, then from the KKT conditions

$$\bar\alpha_i((w \cdot x_i) + b - y_i - \varepsilon - \xi_i) = 0, \tag{15}$$
$$C\xi_i - \bar\alpha_i = 0, \tag{16}$$

we have

$$(w \cdot x_i) + b - y_i - \varepsilon = \xi_i > 0, \tag{17}$$

so that y_i − (w · x_i) − b − ε − ξ_i^* < 0, and from the KKT condition

$$\bar\alpha_i^*(y_i - (w \cdot x_i) - b - \varepsilon - \xi_i^*) = 0, \tag{18}$$

we get ᾱ_i^* = 0. And vice versa: for ᾱ_i^* > 0, we have ᾱ_i = 0. □

Theorem 3.2. Problem (6) is equivalent to problem (14) with ε = 0.

Proof. Let

$$y_i\beta_i = \alpha_i^* - \alpha_i,\quad i = 1, \ldots, l; \tag{19}$$

since y_i = 1 or −1, this gives

$$\beta_i = y_i(\alpha_i^* - \alpha_i),\quad i = 1, \ldots, l. \tag{20}$$

Setting ε = 0, problem (14) degenerates to the problem

$$\min_{\beta}\ \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\beta_i\beta_j y_i y_j\tilde K(x_i, x_j) - \sum_{i=1}^{l}\beta_i,$$
$$\text{s.t.}\quad \sum_{i=1}^{l}y_i\beta_i = 0, \tag{21}$$

which is the same as problem (6), where \(\tilde K(x_i, x_j) = K(x_i, x_j) + \delta_{ij}/C\), i, j = 1, …, l. □
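The dual (14) can also be written compactly as a QP over a stacked variable, which is convenient both for checking the derivation and for handing the problem to a generic QP solver. The NumPy sketch below is our own illustration; the stacking order z = (α; α*) is an ordering we chose for concreteness, not one fixed by the paper.

```python
import numpy as np

def dual_matrices(K, y, C, eps):
    """Build Q, p, a so that the dual (14) reads
    min_z 0.5 z^T Q z + p^T z,  s.t. a^T z = 0, z >= 0,
    with z = (alpha; alpha*) stacked in R^{2l}."""
    l = len(y)
    # 0.5 sum_ij (a*_i - a_i)(a*_j - a_j) K_ij  +  (1/2C) sum_i (a_i^2 + a*_i^2)
    Q = np.block([[K, -K], [-K, K]]) + np.eye(2 * l) / C
    # eps * sum_i (a*_i + a_i)  -  sum_i y_i (a*_i - a_i)
    p = np.concatenate([eps + y, eps - y])
    # equality constraint sum_i (a_i - a*_i) = 0
    a = np.concatenate([np.ones(l), -np.ones(l)])
    return Q, p, a
```

Q is symmetric and positive definite whenever K is positive semidefinite, since the I/C term adds a strictly positive diagonal; this is what makes (14) a strictly convex QPP.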

Now we are in a position to declare that the LSSVM for binary classification can be implemented by L2-SVR on the same training set with ε = 0. And if we want to endow the LSSVM with the valuable sparseness, we only need to apply standard L2-SVR to the classification problem to get the support vectors; the ε-LSSVM is thereby established.

Algorithm 3.3 (ε-LSSVM).
(1) Input the training set (1);
(2) Choose an appropriate kernel function K(x, x′) and parameters C > 0 and ε > 0;
(3) Construct and solve the convex QPP (14), obtaining a solution ᾱ, ᾱ*;
(4) Compute b̄: if some ᾱ_j > 0 is chosen, compute

$$\bar b = y_j - \sum_{i=1}^{l}(\bar\alpha_i^* - \bar\alpha_i)K(x_i, x_j) + \varepsilon; \tag{22}$$

if some ᾱ_k^* > 0 is chosen, compute

$$\bar b = y_k - \sum_{i=1}^{l}(\bar\alpha_i^* - \bar\alpha_i)K(x_i, x_k) - \varepsilon. \tag{23}$$


(5) Construct the decision function

$$y = \operatorname{sgn}(g(x)) = \operatorname{sgn}\left(\sum_{i=1}^{l}(\bar\alpha_i^* - \bar\alpha_i)K(x_i, x) + \bar b\right). \tag{24}$$
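Algorithm 3.3 reduces ε-LSSVM training to solving an ε-SVR problem with ±1 targets. As a quick way to experiment with this idea, one can train any off-the-shelf ε-SVR on the labels and classify by the sign of the regression output; the scikit-learn sketch below does this. Note that it is only an approximation of the method above: scikit-learn's SVR uses the L1 ε-insensitive loss of standard ε-SVR, not the squared (L2) variant of problem (12), and the function name is ours.

```python
import numpy as np
from sklearn.svm import SVR

def eps_lssvm_like(X, y, C=10.0, eps=0.3, sigma=1.0):
    """Train an eps-SVR on the +/-1 labels; classify by sign(g(x)).

    Larger eps leaves more points inside the eps-tube around +/-1,
    so fewer training points become support vectors (sparseness).
    """
    reg = SVR(kernel="rbf", gamma=1.0 / sigma ** 2, C=C, epsilon=eps)
    reg.fit(X, y)
    predict = lambda X_new: np.sign(reg.predict(X_new))
    return predict, reg.support_   # decision rule and support vector indices
```

On a toy problem the sparseness effect is easy to see: with a tiny ε almost every point is a support vector, while with a large ε only the points near the proximal surfaces remain.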

Obviously, solving problem (14) can be implemented efficiently by LIBSVM [36], since it is actually a variation of ε-SVR. In fact, problem (14) can be concisely formulated as

$$\min_{\beta}\ \frac{1}{2}\beta^\top Q\beta + p^\top\beta,\quad \text{s.t.}\quad y^\top\beta = 0,\quad \beta \geqslant 0, \tag{25}$$

where Q ∈ R^{2l×2l} and β, p, y ∈ R^{2l}. It is proved in [36] that for such a problem an SMO-type decomposition method [37], as implemented in LIBSVM, has complexity (1) #Iterations × O(l) if most columns of Q are cached throughout the iterations, and (2) #Iterations × O(nl) if columns of Q are not cached and each kernel evaluation costs O(n). It is also pointed out in [36] that there is no theoretical result yet on LIBSVM's number of iterations; empirically, the number of iterations may grow more than linearly with the number of training data.

4. Weighted ε-LSSVM

For the unbalanced classification problem, different from weighting C for each class (C-LSSVM) [26],

$$\min_{w,b,\eta}\ \frac{1}{2}\|w\|^2 + \frac{C_+}{2}\sum_{y_i=1}\eta_i^2 + \frac{C_-}{2}\sum_{y_i=-1}\eta_i^2,$$
$$\text{s.t.}\quad y_i((w \cdot x_i) + b) = 1 - \eta_i,\quad i = 1, \ldots, l, \tag{26}$$

our ε-LSSVM applies a weighted sparseness parameter ε to each class, and the primal problem is constructed as

$$\min_{w,b,\xi^{(*)}}\ \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_{i=1}^{l}(\xi_i^2 + \xi_i^{*2}),$$
$$\text{s.t.}\quad -1 - \varepsilon_- - \xi_i^* \leqslant (w \cdot x_i) + b \leqslant -1 + \varepsilon_- + \xi_i,\quad \text{for } y_i = -1,$$
$$\phantom{\text{s.t.}\quad} 1 - \varepsilon_+ - \xi_i^* \leqslant (w \cdot x_i) + b \leqslant 1 + \varepsilon_+ + \xi_i,\quad \text{for } y_i = 1, \tag{27}$$

and obviously its dual problem is

$$\min_{\alpha^{(*)}}\ \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}(\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)\tilde K(x_i, x_j) + \varepsilon_-\sum_{y_i=-1}(\alpha_i^* + \alpha_i) + \varepsilon_+\sum_{y_i=1}(\alpha_i^* + \alpha_i) - \sum_{i=1}^{l}y_i(\alpha_i^* - \alpha_i),$$
$$\text{s.t.}\quad \sum_{i=1}^{l}(\alpha_i - \alpha_i^*) = 0,\quad \alpha_i, \alpha_i^* \geqslant 0,\quad i = 1, \ldots, l, \tag{28}$$

where \(\tilde K(x_i, x_j) = K(x_i, x_j) + \delta_{ij}/C\), i, j = 1, …, l.

If the positive class is smaller than the negative class, a smaller ε+ than ε− should be chosen; then more negative points than positive points turn out to be non-support vectors, and the problem is balanced. A recommended choice range for ε− and ε+ is (0, 1), and the relation between them satisfies

$$l_+(1 - \varepsilon_+) \cong l_-(1 - \varepsilon_-), \tag{29}$$

where l+ and l− are the numbers of positive points and negative points, respectively. Eq. (29) means that the number of positive points outside the ε+-band approximately equals the number of negative points outside the ε−-band; equivalently, the number of positive SVs approximately equals the number of negative SVs.

In order to illustrate the proposed weighted ε-LSSVM, we generated a small unbalanced artificial two-dimensional two-class dataset [38]. The dataset consists of 100 points, 15 of which are positive and 85 negative. When the problem is solved using the plain LSSVM (4), the influence of the 85 negative points prevails over that of the much smaller set of positive points. As a result, 5 out of the 15 positive points are misclassified. The total training set correctness is 95%, with only 66.7% correctness for the smaller positive class and 100% correctness for the larger negative class. The resulting separating plane is shown in Fig. 3. When a weighted C-LSSVM is used with C+ = (85/15) × C−, we can see an improvement over the plain
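Relation (29) gives a direct recipe for the class-wise parameters: fix ε− and solve for ε+. A small helper (the function name is ours) that also clips the result into the recommended range (0, 1):

```python
def eps_plus_from(eps_minus, l_plus, l_minus):
    """Solve l_+ (1 - eps_+) = l_- (1 - eps_-) for eps_+, clipped to [0, 1)."""
    eps_plus = 1.0 - (l_minus / l_plus) * (1.0 - eps_minus)
    return min(max(eps_plus, 0.0), 1.0 - 1e-12)
```

For the artificial dataset above (l+ = 15, l− = 85), choosing ε− = 0.84 yields ε+ ≈ 0.093, consistent with the pair (ε+, ε−) = (0.1, 0.84) used in the experiment.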


Fig. 3. An unbalanced dataset consisting of 100 points, 15 of which are positive represented by ‘‘+’’s, and 85 points of which are negative represented by ‘‘∗’’s. The separating plane (middle line) is obtained by using a plain LSSVM (4). The positive class is mostly ignored by the solution. The total training set correctness is 95% with 66.7% correctness for positive class and 100% correctness for negative class.

Fig. 4. Linear classifier improvement by weighted C -LSSVM is demonstrated on the same dataset of Fig. 3. The separating plane (middle line) is obtained by using a weighted C -LSSVM. Even though the positive class is correctly classified in its entirety, the overall performance is still rather unsatisfactory due to significant difference in the distribution of points in each of the classes. Total training set correctness is 89%.

LSSVM, in the sense that a separating plane is obtained that correctly classifies all the points in the positive class. However, due to the significant difference in the cardinality of the two classes and in the distribution of their points, a subset of 9 points in the negative class is now misclassified. The total training set correctness is 89%, with 100% correctness for the positive class and 87.06% for the negative class. The resulting separating plane is shown in Fig. 4. If now the weighted ε-LSSVM is used, where ε+ = 0.1 and ε− = 0.84 satisfy (29), we obtain a separating plane that misclassifies only one point. The total training set correctness is 98%. The resulting separating plane is shown in Fig. 5.

5. Experimental results

In this section, several experiments are conducted to demonstrate the performance of our ε-LSSVM. All methods are implemented in MATLAB 2010 on a PC with an Intel Core i5 processor and 2 GB RAM. C-SVC and ε-LSSVM are solved by the optimization toolbox QP in MATLAB. LSSVM is the special case of our ε-LSSVM with ε = 0. The ‘‘Accuracy’’ used to evaluate the methods is defined as Accuracy = (TP + TN)/(TP + FP + TN + FN), where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. The classification accuracy of each method is measured by the standard tenfold cross-validation methodology.

First, we apply ε-LSSVM to the iris dataset [39], an established dataset for demonstrating the performance of classification algorithms. It contains three classes (Setosa, Versicolor, Virginica) and four attributes per iris, and the goal is to classify the class of an iris based on these four attributes. Here we restrict ourselves to the two classes (Versicolor, Virginica) and the two features that contain the most information about the class, namely the petal length and the petal width. The distribution of the data is illustrated in Fig. 6, where ‘‘+’’s and ‘‘∗’’s represent the classes Versicolor and Virginica, respectively.

The linear kernel and the RBF kernel K(x, x′) = exp(−‖x − x′‖²/σ²) are used, in which the parameter σ is fixed to 1.0, C = 10, and ε varies in {0, 0.1, 0.2, 0.3, 0.4, 0.5}. Experimental results are shown in Figs. 6 and 7, where the two proximal lines g(x) = −1 and g(x) = +1, the four ε-bounded lines g(x) = −1 ± ε and g(x) = 1 ± ε, and the separating line g(x) = 0 are depicted, and support


Fig. 5. Very significant linear classifier improvement as a consequence of the weighted ε-LSSVM is demonstrated on the same dataset as Figs. 3 and 4. The total training set correctness is now 98%, compared to 95% for the plain LSSVM and 89% for the weighted C-LSSVM.

Panels (a)–(f): ε = 0, 0.1, 0.2, 0.3, 0.4, 0.5.

Fig. 6. Linear ε -LSSVM: positive proximal line g (x) = 1 (down left thick line), negative proximal line g (x) = −1 (top right thick line), positive ε -bounded lines g (x) = 1 ± ε (down left dotted lines), negative ε -bounded lines g (x) = −1 ± ε (top right dotted lines), separating line g (x) = 0 (middle line), support vectors (marked by ‘‘◦’’). With the increase of ε , the percentage of SVs decreases.

vectors are marked by ‘‘◦’’ for different ε. Fig. 8 records the varying percentage of support vectors. We can see that with increasing ε the number of support vectors decreases, and therefore the sparseness increases, for both the linear and the nonlinear case. We also apply the weighted ε-LSSVM to this classification problem; here, to make the problem unbalanced, half of the training points are randomly selected from the ‘‘∗’’ class (negative class). The sparseness parameter ε− takes values in {0, 0.05, 0.1, 0.15, 0.2, 0.25} and ε+ is computed by (29). Experimental results are shown in Fig. 9 for the linear kernel and in Fig. 10 for the RBF kernel, where the corresponding lines are depicted and support vectors are marked by ‘‘◦’’ for different (ε−, ε+). Fig. 11 records the varying percentage of positive support vectors, negative support vectors and total support vectors for the linear and RBF cases


Panels (a)–(f): ε = 0, 0.1, 0.2, 0.3, 0.4, 0.5.

Fig. 7. Kernel ε -LSSVM: positive proximal line g (x) = 1 (down left line), negative proximal line g (x) = −1 (top right line), positive ε -bounded lines g (x) = 1 ± ε (dotted lines around the positive proximal line), negative ε -bounded lines g (x) = −1 ± ε (dotted lines around the negative proximal line), separating line g (x) = 0 (thick line), support vectors (marked by ‘‘◦’’). With the increase of ε , the percentage of SVs decreases.


Fig. 8. Percentage of SVs versus ε: sparseness increases with increasing ε. Linear case (upper broken line), nonlinear case (lower broken line).

respectively; we can also see that with increasing ε− and ε+ the number of support vectors decreases, and therefore the sparseness increases in both cases.

Second, in order to compare our ε-LSSVM and weighted ε-LSSVM with LSSVM and C-SVC, we choose several datasets from the UCI machine learning repository [39]. In Table 1, the classification accuracy and the percentage of support vectors are listed. For all the methods, the RBF kernel K(x, x′) = exp(−‖x − x′‖²/σ²) is used, and the optimal parameters C and σ are obtained by searching in the range 2⁻⁸ to 2⁸; the optimal parameter ε in ε-LSSVM is obtained in the range [0.1, 1] with step 0.1, using a tuning set comprising 30% of the dataset. Once the parameters are selected, the tuning set is returned to


Panels (a)–(f): ε− = 0, 0.05, 0.1, 0.15, 0.2, 0.25.

Fig. 9. Weighted ε -LSSVM with linear kernel for unbalanced dataset. (29) is used to compute ε+ for given ε− . Improved balanced results are obtained, and with the increase of ε− , ε+ , the percentage of SVs in each class decreases.

Table 1
Tenfold testing percentage accuracy of ε-LSSVM. The SVs column of LSSVM is shown as ‘‘\’’, since its percentage of SVs is always 100%.

Dataset (size)           LSSVM                  C-SVC                         ε-LSSVM                       Weighted ε-LSSVM
                         Acc %         SVs %    Acc %         SVs %           Acc %         SVs %           Acc %         SVs %
Hepatitis (155×19)       81.63±5.34    \        80.65±5.32    35.48±2.23      81.45±3.17    33.49±4.06      81.95±2.66    31.63±3.87
BUPA liver (345×6)       67.84±5.12    \        70.43±4.27    79.13±3.08      69.21±4.73    76.49±2.16      69.80±3.59    75.04±3.18
Heart-Statlog (270×14)   83.29±3.91    \        83.70±6.18    43.33±3.01      84.36±3.77    41.15±3.22      84.15±3.41    39.33±2.57
Votes (435×16)           90.72±3.65    \        93.33±3.85    40.46±4.52      90.63±2.76    38.31±3.45      92.81±3.14    37.19±2.93
WPBC (198×34)            74.59±3.38    \        76.28±4.68    51.55±5.17      76.77±2.94    48.36±3.77      76.82±3.53    50.64±4.06
Sonar (208×60)           83.11±5.12    \        85.10±5.04    41.83±3.80      84.11±3.81    42.04±3.13      84.87±3.75    41.17±2.86
Ionosphere (351×34)      91.02±4.79    \        94.59±5.53    25.07±3.24      91.96±3.72    22.54±3.87      93.09±2.47    23.17±3.34
Australian (690×14)      85.09±5.06    \        85.50±4.17    41.01±2.91      85.23±4.27    39.28±4.18      85.37±3.44    38.71±3.26
Pima-Indian (768×8)      76.08±5.72    \        77.60±3.76    53.26±3.27      76.52±4.33    50.12±3.78      77.91±4.27    50.55±3.34
CMC (1473×9)             64.12±2.78    \        64.46±3.26    69.67±4.35      65.18±2.69    67.31±3.86      65.75±3.18    66.18±3.25


Panels (a)–(f): ε− = 0, 0.05, 0.1, 0.15, 0.2, 0.25.

Fig. 10. Weighted ε -LSSVM with RBF kernel for unbalanced dataset. (29) is used to compute ε+ for given ε− . Improved balanced results are obtained, and with the increase of ε− , ε+ , the percentage of SVs in each class decreases.


Fig. 11. Sparseness of each class increases with increasing ε− and ε+. The linear and nonlinear cases are shown as broken lines with different markers.

learn the final classifier. For the unbalanced datasets, we apply the weighted ε-LSSVM and set the smaller class to be negative; ε− is chosen in [0.1, 1] with step 0.1, and Eq. (29) is used to compute ε+ approximately. From Table 1, it is easy to see that the accuracy and the sparseness of our ε-LSSVM and weighted ε-LSSVM are better than those of LSSVM on all datasets, since LSSVM is a special case of these two models. At the same time, the sparseness of ε-LSSVM and weighted ε-LSSVM is better than that of C-SVC on most datasets, while the accuracy is almost the same as that of C-SVC. Furthermore, the weighted ε-LSSVM performs better than ε-LSSVM on most datasets, as expected, since ε-LSSVM is in turn the special case of the weighted ε-LSSVM with ε− = ε+. For example, for CMC, the accuracy of our ε-LSSVM and weighted ε-LSSVM is 65.18% and 65.75% respectively, while the accuracy of LSSVM and C-SVC is 64.12% and 64.46% respectively. The percentage of SVs of LSSVM is obviously 100%, while that of the weighted ε-LSSVM is 66.18%, better than the 69.67% of C-SVC.


6. Conclusion

In this paper, we have proposed a novel LSSVM, termed ε-LSSVM, for binary classification. By introducing the ε-insensitive loss function in place of the quadratic loss function in the LSSVM, the ε-LSSVM has several advantages over the plain LSSVM. (1) It has sparseness, which is controlled by the parameter ε. (2) By weighting different sparseness parameters ε for each class, the unbalanced problem can be solved successfully. (3) It is actually a kind of ε-support vector regression (ε-SVR), the only difference being that it treats the binary classification problem as a special kind of regression problem. (4) It can be implemented efficiently by SMO for large scale problems. The parameters ε control the sparseness and can be chosen flexibly, and therefore improve the plain LSSVM in many ways. A useful choice of ε− and ε+ for the different classes was also given in the weighted ε-LSSVM algorithm. Computational comparisons between these two ε-LSSVMs and other methods, including C-SVC and LSSVM, have been made on several datasets, indicating the effectiveness of our method in sparseness, balance performance and classification accuracy. Extensions of ε-LSSVM to multi-class classification, robust classification, and multi-instance classification are also interesting and under our consideration.

Acknowledgments

This work has been partially supported by grants from the National Natural Science Foundation of China (No. 11271361 and No. 70921061), the CAS/SAFEA International Partnership Program for Creative Research Teams, the Major International (Regional) Joint Research Project (No. 71110107026), and the President Fund of GUCAS.

References

[1] C. Cortes, V.N. Vapnik, Support-vector networks, Machine Learning 20 (3) (1995) 273–297.
[2] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1996.
[3] V.N. Vapnik, Statistical Learning Theory, John Wiley and Sons, New York, 1998.
[4] N.Y. Deng, Y.J. Tian, Support Vector Machines: Theory, Algorithms and Extensions, Science Press, Beijing, 2009.
[5] N.Y. Deng, Y.J. Tian, C.H. Zhang, Support Vector Machines: Optimization Based Theory, Algorithms and Extensions, Chapman and Hall/CRC Press, 2012.
[6] M.M. Adankon, M. Cheriet, Model selection for the LS-SVM: application to handwriting recognition, Pattern Recognition 42 (12) (2009) 3264–3270.
[7] M.B. Karsten, Kernel methods in bioinformatics, in: Handbook of Statistical Bioinformatics, Part 3, 2011, pp. 317–334.
[8] K.J. Kim, Financial time series forecasting using support vector machines, Neurocomputing 55 (1–2) (2003) 307–319.
[9] G. Schweikert, A. Zien, G. Zeller, J. Behr, C. Dieterich, C.S. Ong, P. Philips, F. De Bona, L. Hartmann, A. Bohlen, et al., mGene: accurate SVM-based gene finding with an application to nematode genomes, Genome Research 19 (2009) 2133–2143.
[10] D. Anguita, A. Boni, Improved neural network for SVM learning, IEEE Transactions on Neural Networks 13 (5) (2002) 1243–1244.
[11] L.J. Cao, F.E.H. Tay, Support vector machine with adaptive parameters in financial time series forecasting, IEEE Transactions on Neural Networks 14 (6) (2003) 1506–1518.
[12] J.A.K. Suykens, J. Vandewalle, Least squares support vector machine classifiers, Neural Processing Letters 9 (3) (1999) 293–300.
[13] J.A.K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, J. Vandewalle, Least Squares Support Vector Machines, World Scientific, 2002.
[14] T. Van Gestel, J. Suykens, B. Baesens, S. Viaene, J. Vanthienen, G. Dedene, B. De Moor, J. Vandewalle, Benchmarking least squares support vector machine classifiers, Machine Learning 54 (1) (2004) 5–32.
[15] M.M. Adankon, M. Cheriet, A. Biem, Semisupervised learning using Bayesian interpretation: application to LS-SVM, IEEE Transactions on Neural Networks 22 (4) (2011) 513–524.
[16] L.F. Bo, L.C. Jiao, L. Wang, Working set selection using functional gain for LS-SVM, IEEE Transactions on Neural Networks 18 (5) (2007) 1541–1544.
[17] K. Pelckmans, J.A.K. Suykens, B. De Moor, A convex approach to validation-based learning of the regularization constant, IEEE Transactions on Neural Networks 18 (3) (2007) 917–920.
[18] K. De Brabanter, J. De Brabanter, J.A.K. Suykens, B. De Moor, Optimized fixed-size kernel methods for large scale data sets, Computational Statistics & Data Analysis 54 (6) (2010) 1484–1504.
[19] L.V. Ferreira, E. Kaszkurewicz, A. Bhaya, Solving systems of linear equations via gradient systems with discontinuous righthand sides: application to LS-SVM, IEEE Transactions on Neural Networks 16 (2) (2005) 501–505.
[20] J.A.K. Suykens, L. Lukas, P. Van Dooren, B. De Moor, J. Vandewalle, Least squares support vector machine classifiers: a large scale algorithm, in: Proc. European Conference on Circuit Theory and Design (ECCTD-99), Stresa, Italy, Sep. 1999, pp. 839–842.
[21] W. Chu, C.J. Ong, S.S. Keerthi, An improved conjugate gradient scheme to the solution of least squares SVM, IEEE Transactions on Neural Networks 16 (2) (2005) 498–501.
[22] S.S. Keerthi, S.K. Shevade, SMO algorithm for least-squares SVM formulations, Neural Computation 15 (2) (2003) 487–507.
[23] J. Platt, Fast training of support vector machines using sequential minimal optimization, in: B. Schölkopf, C.J.C. Burges, A.J. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 2000.
[24] K.S. Chua, Efficient computations for large least square support vector machine classifiers, Pattern Recognition Letters 24 (1–3) (2003) 75–80.
[25] J.A.K. Suykens, L. Lukas, J. Vandewalle, Sparse approximation using least squares support vector machines, in: Proc. 2000 IEEE International Symposium on Circuits and Systems (ISCAS), Geneva, Switzerland, 2000, pp. 757–760.
[26] J.A.K. Suykens, J. De Brabanter, L. Lukas, J. Vandewalle, Weighted least squares support vector machines: robustness and sparse approximation, Neurocomputing 48 (1–4) (2002) 85–105.
[27] B.J. de Kruif, T.J.A. de Vries, Pruning error minimization in least squares support vector machines, IEEE Transactions on Neural Networks 14 (3) (2004) 696–702.
[28] L. Hoegaerts, J.A.K. Suykens, J. Vandewalle, B. De Moor, A comparison of pruning algorithms for sparse least squares support vector machines, in: Lecture Notes in Computer Science, vol. 3316, 2004, pp. 1247–1253.
[29] X.Y. Zeng, X.W. Chen, SMO-based pruning methods for sparse least squares support vector machines, IEEE Transactions on Neural Networks 16 (6) (2005) 1541–1546.
[30] Y.G. Li, C. Lin, W.D. Zhang, Improved sparse least-squares support vector machine classifiers, Neurocomputing 69 (13–15) (2006) 1655–1658.
[31] L. Hoegaerts, J. Suykens, J. Vandewalle, B. De Moor, Primal space sparse kernel partial least squares regression for large scale problems, in: Proc. IEEE Int. Joint Conf. Neural Networks, 2004, pp. 561–566.
[32] G.C. Cawley, N.L.C. Talbot, Improved sparse least-squares support vector machines, Neurocomputing 48 (1–4) (2002) 1025–1031.

Y. Tian et al. / Computers and Mathematics with Applications (

)



13

[33] G.C. Cawley, N.L.C. Talbot, Fast exact leave-one-out cross-validation of sparse least-squares support vector machines, Neural Networks 17 (10) (2004) 1467–1475. [34] L.C. Jiao, L.F. Bo, L. Wang, Fast sparse approximation for least squares support vector machine, IEEE Transactions on Neural Networks 18 (3) (2007) 685–697. [35] M.W. Chang, C.J. Lin, Leave-one-out bounds for support vector regression model selection, Neural Computation 17 (5) (2005) 1188–1222. [36] C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, ACM Transactions on Intelligent Systems and Technology 2 (3) (2011) 27:1–27:27. [37] R.E. Fan, P.H. Chen, C.J. Lin, Working set selection using second order information for training SVM, Journal of Machine Learning Research 6 (2005) 1889–1918. URL http://www.csie.ntu.edu.tw/cjlin/papers/quadworkset.pdf. [38] G.M. Fung, O.L. Mangasarian, Multicategory proximal support vector machine classifiers, Machine Learning 59 (1–2) (2005) 77–97. [39] C.L. Blake, C.J. Merz, UCI repository for machine learning databases. Dept. Inf. Comput. Sci., Univ. California, Irvine [online]. Available: http://www.ics. uci.edu/mlearn/MLRepository.html.