
Neurocomputing 30 (2000) 333–337

Letters

Ray-guided global optimization method for training neural networks

Ximin Zhang, Yan Qiu Chen*

School of Electrical and Electronic Engineering, Nanyang Technological University, Block S1, Singapore 639798, Singapore

Received 1 March 1999; revised 18 May 1999; accepted 18 May 1999

* Corresponding author. E-mail address: [email protected] (Y.Q. Chen)

Abstract

A novel method for globally searching for good minima is proposed in this paper. Starting from a local minimum, the weight space around it is scanned, with the process guided by terrain-independent emanating rays. During the search, starting points for further exploration are identified and used to find the corresponding local minima. Based on the correct classification rate (CCR) on the validation data, the best minimum is found. © 2000 Elsevier Science B.V. All rights reserved.

1. Introduction

The backpropagation (BP) algorithm, developed by Werbos [6] and publicized by Rumelhart et al. [4], has attracted considerable research interest. The BP algorithm is a local-minimization method and, owing to its gradient-descent nature, has difficulties when the weight space is rugged (having many local minima) and when the error surface is flat [3]. To overcome the deficiencies of local search, global-minimization methods have been proposed [1,2,5,7]. However, an exhaustive global search is computationally infeasible and thus unsuitable for practical applications; local search methods, on the other hand, explore only a limited region of the weight space. To combine their advantages, we propose a novel ray-guided exploration method (R-GEM) that expands the region of exploration and thus enables the BP algorithm to achieve better performance.



The algorithm starts from a local minimum and uses searching rays around the weight vector corresponding to that minimum to collect terrain information. During the search, promising starting points are identified, from which BP is applied to find the corresponding minima. All the obtained minima are compared using a validation-CCR-based criterion. The best minimum is then selected.

2. Ray-guided exploration method (R-GEM)

The ray-guided exploration method (R-GEM) combines global search, to identify promising starting points, with local search, to find the minima corresponding to those starting points. The algorithm first finds one minimum W(0), from which a global search guided by emanating rays is launched. The emanating rays can be written as

W^m(t) = W^m(0) + T^m(t),    (1)

where W^m = (w^m_1, w^m_2, ..., w^m_n), w^m_i ∈ R, is the connection weight vector, m is the search-ray index, t is the ray parameter, and T^m(t) indicates the direction of the ray. The emanating ray plays an important role in uncovering new regions with potential local minima. Each ray is parameterized by the autonomous variable t: at t = 0 the ray is at the starting minimum, and as t increases it moves away from the starting minimum. The directions of the rays are chosen randomly. The random emanating rays are governed by the following equation:

T^m(t) = t · D^m,    (2)

where D^m = (d^m_1, d^m_2, ..., d^m_n), d^m_i ∈ R, is a randomly generated normalized vector.

As the rays advance with t, starting points are obtained by utilizing the magnitude of the error along the emanating rays; they are identified by searching for dips in the error along each ray. Each time a starting point is identified, BP is applied to find the corresponding local minimum. The procedure is repeated until the stopping conditions are satisfied. A validation data set is used to evaluate the CCR corresponding to each new minimum. After searching along the specified number of rays, the minimum (and its corresponding weight vector) with the best performance is selected.

The R-GEM algorithm is described by the following steps:

Step 1: Apply BP to find a minimum W(0).
Step 2: Randomly generate a searching ray with direction D^m = (d^m_1, d^m_2, ..., d^m_n).
Step 3: Search along the ray to find dips in the error level.
Step 4: Apply BP to each of the dips found in Step 3. If a resulting minimum offers better performance, update the best weight vector accordingly.
Step 5: Repeat Steps 2-4 a specified number of times.

R-GEM tries to identify good starting points before applying local search, and hence avoids searching unpromising regions from random starting points. Since each searching ray of R-GEM is terrain-independent, the rays are not interrelated; their computations are therefore independent of one another and can be parallelized with little effort. R-GEM is thus inherently parallel.
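To make the procedure concrete, the following is a minimal sketch of the R-GEM loop, not the authors' implementation. The names error, local_search (standing in for a BP-based minimizer) and validation_ccr are hypothetical placeholders, and the simple neighbour-comparison test is only one possible way to detect dips along a ray.

```python
import numpy as np

def rgem(error, local_search, validation_ccr, w_init,
         n_rays=50, t_lo=-20.0, t_hi=20.0, dt=0.01, seed=0):
    """Minimal R-GEM sketch: rays emanate from the first minimum W(0)."""
    rng = np.random.default_rng(seed)
    w0 = local_search(w_init)                       # Step 1: find a minimum W(0)
    best_w, best_ccr = w0, validation_ccr(w0)
    ts = np.arange(t_lo, t_hi + dt, dt)             # discretized ray parameter t
    for _ in range(n_rays):                         # Step 5: repeat for several rays
        d = rng.standard_normal(w0.shape)           # Step 2: random direction D^m
        d /= np.linalg.norm(d)                      # normalize, as in Eq. (2)
        e = np.array([error(w0 + t * d) for t in ts])   # error along the ray, Eq. (1)
        for i in range(1, len(ts) - 1):             # Step 3: dips = local minima of e(t)
            if e[i] < e[i - 1] and e[i] < e[i + 1]:
                w = local_search(w0 + ts[i] * d)    # Step 4: BP from each dip
                ccr = validation_ccr(w)
                if ccr > best_ccr:                  # keep the best minimum by validation CCR
                    best_w, best_ccr = w, ccr
    return best_w, best_ccr
```

In practice, local_search would wrap the BP training routine and validation_ccr the classification accuracy on the held-out validation set; since each ray is processed independently, the loop over rays can be parallelized directly.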


For a global optimization problem over a feasible region S in R^n with diameter L, it is reasonable to assume that the local minima W_i are distributed randomly over S with a uniform distribution. Each local minimum W_i has one attraction region A_i, defined as the set of points in S from which local search converges to W_i. The probability of one ray hitting an attraction region is then

P(hit) = m K (π − 2) / (2π),    (3)

where m is the number of local minima and K is a constant calculated by

K = (1 / (L V(S))) ∫ ... ∫_S ||W_0 − W_i|| dw_1 ... dw_n,    (4)

where V(S) is the volume of the feasible region S, W_0 = [w_01, w_02, ..., w_0n] is the initial point and W_i = [w_i1, w_i2, ..., w_in] is the local minimum.
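As an illustrative consequence not stated in the paper (and assuming each ray hits an attraction region independently of the others), Eq. (3) implies that the number of rays needed until the first hit is geometrically distributed, so its expectation is

\[
\mathbb{E}[\text{rays until first hit}] \;=\; \frac{1}{P(\text{hit})} \;=\; \frac{2\pi}{m K (\pi - 2)},
\]

which decreases as the number of local minima m, or the constant K, grows.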

3. Experimental results

The proposed method is compared with simulated annealing, NOVEL and multistart on the two-spiral problem and on three benchmark problems: the sonar, the diabetes and the glass problems (downloadable from ftp://ftp.ira.uka.de/pub/neuron).

Table 1
Performance of R-GEM, multistart and NOVEL on the two-spiral problem

Method       Training error   Training CCR (%)   Test error   Test CCR (%)   CPU time
R-GEM        0.006919         100                0.017999     95.5           2511 s
Multistart   0.020944         98.5               0.043322     91.0           3912 s
NOVEL        0.017623         97.5               0.018481     94.5           2927 s

Fig. 1. Decision boundaries obtained by (a) R-GEM, (b) multistart and (c) NOVEL.

Table 2
Comparison of the best results obtained by R-GEM, multistart, NOVEL and simulated annealing (SA) for solving three benchmark problems

             Architecture               R-GEM CCR (%)      NOVEL CCR (%)      Multistart CCR (%)   SA CCR (%)
Problem      Hidden units   Weights     Training   Test    Training   Test    Training   Test      Training   Test
Diabetes     2              24          84.4       80.2    84.4       80.2    84.1       79.6      84.1       79.2
Diabetes     3              35          86.2       79.2    85.6       78.8    84.6       78.6      84.4       78.1
Glass        5              86          89.7       66.0    87.3       64.9    86.9       64.2      87.8       60.4
Glass        6              102         92.5       67.9    90.8       66.5    87.8       66.0      86.9       62.3
Sonar        2              125         100        92.3    100        92.3    99.0       90.4      99.0       88.5
Sonar        3              187         100        92.3    100        92.3    100        91.3      100        91.3


In applying R-GEM to the training of the network, we start from a point generated randomly in the weight space, ranging between [−1, 1] in each dimension. The range of t decides the range of the global search; we fixed the search range of t to [−20, 20]. Eq. (1) is discretised for computer simulation:

W^m(t + Δt) = W^m(t) + Δt · D^m,    (5)

where Δt is the step size. Too large a Δt may cause loss of information; on the other hand, too small a Δt requires excessive computation. It is found that Δt ∈ [0.005, 0.02] gives satisfactory results for the classification problems in our experiments.
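The scan along a single ray implied by Eq. (5) can be written as the short sketch below; the toy error function and the particular step size used here (Δt = 0.01, t ∈ [−20, 20]) are hypothetical stand-ins chosen within the ranges reported above, not the authors' code.

```python
import numpy as np

def scan_ray(error, w0, direction, t_lo=-20.0, t_hi=20.0, dt=0.01):
    """Walk along one emanating ray via Eq. (5) and return candidate starting points."""
    d = direction / np.linalg.norm(direction)      # normalized direction D^m
    ts = np.arange(t_lo, t_hi + dt, dt)
    w = w0 + ts[0] * d                              # start of the discretized ray
    errors = []
    for t in ts:
        errors.append(error(w))
        w = w + dt * d                              # Eq. (5): W(t + dt) = W(t) + dt * D
    errors = np.array(errors)
    # dips: points whose error is lower than both neighbours along the ray
    dip_idx = [i for i in range(1, len(ts) - 1)
               if errors[i] < errors[i - 1] and errors[i] < errors[i + 1]]
    return [w0 + ts[i] * d for i in dip_idx]        # starting points for local search

# Example with a toy quadratic "error" and a random direction (illustration only)
rng = np.random.default_rng(0)
starts = scan_ray(lambda w: float(np.sum(w ** 2)), np.zeros(5), rng.standard_normal(5))
```

Each returned starting point would then be handed to the BP local search (Step 4) and the resulting minima compared by their validation CCR.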

To ensure a fair comparison, the same amount of CPU time (on an HP Visualize C180 workstation) is allocated to each method, and the same network structures are used. For the two-spiral problem, all the methods find some good minima within three hours of CPU time. The results given in Table 1 show that R-GEM achieves a significantly higher CCR on the test patterns than multistart. NOVEL's performance is comparable to that of R-GEM, but we find it to depend heavily on the choice of its parameters (the k coefficients and the time intervals); an unsuitable selection can lead to very bad results. The decision regions formed by the three methods are shown in Fig. 1, which also indicates that R-GEM gives the best results.

For the three benchmark problems, the search time is set to 2 h. Table 2 shows the results of each method on the three problems. It is seen that R-GEM achieves the best results on both the training and the test patterns for all of the problems. NOVEL gives performance almost comparable to R-GEM.

4. Conclusions

We have developed a global optimization method, R-GEM, for finding good minima using emanating rays that traverse the error landscape. Because the starting points are selected from the information-bearing error curve, R-GEM avoids much unnecessary computation in re-exploring regions that are already known. Compared with other nonlinear optimization methods, the trace function in R-GEM is simple, and R-GEM is readily implemented on a parallel computer.

References

[1] A. Corana, M. Marchesi, C. Martini, S. Ridella, Minimizing multimodal functions of continuous variables with the simulated annealing algorithm, ACM Trans. Math. Software 13 (3) (1987) 262-280.
[2] R. Horst, H. Tuy, Global Optimization: Deterministic Approaches, Springer, Berlin, 1993.
[3] R. Rojas, The backpropagation algorithm, in: Neural Networks: A Systematic Introduction, Springer, Berlin, 1996, pp. 149-182.
[4] D. Rumelhart, G. Hinton, R. Williams, Learning internal representations by error propagation, in: D.E. Rumelhart, J.L. McClelland (Eds.), Parallel Distributed Processing, MIT Press, Cambridge, MA, 1986, pp. 318-362.
[5] E. Sturua, S. Zavriev, A trajectory algorithm based on the gradient method I. The search on the quasioptimal trajectories, J. Global Optim. 4 (1991) 375-388.
[6] P. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioural Sciences, Ph.D. Thesis, Harvard University, 1974.
[7] Y. Shang, B.W. Wah, Global optimization for neural network training, IEEE Computer 29 (3) (1996) 45-54.