Calibrating artificial neural networks by global optimization


János D. Pintér
Pintér Consulting Services, Inc., Halifax, NS, Canada
School of Engineering, Özyeğin University, Istanbul, Turkey

Keywords: Artificial neural networks; Calibration of ANNs by global optimization; ANN implementation in Mathematica; Lipschitz Global Optimizer (LGO) solver suite; MathOptimizer Professional (LGO linked to Mathematica); Numerical examples

Abstract

Artificial neural networks (ANNs) are used extensively to model unknown or unspecified functional relationships between the input and output of a "black box" system. In order to apply the generic ANN concept to actual system model fitting problems, a key requirement is the training of the chosen (postulated) ANN structure. Such training serves to select the ANN parameters in order to minimize the discrepancy between modeled system output and the training set of observations. We consider the parameterization of ANNs as a potentially multi-modal optimization problem, and then introduce a corresponding global optimization (GO) framework. The practical viability of the GO based ANN training approach is illustrated by finding close numerical approximations of one-dimensional, yet visibly challenging functions. For this purpose, we have implemented a flexible ANN framework and an easily expandable set of test functions in the technical computing system Mathematica. The MathOptimizer Professional global–local optimization software has been used to solve the induced (multi-dimensional) ANN calibration problems.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

A neuron is a single nerve cell consisting of a nucleus (cell body, the cell's life support center), incoming dendrites with endings called synapses that receive stimuli (electrical input signals) from other nerve cells, and an axon (nerve fiber) with terminals that deliver output signals to other neurons, muscles or glands. The electrical signal traveling down the axon is called a neural impulse: this signal is communicated to the synapses of some other neurons. In this way, a single neuron acts like a signal processing unit, and the entire neural network has a complex, massive multi-processing architecture. The neural system is essential for all organisms, in order to adapt properly and quickly to their environment.

An artificial neural network (ANN) is a computational model – implemented as a computer program – that emulates the essential features and operations of neural networks. ANNs are used extensively to model complex relationships between input and output data sequences, and to find hidden patterns in data sets when the analytical (explicit model function based) treatment of the problem is tedious or currently not possible. ANNs replace these unknown functional relationships by adaptively constructed approximating functions. Such approximators are typically designed on the basis of training examples of a given task, such as the recognition of hand-written letters or other structures and


shapes of interest. For in-depth expositions related to the ANN paradigm consult, e.g. Hertz, Krogh, and Palmer (1991), Cichocki and Unbehauen (1993), Smith (1993), Golden (1996), Ripley (1996), Mehrotra, Mohan, and Ranka (1997), Bar-Yam (1999), Steeb (2005) and Cruse (2006).

The formal mathematical model of an ANN is based on the key features of a neuron as outlined above. We shall consider first a single artificial neuron (AN), and assume that a finite sequence of input signals s = (s_1, s_2, ..., s_T) is received by its synapses: the signals s_t, t = 1, ..., T, can be real numbers or vectors. The AN's information processing capability will be modeled by introducing an aggregator (transfer) function a that depends on a certain parameter vector x = (x_1, x_2, ..., x_n); subsequently, x will be chosen according to a selected optimality criterion. The optimization procedure is typically carried out in such a way that the sequence of model-based output signals m = (m_1, m_2, ..., m_T) generated by the function a = a(x, s) is as close as possible to the corresponding set (vector) of observations o = (o_1, o_2, ..., o_T), according to a given discrepancy measure. This description can be summarized symbolically as follows:

s \to \mathrm{AN} \to m, \quad s = (s_1, \dots, s_T), \quad o = (o_1, \dots, o_T), \quad m = m(x) = (m_1, m_2, \dots, m_T), \quad m_t = a(x, s_t), \ t = 1, \dots, T. \qquad (1)

Objective: minimize d(m(x), o) by selecting the ANN parameterization x. A perceptron is a single AN modeled following the description given above. Flexible generalizations of a single perceptron based


Fig. 1. Schematic representation of a multilayered ANN.

model are multiple perceptrons receiving the same input signals: such a structure is called an artificial neuron layer (ANL). To extend this structure further, multilayered networks can also be defined: in these, several ANLs are connected sequentially. It is also possible to model several functional relationships simultaneously by the corresponding output sequences of a multi-response ANN. Fig. 1 illustrates a multilayered feed-forward ANN that includes an input layer with multiple inputs, an intermediate (hidden) signal processing layer, and an output layer that emits multiple responses. The hidden layer's AN units receive the input information, followed by implementation specific (decomposition, extraction, and transformation) steps, to generate the output information. Hence, the ANN versions used in practice all share the features of adaptive, distributed, local, and parallel processing.

Following (1), the optimal ANN parameterization problem can be stated as follows. Given the functional form of a, and the sequence s of input values, we want to find a parameterization x such that the difference d(m, o) = d(m(x), o) between the ANN output m = a(x, s) and the given training data set o is minimal. The discrepancy d(m, o) is typically calculated by introducing a suitable norm function: d(m, o) = ||m − o||. Once the parameter vector x becomes known – i.e., it has been numerically estimated as a result of using the training I/O sequences s and o – the trained ANN can be used to perform various tasks related to the assessment of further input data sequences such as {s_t} for t > T. Several important application examples of the outlined general paradigm are signal processing, pattern recognition, data classification, and time series forecasting.

2. Function approximation by neural networks

The essential theoretical feature of ANNs is the ability to approximate continuous functions by learning from observed data, under very general analytical conditions. Specifically, the fundamental approximation theorem of Cybenko (1989) states that a feed-forward ANN with a single hidden layer that contains a finite number of neurons equipped with a suitable transfer function can serve as a universal approximator of any given continuous real-valued function f: R^k → R defined on [0, 1]^k. The approximation is uniform, in terms of the supremum norm of functions defined on [0, 1]^k. We discuss this result below in some detail, and then introduce an optimization model corresponding to the symbolic summary description (1).

For a somewhat more complete picture of the theoretical foundations of the ANN paradigm, let us remark that Funahashi (1989) and Hornik, Stinchcombe, and White (1989) proved similar results to that of Cybenko. Specifically, Hornik et al. (1989) showed that feed-forward neural networks with a single hidden layer of suitable transfer function units can uniformly approximate any Borel measurable multivariate vector function f: R^{k1} → R^{k2}, to arbitrary prescribed accuracy on an arbitrary compact set C ⊂ R^{k1}. For more

detailed technical discussions of these and closely related approximation results, consult also the foundations laid out by Kolmogorov (1957): these however do not lead directly to the results cited here, since the approximating formula itself depends on the function f considered. Further discussions of closely related function approximation results have been presented by Park and Sandberg (1991, 1993), Lang (2005) and Sutskever and Hinton (2008), with topical references.

Let us consider now a single input signal s_t from R^k, k ≥ 1, and the corresponding scalar output o_t = f(s_t) for all t = 1, ..., T: here f is an unknown continuous model function that we wish to approximate by a suitably defined ANN. In this general model setting the input signals can be delivered to a single synapse of an AN, or to several synapses of several ANs at the same time: the corresponding information will be aggregated by the ANN. Specifically, Cybenko (1989) established the existence of a parameterized aggregator function a(x, s_t), using a fixed type of transfer function σ, such that the approximation shown by (2) is uniformly valid:

f(s_t) \approx a(x, s_t) := \sum_{j=1}^{J} \alpha_j \, \sigma\!\left( \sum_{i=1}^{I} y_{ji} s_{ti} + \theta_j \right). \qquad (2)

To relate this general result directly to the underlying ANN model, in (2) j = 1, ..., J denote the indices of the individual signal processing units AN_j in the hidden layer. For each unit AN_j, σ is a suitable function that will be described below, and α_j, y_j, θ_j are model parameters to be determined depending on the (unknown) function instance f. In (2) σ is a continuous real-valued transfer function that, in full generality, could also be a function of j. The argument \sum_{i=1}^{I} y_{ji} s_{ti} of the function σ is the scalar product of the synaptic weight vector y_j of AN_j and the signal s_t = (s_{t1}, ..., s_{tI}), whose component s_{ti} arrives at synapse i of AN_j, for i = 1, ..., I. The parameter θ_j is called the threshold level (offset, or bias) of AN_j. Finally, α_j is the weight of AN_j in the function approximation (2), for j = 1, ..., J. Following Cybenko (1989) and Hornik et al. (1989), we assume that the transfer function σ is monotonically non-decreasing, and that it satisfies the relations (3) shown below. (Transfer functions with the stated limit properties are called sigmoidal or squashing functions.)

\lim_{z \to -\infty} \sigma(z) = 0 \quad \text{and} \quad \lim_{z \to +\infty} \sigma(z) = 1. \qquad (3)

Let us note here that other types of transfer functions could also be used to generate valid function approximations. Suitable sets of radial basis functions (RBF) are another well-known example (Park & Sandberg, 1991, 1993; Röpke et al., 2005). RBFs have the general form φ(z) = φ(||z − z_0||): here ||·|| is a given norm function, and z_0 is the (scalar or vector) center of the monotonically non-increasing function φ that satisfies the conditions φ(0) = 1 and φ(z) → 0 as ||z − z_0|| → ∞. We will continue the present discussion using a specific sigmoidal transfer function. A frequently used transfer function form is (4):

\sigma(z) = \frac{1}{1 + e^{-z}}, \quad z \in \mathbb{R}. \qquad (4)
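To make the building blocks (2) and (4) concrete, a minimal sketch in Mathematica (the computing environment used later in this paper) is shown below. The names sigma and aggregator are illustrative choices made here, not the paper's actual implementation.

```mathematica
(* Sigmoidal transfer function (4) *)
sigma[z_] := 1/(1 + Exp[-z]);

(* Aggregator function (2) for one hidden layer with J units and I inputs:
   alpha: hidden unit weights {alpha_j}, y: synaptic weight vectors {y_j},
   theta: thresholds {theta_j}, s: one input signal vector (s_t1, ..., s_tI) *)
aggregator[alpha_List, y_List, theta_List, s_List] :=
  Sum[alpha[[j]] sigma[y[[j]].s + theta[[j]]], {j, Length[alpha]}]
```

For the one-dimensional test problems of Section 6 one has I = 1, so each y_j is a one-component vector and s is passed as a single-element list.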

For proper theoretical and better numerical tractability, we will assume that finite box (variable range) constraints are defined with respect to the admissible values of the parameters α_j, y_{ji} and θ_j, as expressed by (5):

\alpha_j^{\min} \le \alpha_j \le \alpha_j^{\max} \quad \text{for } j = 1, \dots, J;
y_{ji}^{\min} \le y_{ji} \le y_{ji}^{\max} \quad \text{for } i = 1, \dots, I \text{ and } j = 1, \dots, J; \qquad (5)
\theta_j^{\min} \le \theta_j \le \theta_j^{\max} \quad \text{for } j = 1, \dots, J.

The collection of ANN model parameters will be denoted by x = ({α_j}, {y_{ji}}, {θ_j}): here all the components are included for i = 1, ..., I and j = 1, ..., J. The optimization problem induced by a


given approximator function form is to minimize the error of the approximation (2), considering the training sequence (s_t, o_t), t = 1, ..., T, and using a selected norm function. In this study, we shall use the least squares error criterion to measure the quality of the approximation. In other words, the objective function (error function) e(x) defined by (6) will be minimized under the constraints (5):

e(x) = \sum_{t=1}^{T} \big( f(s_t) - a(x, s_t) \big)^2. \qquad (6)
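Building on the aggregator sketch given after (4), the least squares error (6) can be written directly; again, the names below (annError, sdata, odata) are only illustrative.

```mathematica
(* Least squares training error (6): params = {alpha, y, theta},
   sdata = list of training inputs s_t, odata = list of observed outputs o_t *)
annError[{alpha_, y_, theta_}, sdata_List, odata_List] :=
  Total[(odata - (aggregator[alpha, y, theta, #] & /@ sdata))^2]
```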

Let us note that other (more general linear or nonlinear) constraints regarding feasible parameter combinations could also be present, if dictated by the problem context. Such situations can be handled by adding the corresponding constraints to the box-constrained optimization model formulation (5) and (6). An important case in point is to attempt the exclusion of identical solutions in which the component functions α_j σ(Σ_{i=1,...,I} y_{ji} s_{ti} + θ_j) of the function a are, in principle, exchangeable. To break such symmetries, we can assume, e.g. that the components of {α_j} are ordered, so that we add the constraints

\alpha_j \ge \alpha_{j+1} \quad \text{for all } j = 1, \dots, J - 1. \qquad (7)

One could also add further lexicographic constraints if dictated by the problem at hand. Another optional – but not always applicable – constraint results by normalizing the weights {α_j}, as shown below:

\sum_{j=1}^{J} \alpha_j = 1, \quad \alpha_j \ge 0 \quad \text{for all } j = 1, \dots, J. \qquad (8)
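If the optional constraints (7) and (8) are to be imposed, they can be generated programmatically from the list of weight symbols; a small sketch with hypothetical helper names follows.

```mathematica
(* Symmetry-breaking ordering constraints (7): alpha_j >= alpha_(j+1) *)
orderingConstraints[alpha_List] :=
  Table[alpha[[j]] >= alpha[[j + 1]], {j, Length[alpha] - 1}]

(* Normalization constraints (8): the weights are non-negative and sum to one *)
normalizationConstraints[alpha_List] :=
  Join[{Total[alpha] == 1}, Thread[alpha >= 0]]
```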


3. Postulating and calibrating ANN model instances

Cybenko (1989) emphasizes that his result cited here only guarantees the theoretical possibility of a uniformly valid approximation of a continuous function f, as opposed to its exact representation. Furthermore, the general approximation result expressed by (2) does not specify the actual number of processing units required: hence, the key model parameter J needs to be set or estimated for each problem instance separately. Cybenko (1989) also conjectures that J could become "astronomical" in many function approximation problem instances, especially so when dealing with higher dimensional approximations. These points need to be kept in mind, to avoid making unjustifiable claims regarding the "universal power" of a priori defined (prefixed) ANN structures.

Once we select – i.e., postulate or incrementally develop – the ANN structure to use in an actual application, the remaining essential task is to find its parameterization with respect to a given training data set. The objective function defined by (6) makes this a nonlinear model fitting problem that could have a multitude of local or global optima. The multi-modality aspect of nonlinear model calibration problems is discussed in detail by Pintér (1996, 2003); in the context of ANNs, consult, e.g. Auer, Herbster, and Warmuth (1996), Bianchini and Gori (1996) and Abraham (2004).

Following the selection of the function form σ, the number of real-valued parameters (i.e., the variables to optimize) is J(I + 2), each with given bound constraints. Furthermore, considering also both (7) and (8), the number of optionally added linear constraints is J. Since in the general ANN modeling framework the values of I and J cannot be postulated "exactly", in many practical applications one can try a number of computationally affordable combinations. In general, ANNs with more components can serve as better approximators, within reason and in line with the size T of the training data set that should minimally satisfy the condition T ≥ J(I + 2). At the same time, in order to assure the success of a suitable nonlinear optimization procedure and its software implementation, one should restrict I and J to manageable values. These choices will evidently influence the accuracy of the approximation obtained.

Summarizing the preceding discussion, first one has to postulate an ANN structure, and the family of parameterized transfer functions to use. Next, one needs to provide a set of training data. The training of the chosen ANN is then aimed at the selection of an optimally parameterized function from the selected family of function models, to minimize the chosen error function.

The training of ANNs using optimization techniques is often referred to in the literature as back-propagation. Due to the possible multimodality of the error function (6), the standard tools of unconstrained local nonlinear optimization (direct search methods, gradient information based methods, Newton-type methods) often fail to locate the best parameter combination. The multimodality issue is well-known: consult, e.g. Bianchini and Gori (1996), Sexton, Alidaee, Dorsey, and Johnson (1998), Abraham (2004) and Hamm, Brorsen, and Hagan (2007). To overcome this difficulty, various global scope optimization strategies have also been used to train ANNs: these include experimental design, spatial sampling by low-discrepancy sequences, theoretically exact global optimization approaches, as well as popular heuristics such as evolutionary optimization, particle swarm optimization, simulated annealing, tabu search and tunneling functions. ANN model parameterization issues have been discussed and related optimization approaches proposed and tested, e.g. by Watkin (1993), Grippo (1994), Prechelt (1994), Bianchini and Gori (1996), Sexton et al. (1998), Jordanov and Brown (1999), Toh (1999), Vrahatis, Androulakis, Lambrinos, and Magoulas (2000), Ye and Lin (2003) and Hamm et al. (2007).

4. The continuous global optimization model

Next we will discuss a global optimization based approach that will be subsequently applied to handle illustrative ANN calibration problems. In accordance with the notations introduced earlier, we use the following symbols and assumptions:

- x ∈ R^n: n-dimensional real-valued vector of decision variables;
- e: R^n → R: continuous (scalar-valued) objective function;
- D ⊂ R^n: non-empty set of feasible solutions, a proper subset of R^n.

The feasible set D is assumed to be closed and bounded; it is defined by

- l ∈ R^n, u ∈ R^n: component-wise finite lower and upper bounds on x; and
- g: R^n → R^m: an (optional) m-vector of continuous constraint functions.

Applying these notations, we shall consider the global optimization model

\text{Minimize } e(x) \quad \text{subject to} \quad x \in D := \{\, x : l \le x \le u, \ g_j(x) \le 0, \ j = 1, \dots, m \,\}. \qquad (9)
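To illustrate how the induced ANN calibration instance of (9) can be assembled for a 1-J-1 network (cf. Sections 2 and 6), a self-contained Mathematica sketch is given below. Since the LGO/MathOptimizer Professional calls are not reproduced in the paper, the built-in global optimizer NMinimize is used here purely as a stand-in; calibrateANN and the parameter symbols a, w, h are hypothetical names introduced for this sketch.

```mathematica
(* Box-constrained calibration of a 1-J-1 ANN against T uniformly spaced
   training points of a target function f on [0, 1]; cf. (2), (4)-(6), (9). *)
calibrateANN[f_, T_Integer, J_Integer, {lb_, ub_}] :=
  Module[{s, o, alpha, y, theta, vars, model, err, box},
    s = Range[0., 1., 1./(T - 1)];            (* training inputs s_t *)
    o = f /@ s;                               (* exact observations o_t = f(s_t) *)
    alpha = Array[a, J]; y = Array[w, J]; theta = Array[h, J];
    vars = Join[alpha, y, theta];             (* J(I + 2) = 3 J decision variables *)
    model[x_] := Sum[alpha[[j]]/(1 + Exp[-(y[[j]] x + theta[[j]])]), {j, J}];
    err = Total[(o - (model /@ s))^2];        (* error function (6) *)
    box = And @@ Thread[lb <= vars <= ub];    (* box constraints (5) *)
    (* NMinimize is only a stand-in for the LGO/MOP global-local search *)
    NMinimize[{err, box}, vars, Method -> "DifferentialEvolution"]
  ]
```

For instance, the setting of Example 1 in Section 6 would correspond to the call calibrateANN[Sin[2 Pi #] Cos[4 Pi #] &, 100, 7, {-20, 20}]; of course, the numerical results reported in this paper were obtained with MathOptimizer Professional rather than with this stand-in.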

The concise model formulation (9) covers numerous special cases under additional structural specifications. The model (9) also subsumes the ANN calibration model statement summarized by formulas (4)–(6), with the optionally added constraints (7) and/or (8). The analytical conditions preceding the GO model statement guarantee that the set of global solutions to (9) is non-empty, and the postulated model structure supports the application of


properly defined globally convergent deterministic and stochastic optimization algorithms. For related technical details that are outside of the scope of the present discussion, consult Pintér (1996). For further comprehensive expositions regarding global optimization models, algorithms and applications, consult also Horst and Pardalos (1995) and Pardalos and Romeijn (2002). In spite of the stated theoretical guarantees, the numerical solution of GO model instances can be a tough challenge, and even low-dimensional models can be hard to solve. These remarks apply also to the calibration of ANN instances, as will be illustrated later on.

5. The LGO software package for global and local nonlinear optimization

The proper calibration of ANNs typically requires GO methodology. In addition, the ANN modeling framework has an inherent "black box" feature. Namely, the error function e(x) is evaluated by a numerical procedure that – in a model development and testing exercise, as well as in many applications – may be subject to changes and modifications. Similar situations often arise in the realm of nonlinear optimization: their handling requires easy-to-use and flexible solver technology, without the need for model information that could be difficult to obtain.

The Lipschitz Global Optimizer (LGO) solver suite has been designed to support the solution of "black box" problems, in addition to analytically defined models. Developed since the late 1980s, LGO is currently available for a range of compiler platforms (C, C++, C#, FORTRAN 77/90/95), with seamless links to several optimization modeling languages (AIMMS, AMPL, GAMS, MPL), MS Excel spreadsheets, and to the leading high-level technical computing systems Maple, Mathematica, and MATLAB. For algorithmic and implementation details not discussed here, we refer to Pintér (1996, 1997, 2002, 2005, 2007, 2009, 2010a, 2010b, 2010c), Pintér and Kampas (2005) and Pintér, Linder, and Chin (2006).

The overall design of LGO is based on the flexible combination of nonlinear optimization algorithms, with corresponding theoretical global and local convergence properties. Hence, LGO – as a standalone solver suite – can be used for both global and local optimization; the local search mode can also be used without a preceding global search phase. In numerical practice – that is, in hardware resource-limited and time-limited runs – each of LGO's global search options generates a global solution estimate (or estimates) that is refined by the seamlessly following local search mode(s). This way, the expected result of using LGO is a global and/or local search based high-quality feasible solution that meets at least the local optimality criteria. (To guarantee theoretical local optimality, standard local smoothness conditions need to apply.) At the same time, one should keep in mind that no global – or, in fact, any other – optimization software will "always" work satisfactorily, with default settings and under resource limitations related to model size, time, model function evaluation, or to other preset usage limits.

Extensive numerical tests and a growing range of practical applications demonstrate that LGO and its platform-specific implementations can find high-quality numerical solutions to complicated and sizeable GO problems. Optimization models with embedded computational procedures and modules are discussed, e.g. by Pintér (1996, 2009), Pintér et al. (2006) and Kampas and Pintér (2006, 2010).
Examples of real-world optimization challenges handled by various LGO implementations are cited here from environmental systems modeling and management (Pintér, 1996), laser design (Isenor, Pintér, & Cada, 2003), intensity modulated cancer therapy planning (Tervo, Kolmonen, Lyyra-Laitinen, Pintér, & Lahtinen, 2003), the operation of integrated oil and gas production systems (Mason et al., 2007), vehicle component design (Goossens, McPhee, Pintér, & Schmitke, 2007), various packing problems with industrial relevance (Castillo, Kampas, & Pintér, 2008), fuel processing technology design (Pantoleontos et al., 2009), and financial decision making (Çağlayan & Pintér, submitted for publication).

6. Numerical examples and discussion

6.1. Computational environment

The testing and evaluation of optimization software implementations is a significant (although sometimes under-appreciated) research area. One needs to follow or to establish objective evaluation criteria, and to present fully reproducible results. The software tests themselves are conducted by solving a chosen set of test problems, and by applying criteria such as the reliability and efficiency of the solver(s) used, in terms of the quality of solutions obtained and the computational effort required. For examples of benchmarking studies in the ANN context, we refer to Prechelt (1994) and Hamm et al. (2007). In the GO context, consult, e.g. Ali, Khompatraporn, and Zabinsky (2005), Khompatraporn, Pintér, and Zabinsky (2005) and Pintér (2002, 2007), with further topical references.

Here we will illustrate the performance of an LGO implementation in the context of ANN calibration, by solving several function approximation problems. Although the test functions presented are merely one-dimensional, their high-quality approximation requires the solution of induced GO problems with several tens of variables. The key features of the computational environment used and the numerical tests are summarized below; the detailed numerical results are available from the author upon request.

- Hardware: PC, 32-bit, Intel™ Core™ 2 Duo CPU T5450 @ 1.66 GHz, 2.00 GB RAM. Let us remark that this is an average laptop PC with modest capabilities, as of 2011.
- Operating system: Windows Vista.
- Model development and test environment: Mathematica (Wolfram Research, 2011).
- Global optimization software: MathOptimizer Professional (MOP). MathOptimizer Professional is the product name (trademark) of the LGO solver implementation that is seamlessly linked to Mathematica. For details regarding the MOP software, consult, e.g. Pintér and Kampas (2005) and Kampas and Pintér (2006).
- ANN computational model used: 1-J-1 type feed-forward network instances that have a single input node, a single output node, and J processing units. Recall that this ANN structure directly follows the function approximation formula expressed by (2). The ANN modeling framework, its instances and the test examples discussed here have been implemented (by the author) in native Mathematica code.
- ANN initialization by assigned starting parameter values is not necessary, since MOP uses a global scope optimization approach by default. However, all LGO solver implementations also support the optional use of initial values in their constrained local optimization mode, without applying first a global search mode.
- A single optimization run is executed in all cases: repetitions are not needed, since MathOptimizer Professional produces identical runs under the same initializations and solver options. Repeated runs with possibly changing numerical results can be easily generated by changing several options of MOP. These options make possible, e.g. the selection of the global search mode applied, the use of different starting points in the local search mode, the use of different random seeds, and the allocation of computational resources and runtime.
- In all test problems the optimized function approximations are based on a given set of training points, in line with the modeling framework discussed in Section 2.


- Simulated noise structure: none. In other words, we assume that all training data are exact. This assumption could be easily relaxed, e.g. by introducing a probabilistic noise structure as has been done in Pintér (2003).
- Key performance measure: standard error (the root of the mean square error, RMSE) calculated for the best parameterization found. For completeness, recalling (6), RMSE is defined by the expression (e(x)/T)^{1/2} = ( \sum_{t=1}^{T} (f(s_t) - a(x, s_t))^2 / T )^{1/2}.
- Additional Mathematica features used: documentation and visualization capabilities related to the ANN models and tests discussed here, all included in a single (Mathematica notebook) document.
- Additional MOP features: the automatic generation of optimization test model code in C or FORTRAN (as per user preferences), and the automatic generation of optimization input and result text files. Let us point out that these features can be put to good use (also) in developing and testing compiler code based ANN implementations.

In the numerical examples presented below, our objective is to "learn" the graph of certain one-dimensional functions defined on the interval [0, 1], based on a preset number of uniformly spaced training points. As earlier, we apply the notation T for the number of training data, and J for the number of processing units in the hidden layer.

6.2. Example 1

The first (and easiest) test problem presented here has been studied also by Toh (1999), as well as by Ye and Lin (2003). The function to approximate is defined as

f(x) = \sin(2\pi x)\cos(4\pi x), \quad 0 \le x \le 1. \qquad (10)
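Under the setup described next (T = 100 equidistant training points, J = 7 hidden units), the hypothetical calibrateANN sketch shown in Section 4 could be exercised on (10) as follows, with the RMSE of Section 6.1 recovered from the returned error. This is only an illustration of the workflow, not a reproduction of the reported MOP runs.

```mathematica
(* Example 1 target function (10) *)
f1 = Sin[2 Pi #] Cos[4 Pi #] &;

(* Stand-in calibration run (cf. the calibrateANN sketch in Section 4) *)
{bestErr, bestParams} = calibrateANN[f1, 100, 7, {-20, 20}];

(* Root mean square error over the T = 100 training points, cf. Section 6.1 *)
rmse = Sqrt[bestErr/100]
```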

We set the number of training data as T = 100 and postulate the number of processing units as J = 7. Recalling the discussion of Section 2 (cf. especially formula (2)), this choice of J induces a non-trivial box-constrained GO model in 21 variables. Applying first MOP's local search mode (LS) only, we obtained an overall function approximation characterized by RMSE 0.0026853412. With default MOP settings, the number of error function evaluations is 60,336; the corresponding runtime is 4.40 s. (The program execution speed could somewhat vary, of course, depending on the actual tasks handled by the computer used.)

For comparison, next we also solved the same problem using LGO's most extensive global search mode (stochastic multi-start, MS) followed by corresponding local searches from the best points found. This operational mode is referred to briefly as MSLS, and it has been used in all subsequent numerical examples. In the MSLS operational mode, we set (postulate) the variable ranges [lb, ub] = [−20, 20] for all ANN model parameters, and we set the limit for the global search steps (the number of error function evaluations) as gss = 10^6. Based on our numerical experience related to GO, this does not seem to be an "exorbitant" setting, since we have to optimize 21 parameters globally. (For a simple comparison, if we aimed at just 10% precision to find the best variable setting in 21 dimensions by applying a uniform grid based search, then this naïve approach would require 10^21 function evaluations.) Applying the MSLS operational mode, we received RMSE 0.0001853324. This means a significant 93% error reduction compared to the already rather small average error found using only the LS operational mode. The actual number of function evaluations is 1,083,311; the corresponding runtime is 77.47 s. To illustrate and summarize our result visually, function (10) is displayed below: see the left hand side (lhs) of Fig. 2. This is directly followed by the list plot produced by the output of the optimized ANN at the equidistant training sample arguments: see the right hand side (rhs) of Fig. 2.

Even without citing the actual parameter values here, it is worth noting that the optimized ANN parameterizations found by the local (LS) and global (MSLS) search modes are markedly different. This finding, together with the results of our further tests, indicates the potentially massive multi-modality of the objective function in similar – or far more complicated – approximation problems. Let us also remark that several components of the ANN parameter vectors found lie on the boundary of the preset interval range [−20, 20]. This finding indicates a possible need for extending the search region: we did not do this in our limited numerical study, since the approximation results are very satisfactory even in this case. Obviously, one could experiment further, e.g. by changing the value of J (the number of processing units), as well as by changing the parameter bounds, if these measures are expected to lead to markedly improved results. Let us also note that in this particular test example, we could have perhaps reduced the global search effort, or just used the local search based estimate which already seems to be "good enough". Such experimental "streamlining" of the optimization procedure may not always work, however, and we are convinced that proper global scope search is typically needed in calibrating ANNs. Hence, we will solve all further test problems using the MSLS operational mode.

The numerical examples presented below are chosen to be increasingly more difficult, although all are "merely" approximation problems related to one-dimensional functions. (Recall at the same time that the corresponding GO problems have tens of variables.) Obviously, one could "fabricate" functions, especially so in higher dimensions, that could become very difficult to approximate solely by applying ANN technology. This comment is in line with the observation of Cybenko cited earlier, regarding the possible need for "astronomical" choices of J. This caveat should be kept in mind when applying ANNs to solve complicated real-world applications.

Fig. 2. Function (10) and the list plot produced by the optimized ANN.

6.3. Example 2

In this example, our objective is to approximate the function shown below:

f(x) = \sin(2\pi x)\sin(3\pi x)\sin(5\pi x), \quad 0 \le x \le 1. \qquad (11)


Fig. 3. Function (11) and the list plot produced by the optimized ANN.


Fig. 4. Function (12) and the list plot produced by a sub-optimally trained ANN.

To handle this problem, we use again T = 100, but now assume (postulate) that J = 10. Notice that the number of processing units has been increased heuristically, reflecting our belief that function (11) could be more difficult to reproduce than (10). This setting leads to a 30-variable GO model; however, we keep the allocated global search effort gss = 10^6 as in Example 1. We conjecture again that the induced model calibration problem is highly multi-extremal. The global search based numerical solution yields RMSE 0.0002453290; the corresponding runtime is 99.94 s spent on 1,013,948 function evaluations. For a visual comparison, the function (11) is displayed below, directly followed by the list plot produced by the output of the trained ANN: see the lhs and rhs of Fig. 3, respectively.

6.4. Example 3

In the last illustrative example presented here, our objective is to approximate the function shown below by (12): here log denotes the natural logarithm function.

f(x) = \sin(5\pi x)\sin(7\pi x)\sin(11\pi x)\log(1 + x), \quad 0 \le x \le 1. \qquad (12)

As above, we set T = 100, J = 10, and use the MSLS search mode with 10^6 global search steps; furthermore, we now extend the search region to [−50, 50] for each ANN model parameter. The resulting global search based numerical solution has an RMSE of 0.1811353757, which is noticeably not as good as the previously found RMSE values. The corresponding runtime is 111.61 s, spent on 1,130,366 function evaluations. For comparison, the function (12) is displayed by Fig. 4, directly followed by the list plot produced by the (sub-optimally) trained ANN. Comparing the lhs and rhs of Fig. 4, one can see the apparent discrepancy between the function and its approximation: although the ANN output reflects well some key structural features, it definitely misses some of the "finer details". This finding shows that the postulated ANN structure may not be adequate, and/or the training set may not be of sufficient size, and/or that the allocated global scope search effort may not be enough.

In several subsequent numerical experiments, we attempted to increase the precision of approximating function (12), by increasing one or several of the following factors: the number of training data T, the number of processing units J, the global search effort gss, and the solver runtime allocated to MOP. Obviously, all the above measures could lead to improved results. Without going into unnecessary details, Fig. 5 illustrates that, e.g. the combination T = 200, J = 20, gss = 3,000,000, and the LGO runtime limit set to 30 min leads to a fairly accurate approximation of the entire function: the corresponding RMSE indicator value is 0.0016966244. The total number of function evaluations in this case is 3,527,313; the corresponding runtime is 1360.56 s. Again, by increasing the sampling effort (T), and/or the ANN configuration size (J), and/or the search effort (gss), even better approximations could be found if increased accuracy is required.

Our illustrative results demonstrate that GO based ANN calibration can be used to handle non-trivial function approximation problems, with a reasonable computational effort using today's average personal computer. Our approach (as described) is also directly applicable to solve at least certain types of multi-input and multi-response function approximation and related problems.


Fig. 5. Function (12) and the list plot produced by the globally optimized ANN.


Specifically, if the input data sequences can be processed by independently operated intermediate ANN layers for each function output, then the approach introduced can be applied (repetitively or in parallel) to such multi-response problems. However, if the input sequences need to be aggregated, then in general this will lead to increased GO model dimensionality. This remark hints at the potential difficulty of calibrating general multiple-input multiple-output ANNs.

7. Conclusions

Artificial neural networks are frequently used to model unknown functional relationships between the input and output of some system perceived as a "black box". The general function approximator capability of ANNs is supported by strong theoretical results. At the same time, the ANN framework inherently has a meta-heuristic flavor, since it needs to be instantiated and calibrated for each new model type to solve. Although the selection of a suitable ANN structure can benefit from domain-specific expertise, its calibration often becomes difficult as the complexity of the tasks to handle increases. Until recently, more often than not local – as opposed to global scope – search methods have been applied to parameterize ANNs. These facts call for flexible and transparent ANN implementations, and for improved ANN calibration methodology.

To address some of these issues, a general modeling and global optimization framework is presented to solve ANN calibration problems. The practical viability of the proposed ANN calibration approach is illustrated by finding high-quality numerical approximations of visibly challenging one-dimensional functions. For these experiments, we have implemented a flexible ANN framework and an easily expandable set of test functions in the technical computing system Mathematica. The MathOptimizer Professional global–local optimization software has been used to solve the induced multi-dimensional ANN calibration problems. After defining the (in principle, arbitrary continuous) function to approximate, one needs to change only a few input data to define a new sampling plan, to define the actual ANN instance from the postulated ANN family, and to activate the optimization engine. The corresponding results can then be directly observed and visualized, for optional further analysis and comparison.

Due to its universal function approximator capability, the ANN computational structure flexibly supports the modeling of many important research problems and real world applications. From the broad range of ANN applications several areas are closely related to our discussion: examples are signal and image processing (Masters, 1994; Watkin, 1993), data classification and pattern recognition (Karthikeyan, Gopal, & Venkatesh, 2008; Ripley, 1996), function approximation (Toh, 1999; Ye & Lin, 2003), time series analysis and forecasting (Franses & van Dijk, 2000; Kajitani, Hipel, & McLeod, 2005), experimental design in engineering systems (Röpke et al., 2005), and the prediction of trading signals of stock market indices (Tilakaratne, Mammadov, & Morris, 2008).

Acknowledgements

The author wishes to acknowledge the valuable contributions of Dr. Frank Kampas to the development of MathOptimizer Professional. Thanks are also due to Dr. Maria Battarra and to the editors and reviewers for all constructive comments and suggestions regarding the manuscript.
This research has been partially supported by Özyeğin University (Istanbul, Turkey), and by the Simulation and Optimization Project of Széchenyi István University (Győr, Hungary; TÁMOP 4.2.2-08/1-2008-0021).


References

Abraham, A. (2004). Meta learning evolutionary artificial neural networks. Neurocomputing, 56, 1–38.
Ali, M. M., Khompatraporn, C., & Zabinsky, Z. B. (2005). A numerical evaluation of several stochastic algorithms on selected continuous global optimization test problems. Journal of Global Optimization, 31, 635–672.
Auer, P., Herbster, M., & Warmuth, M. (1996). Exponentially many local minima for single neurons. In D. Touretzky et al. (Eds.), Advances in neural information processing systems (Vol. 8, pp. 316–322). Cambridge, MA: MIT Press.
Bar-Yam, Y. (1999). Dynamics of complex systems. Boulder, CO: Westview Press.
Bianchini, M., & Gori, M. (1996). Optimal learning in artificial neural networks: A review of theoretical results. Neurocomputing, 13, 313–346.
Çağlayan, M. O., & Pintér, J. D. (2010). Development and calibration of a currency trading strategy by global optimization. Research Report, Özyeğin University, Istanbul. Submitted for publication.
Castillo, I., Kampas, F. J., & Pintér, J. D. (2008). Solving circle packing problems by global optimization: Numerical results and industrial applications. European Journal of Operational Research, 191, 786–802.
Cichocki, A., & Unbehauen, R. (1993). Neural networks for optimization and signal processing. Chichester, UK: John Wiley & Sons.
Cruse, H. (2006). Neural networks as cybernetic systems (2nd revised ed.). Bielefeld, Germany: Brains, Minds and Media.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2, 303–314.
Franses, P. H., & van Dijk, D. (2000). Nonlinear time series models in empirical finance. Cambridge, UK: Cambridge University Press.
Funahashi, K. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2, 183–192.
Golden, R. M. (1996). Mathematical methods for neural network analysis and design. Cambridge, MA: MIT Press.
Goossens, P., McPhee, J., Pintér, J. D., & Schmitke, C. (2007). Driving innovation: How mathematical modeling and optimization increase efficiency and productivity in vehicle design. Technical Memorandum. Waterloo, ON: Maplesoft.
Grippo, L. (1994). A class of unconstrained minimization methods for neural network training. Optimization Methods and Software, 4(2), 135–150.
Hamm, L., Brorsen, B. W., & Hagan, M. T. (2007). Comparison of stochastic global optimization methods to estimate neural network weights. Neural Processing Letters, 26, 145–158.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.
Horst, R., & Pardalos, P. M. (Eds.). (1995). Handbook of global optimization (Vol. 1). Dordrecht: Kluwer Academic Publishers.
Isenor, G., Pintér, J. D., & Cada, M. (2003). A global optimization approach to laser design. Optimization and Engineering, 4, 177–196.
Jordanov, I., & Brown, R. (1999). Neural network learning using low-discrepancy sequence. In N. Foo (Ed.), AI'99, Lecture notes in artificial intelligence (Vol. 1747, pp. 255–267). Berlin, Heidelberg: Springer-Verlag.
Kajitani, Y., Hipel, K. W., & McLeod, A. I. (2005). Forecasting nonlinear time series with feed-forward neural networks: A case study of Canadian lynx data. Journal of Forecasting, 24, 105–117.
Kampas, F. J., & Pintér, J. D. (2010). Advanced nonlinear optimization in Mathematica. Research Report, Özyeğin University, Istanbul. Submitted for publication.
Kampas, F. J., & Pintér, J. D. (2006). Configuration analysis and design by using optimization tools in Mathematica. The Mathematica Journal, 10(1), 128–154.
Karthikeyan, B., Gopal, S., & Venkatesh, S. (2008). Partial discharge pattern classification using composite versions of probabilistic neural network inference engine. Expert Systems with Applications, 34, 1938–1947.
Khompatraporn, Ch., Pintér, J. D., & Zabinsky, Z. B. (2005). Comparative assessment of algorithms and software for global optimization. Journal of Global Optimization, 31, 613–633.
Kolmogorov, A. N. (1957). On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. Doklady Akademii Nauk SSSR, 114, 953–956.
Lang, B. (2005). Monotonic multi-layer perceptron networks as universal approximators. In W. Duch et al. (Eds.), Artificial neural networks: Formal models and their applications – ICANN 2005, Lecture notes in computer science (Vol. 3697, pp. 31–37). Berlin, Heidelberg: Springer-Verlag.
Mason, T. L., Emelle, C., van Berkel, J., Bagirov, A. M., Kampas, F., & Pintér, J. D. (2007). Integrated production system optimization using the Lipschitz global optimizer and the discrete gradient method. Journal of Industrial and Management Optimization, 3(2), 257–277.
Masters, T. (1994). Signal and image processing with neural networks. New York: John Wiley & Sons.
Mehrotra, K., Mohan, C. K., & Ranka, S. (1997). Elements of artificial neural nets. Cambridge, MA: MIT Press.
Pardalos, P. M., & Romeijn, H. E. (Eds.). (2002). Handbook of global optimization (Vol. 2). Dordrecht: Kluwer Academic Publishers.
Pantoleontos, G., Basinas, P., Skodras, G., Grammelis, P., Pintér, J. D., Topis, S., & Sakellaropoulos, G. P. (2009). A global optimization study on the devolatilisation kinetics of coal, biomass and waste fuels. Fuel Processing Technology, 90, 762–769.
Park, J., & Sandberg, I. W. (1991). Universal approximation using radial-basis-function networks. Neural Computation, 3(2), 246–257.
Park, J., & Sandberg, I. W. (1993). Approximation and radial-basis-function networks. Neural Computation, 5(2), 305–316.
Pintér, J. D. (1996). Global optimization in action. Dordrecht: Kluwer Academic Publishers.
Pintér, J. D. (1997). LGO – A program system for continuous and Lipschitz optimization. In I. M. Bomze, T. Csendes, R. Horst, & P. M. Pardalos (Eds.), Developments in global optimization (pp. 183–197). Dordrecht: Kluwer Academic Publishers.
Pintér, J. D. (2002). Global optimization: Software, test problems, and applications. In P. M. Pardalos & H. E. Romeijn (Eds.), Handbook of global optimization (Vol. 2, pp. 515–569). Dordrecht: Kluwer Academic Publishers.
Pintér, J. D. (2003). Globally optimized calibration of nonlinear models: Techniques, software, and applications. Optimization Methods and Software, 18(3), 335–355.
Pintér, J. D. (2005). Nonlinear optimization in modeling environments: Software implementations for compilers, spreadsheets, modeling languages, and integrated computing systems. In V. Jeyakumar & A. M. Rubinov (Eds.), Continuous optimization: Current trends and modern applications (pp. 147–173). New York: Springer Science + Business Media.
Pintér, J. D. (2007). Nonlinear optimization with GAMS/LGO. Journal of Global Optimization, 38, 79–101.
Pintér, J. D. (2009). Software development for global optimization. In P. M. Pardalos & T. F. Coleman (Eds.), Global optimization: Methods and applications. Fields Institute Communications (Vol. 55, pp. 183–204). Berlin: Springer.
Pintér, J. D. (2010a). LGO – A model development and solver system for nonlinear (global and local) optimization. User's guide (current version). Distributed by Pintér Consulting Services, Inc., Canada.
Pintér, J. D. (2010b). Spreadsheet-based modeling and nonlinear optimization with Excel-LGO. User's guide. With technical contributions by B. C. Şal and F. J. Kampas (current version). Distributed by Pintér Consulting Services, Inc., Canada.
Pintér, J. D. (2010c). MATLAB-LGO user's guide. With technical contributions by B. C. Şal and F. J. Kampas (current version). Distributed by Pintér Consulting Services, Inc., Canada.
Pintér, J. D., & Kampas, F. J. (2005). Nonlinear optimization in Mathematica with MathOptimizer Professional. Mathematica in Education and Research, 10(2), 1–18.
Pintér, J. D., Linder, D., & Chin, P. (2006). Global Optimization Toolbox for Maple: An introduction with illustrative applications. Optimization Methods and Software, 21(4), 565–582.
Prechelt, L. (1994). PROBEN1 – A set of benchmarks and benchmarking rules for neural network training algorithms. Technical Report 21/94, Fakultät für Informatik, Universität Karlsruhe, Germany.
Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.
Röpke, K. et al. (Eds.). (2005). DoE – Design of experiments. Munich, Germany: IAV GmbH, Verlag Moderne Industrie / sv Corporate Media GmbH.
Sexton, R. S., Alidaee, B., Dorsey, R. E., & Johnson, J. D. (1998). Global optimization for artificial neural networks: A tabu search application. European Journal of Operational Research, 106, 570–584.
Smith, M. (1993). Neural networks for statistical modeling. New York: Van Nostrand Reinhold.
Steeb, W.-H. (2005). The nonlinear workbook (3rd ed.). Singapore: World Scientific.
Sutskever, I., & Hinton, G. E. (2008). Deep, narrow sigmoid belief networks are universal approximators. Neural Computation, 20, 2629–2636.
Tervo, J., Kolmonen, P., Lyyra-Laitinen, T., Pintér, J. D., & Lahtinen, T. (2003). An optimization-based approach to the multiple static delivery technique in radiation therapy. Annals of Operations Research, 119, 205–227.
Tilakaratne, Ch. D., Mammadov, M. A., & Morris, S. A. (2008). Predicting trading signals of stock market indices using neural networks. In W. Wobcke & M. Zhang (Eds.), AI 2008: Advances in artificial intelligence, Lecture notes in computer science (Vol. 5360, pp. 522–531). Berlin, Heidelberg: Springer-Verlag.
Toh, K.-A. (1999). Fast deterministic global optimization for FNN training. In Proceedings of the IEEE international conference on systems, man, and cybernetics, Tokyo, Japan (pp. V413–V418).
Vrahatis, M. N., Androulakis, G. S., Lambrinos, J. N., & Magoulas, G. D. (2000). A class of gradient unconstrained minimization algorithms with adaptive stepsize. Journal of Computational and Applied Mathematics, 114, 367–386.
Watkin, T. L. H. (1993). Optimal learning with a neural network. Europhysics Letters, 21(8), 871–876.
Wolfram Research (2011). Mathematica (current version 8.0.1). Champaign, IL: Wolfram Research, Inc.
Ye, H., & Lin, Z. (2003). Global optimization of neural network weights using subenergy tunneling function and ripple search. In Proceedings of the IEEE ISCAS, Bangkok, Thailand, May 2003 (pp. V725–V728).