Artificial Neural Networks

Encyclopedia of Physical Science and Technology, EN001G-837, May 26, 2001



Steven Walczak, University of Colorado, Denver

Narciso Cerpa, University of Talca, Chile

I. Introduction to Artificial Neural Networks
II. Need for Guidelines
III. Input Variable Selection
IV. Learning Method Selection
V. Architecture Design
VI. Training Samples Selection
VII. Conclusions

GLOSSARY

Architecture The topology into which an artificial neural network is organized; processing elements (neurons) can be interconnected in different ways.
Artificial neural network A model that emulates a biological neural network using a reduced set of concepts from a biological neural system.
Learning method An algorithm for training the artificial neural network.
Processing element An artificial neuron that receives input(s), processes the input(s), and delivers a single output.
Summation function Computes the internal stimulation, or activation level, of the artificial neuron.
Training sample Training cases that are used to adjust the weights.
Transformation function A linear or nonlinear relationship between the internal activation level and the output.
Weight The relative importance of each input to a processing element.

ARTIFICIAL NEURAL NETWORKS (ANNs) have been used to support applications across a variety of business and scientific disciplines in recent years. These computational models of neuronal activity in the brain are defined and illustrated through some brief examples. Neural network designers typically perform extensive knowledge engineering and incorporate a significant amount of domain knowledge into ANNs. Once the input variables present in the neural network's input vector have been selected, training data for these variables with known output values must be acquired. Recent research has shown that smaller training set sizes produce better performing neural networks, especially for time-series applications.



In summary, this article presents an introduction to artificial neural networks and a general heuristic methodology for designing high-quality ANN solutions to various domain problems.

I. INTRODUCTION TO ARTIFICIAL NEURAL NETWORKS

Artificial neural networks (sometimes just called neural networks or connectionist models) provide a means for dealing with complex pattern-oriented problems of both categorization and time-series (trend analysis) types. The nonparametric nature of neural networks enables models to be developed without any prior knowledge of the distribution of the data population, or of possible interaction effects between variables, as required by commonly used parametric statistical methods. As an example, multiple regression requires that the error term of the regression equation be normally distributed (with mean µ = 0) and homoscedastic. Another statistical technique frequently used for performing categorization is discriminant analysis, but discriminant analysis requires that the predictor variables be multivariate normally distributed. Because ANN models are free of such assumptions, developing a domain problem solution is easier with artificial neural networks. Another factor contributing to the
success of ANN applications is their ability to create nonlinear models as well as traditional linear models; hence, artificial neural network solutions are applicable across a wider range of problem types (both linear and nonlinear). In the following sections, a brief history of artificial neural networks is presented. Next, a detailed examination of the components of an artificial neural network model is given with respect to the design of artificial neural network models of business and scientific domain problems.

A. Biological Basis of Artificial Neural Networks

Artificial neural networks are a technology based on studies of the brain and nervous system, as depicted in Fig. 1. These networks emulate a biological neural network, but with a reduced set of concepts from biological neural systems. Specifically, ANN models simulate the electrical activity of the brain and nervous system. Processing elements (also known as neurodes or perceptrons) are connected to other processing elements. Typically the neurodes are arranged in a layer or vector, with the output of one layer serving as the input to the next layer and possibly other layers. A neurode may be connected to all or a subset of the neurodes in the subsequent layer, with these connections simulating the synaptic connections of the brain. Weighted data signals entering a neurode

FIGURE 1 Sample artificial neural network architecture (not all weights are shown).


simulate the electrical excitation of a nerve cell and, consequently, the transference of information within the network or brain. The input values to a processing element, i_n, are multiplied by a connection weight, w_{n,m}, that simulates the strengthening of neural pathways in the brain. It is through the adjustment of the connection strengths, or weights, that learning is emulated in ANNs. All of the weight-adjusted input values to a processing element are then aggregated using a vector-to-scalar function, such as summation (i.e., y = Σ_i w_i x_i), averaging, input maximum, or mode value, to produce a single input value to the neurode. Once the input value is calculated, the processing element uses a transfer function to produce its output (and consequently the input signals for the next processing layer). The transfer function transforms the neurode's input value, typically using a sigmoid, hyperbolic-tangent, or other nonlinear function. The process is repeated between layers of processing elements until a final output value, o_n, or vector of values is produced by the neural network. Theoretically, to simulate the asynchronous activity of the human nervous system, the processing elements of the artificial neural network should also be activated by the weighted input signal in an asynchronous manner. Most software and hardware implementations of artificial neural networks, however, implement a more discretized approach that guarantees that each processing element is activated once for each presentation of a vector of input values.

B. History and Resurgence of Artificial Neural Networks

The idea of combining multiple processing elements into a network is attributed to McCulloch and Pitts in the early 1940s, and Hebb in 1949 is credited with being the first to define a learning rule to explain the behavior of networks of neurons. In the late 1950s, Rosenblatt developed the first perceptron learning algorithm.
Soon after Rosenblatt's discovery, Widrow and Hoff developed a similar learning rule for electronic circuits. Artificial neural network research continued strongly throughout the 1960s. In 1969, Minsky and Papert published their book, Perceptrons, in which they showed the computational limits of single-layer neural networks, which were the type of artificial neural network in use at the time. The theoretical limitations of perceptron-like networks led to a decrease in funding and, subsequently, in research on artificial neural networks. Finally, in 1986, McClelland and Rumelhart and the PDP research group published the Parallel Distributed Processing texts. These texts published the backpropagation learning algorithm, which enabled multiple
layers of perceptrons to be trained [and thus introduced the hidden layer(s) to artificial neural networks]; this was the birth of MLPs (multilayer perceptrons). Following the introduction of MLPs and the backpropagation algorithm, a revitalization of research and development efforts in artificial neural networks took place. Since then, ANNs have been used to support applications across a diversity of business and scientific disciplines (e.g., financial, manufacturing, marketing, telecommunications, and biomedical). This proliferation of neural network applications has been facilitated by the emergence of neural network shells (e.g., Brainmaker, Neuralyst, Neuroshell, and Professional II Plus) and tool add-ins (for SAS, MATLAB, and Excel) that provide developers with the means for specifying the ANN architecture and training the neural network. These shells and add-in tools enable ANN developers to build ANN solutions without requiring in-depth knowledge of ANN theory or terminology. See either of these World Wide Web sites (active on December 31, 2000): http://www.faqs.org/faqs/ai-faq/neural-nets/part6/ or http://www.emsl.pnl.gov:2080/proj/neuron/neural/systems/software.html for additional links to commercially available neural network shell software. Neural networks may use different learning algorithms, and we can classify them into two major categories based on the input format: binary-valued input (i.e., 0s and 1s) or continuous-valued input. These two categories can be subdivided into supervised learning and unsupervised learning. Supervised learning algorithms use the difference between the desired and actual output to adjust, and finally determine, the appropriate weights for the ANN. In a variation of this approach, some supervised learning algorithms are informed only whether the output for a given input is correct, and the network adjusts its weights with the aim of achieving correct results.
Hopfield networks (binary) and backpropagation (continuous) are examples of supervised learning algorithms. Unsupervised learning algorithms receive only input stimuli, and the network organizes itself so that its hidden processing elements respond differently to each set of input stimuli; the network does not require information on the correctness of the output. ART I (binary) and Kohonen networks (continuous) are examples of unsupervised learning algorithms. Neural network applications are frequently viewed as black boxes that mystically determine complex patterns in data. However, ANN designers must perform extensive knowledge engineering and incorporate a significant amount of domain knowledge into artificial neural networks. Successful artificial neural network development requires a deep understanding of the steps involved in designing ANNs.
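As a concrete illustration of the processing-element computation described in Section I.A (weighted summation followed by a nonlinear transfer function), the following minimal sketch computes a single neurode's output. The sigmoid transfer function and the two-input example are illustrative choices, not prescribed by this article.

```python
import math

def neurode_output(inputs, weights, transfer=lambda a: 1.0 / (1.0 + math.exp(-a))):
    """Aggregate weighted inputs into one activation level, then apply the transfer function."""
    activation = sum(w * x for w, x in zip(weights, inputs))  # summation: y = sum of w_i * x_i
    return transfer(activation)

# With all weights at zero, the activation is 0 and the sigmoid returns its midpoint.
print(neurode_output([1.0, 0.5], [0.0, 0.0]))  # → 0.5
```

In a layered network, the outputs of one layer of such neurodes would serve as the inputs to the next layer, as described above.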

ANN design requires the developer to make many decisions, such as input values, training and test data set sizes, learning algorithm, network architecture or topology, and transformation function. Several of these decisions depend on each other. For example, the ANN architecture and the learning algorithm determine the type of input value (i.e., binary or continuous). Therefore, it is essential to follow a methodology, or well-defined sequence of steps, when designing ANNs. These steps are listed below:

• Determine the data to use.
• Determine the input variables.
• Separate the data into training and test sets.
• Define the network architecture.
• Select a learning algorithm.
• Transform the variables to network inputs.
• Train (repeat until the ANN error is below an acceptable value).
• Test (on a hold-out sample to validate the generalization of the ANN).

In the following sections we discuss the need for guidelines, and present heuristics for input variable selection, learning method selection, architecture design, and training sample selection. Finally, we conclude with a summary of guidelines for ANN design.
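Two of the steps above (separating data into training and test sets, and transforming variables to network inputs) can be sketched as follows. The 80/20 split, the min-max scaling, and the toy data set are illustrative assumptions, not recommendations from this article.

```python
import random

def split_train_test(samples, test_fraction=0.2, seed=0):
    """Separate the data into a training set and a hold-out test set."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def scale_to_unit(values):
    """Transform a variable to network inputs via min-max scaling into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

data = [(x, 2 * x) for x in range(100)]   # hypothetical (input, output) pairs
train, test = split_train_test(data)
print(len(train), len(test))              # → 80 20
```

The remaining steps (architecture definition, learning algorithm selection, training, and testing) are covered in the sections that follow.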

II. NEED FOR GUIDELINES

Artificial neural networks have been applied to a wide variety of business, engineering, medical, and scientific problems. Several research results have shown that ANNs outperform traditional statistical techniques (e.g., regression or logit) as well as other standard machine learning techniques (e.g., the ID3 algorithm) for a large class of problem types. Many of these ANN applications, such as financial time series (e.g., foreign exchange rate forecasts), are difficult to model. Artificial neural networks provide a valuable tool for building nonlinear models of data, especially when the underlying laws governing the system are unknown. Artificial neural network forecasting models have outperformed both statistical and other machine learning models of financial time series, achieving forecast accuracies of more than 60%, and thus are being widely used to model the behavior of financial time series. Other categorization-based applications of ANNs are achieving success rates of well over 90%. Development of effective neural network models is difficult. Most artificial neural network designers develop multiple neural network solutions that vary the network's architecture: the quantity of nodes and their arrangement in hidden layers. Two critical design issues remain a challenge for artificial neural network developers: selection of appropriate input variables, and capturing a sufficient quantity of training examples to permit the neural network to adequately model the application. Many different types of ANN applications have been developed in the past several years and continue to be developed. Industrial applications exist in the financial, manufacturing, marketing, telecommunications, biomedical, and many other domains. While business managers are seeking to develop new applications using ANNs, a basic misunderstanding of the source of intelligence in an ANN persists. As mentioned above, the development of new ANN applications has been facilitated by the emergence of a variety of neural network shells that allow anyone to produce a neural network system by simply specifying the ANN architecture and providing a set of training data to be used by the shell to train the ANN. These shell-based neural networks may fail or produce suboptimal results unless their designers gain a deeper understanding of how to use and incorporate domain knowledge in the ANN. The traditional view of an ANN is of a program that emulates biological neural networks and "learns" to recognize patterns or categorize input data by being trained on a set of sample data from the domain. These programs learn through training and subsequently have the ability to generalize broad categories from specific examples. This is the perceived unique source of intelligence in an ANN. However, experienced ANN application designers typically perform extensive knowledge engineering and incorporate a significant amount of domain knowledge into the design of ANNs even before the learning-through-training process has begun.
The selection of input variables for the ANN is a complex task, owing to the misconception that the more inputs a network is fed, the better the results it produces. This is true only if the information fed is critical to making the decisions; noisy input variables commonly result in very poor generalization performance. Design of optimal neural networks is problematic in that there exists a large number of alternative ANN physical architectures and learning methods, all of which may be applied to a given domain problem. Selecting the appropriate size of the training data set presents another challenge, since it implies direct and indirect costs, and it can also affect generalization performance. A general heuristic, or rule of thumb, for the design of neural networks in time-series domains is that the more knowledge that is available to the neural network for forming its model, the better the ultimate performance of the
neural network. A minimum of 2 years of training data is considered a nominal starting point for financial time series. Time-series models are considered to improve as more data are incorporated into the modeling process. Research has indicated that currency exchange rates have a long-term memory, implying that larger periods of time (data) will produce more comprehensive models and better generalization. However, this has been challenged in recent research and will be discussed in Section VI. Neural network researchers have built forecasting and trading systems with training data spanning from 1 to 16 years, including various training set sizes between the two extremes. However, researchers typically use all of the available data in building the neural network forecasting model, with no attempt to compare the effects of data quantity on the quality of the produced forecasting models. In this article, a set of guidelines for incorporating knowledge into an ANN and using domain knowledge to design optimal ANNs is described. The guidelines for designing ANNs comprise the following steps: knowledge-based selection of input values, selection of a learning method, architecture design, and training sample selection. The majority of the ANN design steps described will focus mainly on feed-forward supervised learning (and more specifically backpropagation) ANN applications. Following these guidelines will enable developers and researchers to take advantage of the power of ANNs and will afford economic benefit by producing an ANN that outperforms similar ANNs with improperly specified design parameters. Artificial neural network designers must determine the optimal set of design criteria, specified as follows:

• Appropriate input (independent) variables.
• Best learning method: Learning methods can be classified as either supervised or unsupervised. Within these categories there are many alternatives, each of which is appropriate for different distributions or types of data.
• Appropriate architecture: The number of hidden layers, depending on the selected learning method, and the quantity of processing elements (nodes) per hidden layer.
• Appropriate amount of training data: For both time-series and classification problems.

The designer's choices for these design criteria will affect the performance of the resulting ANN on out-of-sample data. Inappropriate selection of the values for these design factors may produce ANN applications that perform worse than random selection of an output (dependent) value.

III. INPUT VARIABLE SELECTION

The generalization performance of supervised learning artificial neural networks (e.g., backpropagation) usually improves when the network size is minimized with respect to the weighted connections between processing nodes (elements of the input, hidden, and output layers). ANNs that are too large tend to overfit, or memorize, the input data. Conversely, ANNs with too few weighted connections do not contain enough processing elements to correctly model the input data set, underfitting the data. Both situations result in poor out-of-sample generalization. Therefore, when developing supervised learning neural networks (e.g., backpropagation, radial basis function, or fuzzy ARTMAP), the developer must determine which input variables should be selected to accurately model the domain. ANN designers must spend a significant amount of time on knowledge acquisition, because "garbage in, garbage out" also applies to ANN applications. ANNs, like other artificial intelligence (AI) techniques, are highly dependent on the specification of input variables. However, ANN designers tend to misspecify input variables. Input variable misspecification occurs because ANN designers follow the expert system approach of incorporating as much domain knowledge as possible into an intelligent system, in the belief that ANN performance improves as additional domain knowledge is provided through the input variables. This belief is partly correct: if a sufficient amount of information representing critical decision criteria is not given to an ANN, it cannot develop a correct model of the domain. Moreover, most ANN designers believe that since ANNs learn, they will be able to determine which input variables are important and develop a corresponding model through the modification of the weights associated with the connections between the input layer and the hidden layers. Noisy input variables, however, produce poor generalization performance in ANNs.
The presence of too many input variables causes poor generalization because the ANN models not only the true predictors but also the noise variables. Interaction between input variables produces critical differences in output values, further obscuring the ideal problem model when unnecessary variables are included in the set of input values. As indicated above, and shown in the following sections, both under- and overspecification of input variables produce suboptimal performance. The following section describes guidelines for selecting input (independent) variables for an ANN solution to a domain problem.


A. Determination of Input Variables

Two approaches exist regarding the selection of input parameter variables for supervised learning neural networks. In the first approach, it is argued that since a neural network that utilizes supervised training will adjust its connection weights to better approximate the desired output values, all possible domain-relevant variables should be given to the neural network as input values. The idea is that the connection weights corresponding to nonsignificant variables will approach zero and thus effectively eliminate any effect of these variables on the output value:

lim_{t→∞} ε_t = 0,

where ε_t is the error term of the neural network and t is the number of training iterations. The second approach emphasizes the fact that the weighted connections never achieve a value of true zero, and thus there will always be some contribution to the output value of the neural network by all of the input variables. Hence, ANN designers must research domain variables to determine their potential contribution to the desired output values. Selection of input variables for neural networks is a complex, but necessary, task. Selection of irrelevant variables may cause output value fluctuations of up to 7%. Designers should determine variable applicability through knowledge acquisition from experts in the domain, similar to expert systems development. Highly correlated variables should be removed from the input vector because they can multiply each other's effects and consequently cause noise in the output values. This process should produce an expert-specified set of significant variables that are not intercorrelated, and that will yield optimal performance for supervised learning neural networks. The first step in determining the optimal set of input variables is to perform standard knowledge acquisition. Typically, this involves consultation with multiple domain experts. Various researchers have indicated the need for extensive knowledge acquisition using domain experts to specify ANN input variables. The primary purpose of the knowledge acquisition phase is to guarantee that the input variable set is not underspecified, so that all relevant domain criteria are provided to the ANN. Once a base set of input variables is defined through knowledge acquisition, the set can be pruned to eliminate variables that contribute noise to the ANN and consequently reduce its generalization performance. ANN input variables need to be predictive, but should not be correlated.
Correlated variables degrade ANN performance by interacting with each other, as well as with other elements, to produce a biased effect. The designer should calculate the correlation of pairs of variables—Pearson
correlation matrix—to identify "noise" variables. If two variables have a high correlation, then one of the two may be removed from the set without adversely affecting ANN performance. Alternatively, a chi-square test may be used for categorical variables. The cutoff value for variable elimination is arbitrary and must be determined separately for every ANN application, but any correlation with an absolute value of 0.20 or higher indicates a probable noise source for the ANN. Additional statistical techniques may be applied, depending on the distribution properties of the data set. Stepwise multiple or logistic regression and factor analysis provide viable tools for evaluating the predictive value of input variables and may serve as a secondary filter to the Pearson correlation matrix. Multiple regression and factor analysis perform best with normally distributed linear data, while logistic regression assumes a curvilinear relationship. Several researchers have shown that smaller input variable sets can produce better generalization performance by an ANN. As mentioned above, variables that share a common element, and therefore have high correlation values, should be removed. Smaller input variable sets frequently improve ANN generalization performance and reduce the net cost of data acquisition for development and usage of the ANN. However, care must be taken when removing variables from the ANN's input set to ensure that a complete set of noncorrelated predictor variables remains available for the ANN; otherwise, the reduced variable set may worsen generalization performance.
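The Pearson-correlation screen described above can be sketched as follows. The 0.20 cutoff comes from the text; the variable names and toy data are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def prune_correlated(variables, cutoff=0.20):
    """Keep a variable only if its |correlation| with every kept variable is below the cutoff."""
    kept = []
    for name in variables:
        if all(abs(pearson(variables[name], variables[k])) < cutoff for k in kept):
            kept.append(name)
    return kept

vars_ = {
    "rate":   [1.0, 2.0, 3.0, 4.0],
    "rate2x": [2.1, 4.0, 6.2, 8.0],   # nearly a multiple of "rate": highly correlated
    "noise":  [0.3, -1.0, 0.8, -0.2],
}
print(prune_correlated(vars_))  # → ['rate', 'noise']
```

In practice the kept set would then be passed through the secondary statistical filters mentioned above (stepwise regression, factor analysis) before training.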

IV. LEARNING METHOD SELECTION

After determining a heuristically optimal set of input variables using the methods from the previous section, an ANN learning method must be selected. The learning method is what enables an ANN to correctly model categorization and time-series problems. Artificial neural network learning methods can be divided into two distinct categories: unsupervised learning and supervised learning. Both require a collection of training examples that enable the ANN to model the data set and produce accurate output values. Unsupervised learning systems, such as adaptive resonance theory (ART), self-organizing map (SOM, also called Kohonen networks), or Hopfield networks, do not require that the output value for a training sample be provided at the time of training. Supervised learning systems, such as backpropagation (MLP), radial basis function (RBF), counterpropagation,

FIGURE 2 Kohonen layer (12-node) learning of a square.

or fuzzy ARTMAP networks, require that a known output value for all training samples be provided to the ANN. Unsupervised learning methods determine output values directly from the input variable data set. Most unsupervised learning methods have less computational complexity and less generalization accuracy than supervised methods, because the answers must be contained within, or directly learned from, the input values. Hence, unsupervised learning techniques are typically used for classification problems where the desired classes are self-descriptive. For example, the ART algorithm is a good technique for performing object recognition in pictorial or graphical data. An example of a problem that has been solved with ART-based ANNs is the recognition of handwritten numerals. The handwritten numerals 0–9 are each unique, although in some cases similar (for example, 1 and 7, or 3 and 8), and define the pattern to be learned: the shapes of the numerals 0–9. The advantage of using unsupervised learning methods is that these ANNs can be designed to learn much more rapidly than supervised learning systems.

A. Unsupervised Learning

The unsupervised learning algorithms—ART, SOM (Kohonen), and Hopfield—form categories based on the input data. Typically, this requires a presentation of each of the training examples to the unsupervised learning ANN. Distinct categories of the input vector are formed and re-formed as new input examples are presented to the ANN. The ART learning algorithm establishes a category for the initial training example. As additional examples are presented to the ART-based ANN, new categories are formed based on how closely each new example matches one of the existing categories, with respect to both negative inhibition and positive excitation of the neurodes in the network. In the worst case, an ART-trained ANN may produce M distinct categories for M input examples.
When building ART-based networks, the architecture of the network is given explicitly by the quantity of input values and the desired number of categories (output values). The hidden layer, usually called the F1 layer, is the same size as the input layer and serves as the feature detector for the categories. The output, or F2, layer is defined by the quantity of categories to be formed.

SOM-trained networks are composed of a Kohonen layer of neurodes arranged in two dimensions, as opposed to the vector alignments of most other ANNs. The collection of neurodes (also called the grid) maps input values onto the grid of neurodes so as to preserve order, which means that two input values that are close together will be mapped to the same neurode. The Kohonen grid is connected to both an input and an output layer. As training progresses, the neurodes in the grid attempt to approximate the feature space of the input by adjusting the collection of values mapped onto each neurode. A graphical example of the learning process in the Kohonen layer of the SOM is shown in Fig. 2, which depicts a grid of 12 neurodes (3 × 4) trying to learn the category of a hollow square object. Figures 2a–d represent the two-dimensional coordinates of each of the 12 Kohonen-layer processing elements. The Hopfield training algorithm is similar in nature to the ART training algorithm. Both require a hidden layer (in this case called the Hopfield layer, as opposed to an F1 layer for ART-based ANNs) that is the same size as the input layer. The Hopfield algorithm is based on spin-glass physics and views the state of the network as an energy surface. Both SOM- and Hopfield-trained ANNs have been used to solve traveling salesman problems, in addition to the more traditional image processing applications of unsupervised learning ANNs. Hopfield ANNs are also used for optimization problems. A difficulty with Hopfield ANNs is the capacity of the network, which is estimated at n/(4 ln n), where n is the number of neurodes in the Hopfield layer.
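A single Kohonen-layer training step (winner selection plus neighborhood update) can be sketched as follows; the learning rate, neighborhood radius, and zero initialization are illustrative assumptions, while the 3 × 4 grid mirrors the example of Fig. 2. The Hopfield capacity estimate n/(4 ln n) from the text is also computed for a 12-neurode layer.

```python
import math

def nearest_neurode(grid, x):
    """Return the grid position whose weight vector is closest to input x."""
    return min(grid, key=lambda pos: sum((w - xi) ** 2 for w, xi in zip(grid[pos], x)))

def som_update(grid, x, lr=0.5, radius=1):
    """Move the winning neurode (and its grid neighbors within `radius`) toward input x."""
    wi, wj = nearest_neurode(grid, x)
    for (i, j), w in grid.items():
        if abs(i - wi) <= radius and abs(j - wj) <= radius:
            grid[(i, j)] = [wk + lr * (xk - wk) for wk, xk in zip(w, x)]

# A 3 x 4 Kohonen layer, as in Fig. 2, with all weight vectors started at the origin.
grid = {(i, j): [0.0, 0.0] for i in range(3) for j in range(4)}
som_update(grid, [1.0, 1.0])
print(grid[(0, 0)])  # the winner has moved halfway toward the input → [0.5, 0.5]

# Hopfield capacity estimate for a 12-neurode layer: n / (4 ln n) patterns.
capacity = 12 / (4 * math.log(12))
print(round(capacity, 1))  # → 1.2
```

Repeating such updates over many presentations of the training set is what gradually deforms the grid toward the input feature space, as illustrated in Figs. 2a–d.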
B. Supervised Learning

The backpropagation learning algorithm is one of the most popular design choices for implementing ANNs, since this algorithm is available in, and supported by, most commercial neural network shells and is based on a very robust paradigm. Backpropagation-trained ANNs have been shown to be universal approximators, able to learn arbitrary category mappings. Various researchers have supported this finding and shown the superiority of backpropagation-trained ANNs to other ANN learning paradigms, including radial basis function (RBF), counterpropagation, and fuzzy adaptive resonance theory. An ANN's performance has been found to be more dependent

on data representation than on the selection of a learning rule. Learning rules other than backpropagation perform well if the data from the domain have specific properties. The mathematical specifications of the various ANN learning methods described in this section are available in the references given at the end of this article. Backpropagation is the superior learning method when a sufficient number of noise- and error-free training examples exists, regardless of the complexity of the specific domain problem. Backpropagation ANNs can handle noise in the training data, and they may actually generalize better if some noise is present in the training data. However, too many erroneous training values may prevent the ANN from learning the desired model. For ANN applications that provide only a few training examples or very noisy training data, other supervised learning methods should be selected. RBF networks perform well in domains with limited training sets, and counterpropagation networks perform well when a sufficient number of training examples is available but may contain very noisy data. For resource allocation (configuration) problems, backpropagation produced the best results, although the initial analysis of the problem suggested that counterpropagation might outperform backpropagation due to anticipated noise in the training data set. Hence, although properties of the data population may strongly indicate a preference for a particular training method, because of the strength of the backpropagation network this type of learning method should always be tried in addition to any other methods prescribed by domain data tendencies. Domains that have a large collection of relatively error-free historical examples with known outcomes suit backpropagation ANN implementations. Both ART and RBF ANNs performed worse than backpropagation ANNs for this specific domain problem.
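A minimal numeric sketch of backpropagation training follows: a hand-rolled 2-2-1 network fitted to a toy OR data set. The learning rate, epoch count, and data set are illustrative assumptions, not prescriptions from this article; the point is only to show the error signal being propagated from the output layer back to the hidden-layer weights.

```python
import math, random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(w, x):
    """2-2-1 MLP forward pass; returns hidden activations and the output."""
    h = [sigmoid(w["h"][j][0] * x[0] + w["h"][j][1] * x[1] + w["h"][j][2]) for j in range(2)]
    o = sigmoid(w["o"][0] * h[0] + w["o"][1] * h[1] + w["o"][2])
    return h, o

def train_step(w, x, target, lr=0.5):
    """One backpropagation update: output-layer error is propagated back to hidden weights."""
    h, o = forward(w, x)
    delta_o = (o - target) * o * (1 - o)                   # output-layer error signal
    for j in range(2):                                     # hidden weights use the OLD output weights
        delta_h = delta_o * w["o"][j] * h[j] * (1 - h[j])  # hidden-layer error signal
        w["h"][j][0] -= lr * delta_h * x[0]
        w["h"][j][1] -= lr * delta_h * x[1]
        w["h"][j][2] -= lr * delta_h                       # bias weight
    w["o"][0] -= lr * delta_o * h[0]
    w["o"][1] -= lr * delta_o * h[1]
    w["o"][2] -= lr * delta_o

rng = random.Random(0)
w = {"h": [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)],
     "o": [rng.uniform(-1, 1) for _ in range(3)]}
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]  # toy OR training set
loss = lambda: sum((forward(w, x)[1] - t) ** 2 for x, t in data)
before = loss()
for _ in range(2000):
    for x, t in data:
        train_step(w, x, t)
print(before > loss())  # training reduces the squared error → True
```

Repeated presentation of the training set, with weight adjustments proportional to each example's error, is exactly the "train until the ANN error is below an acceptable value" step listed in the design methodology above.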
Many other ANN learning methods exist, and each is subject to constraints on the type of data that is best processed by that specific learning method. For example, general regression neural networks are capable of solving any problem that can also be solved by a statistical regression model, but do not require that a specific model type (e.g., multiple linear or logistic) be specified in advance. However, regression ANNs suffer from the same constraints as regression models, such as the assumed linear or curvilinear relationship of the data and sensitivity to heteroscedastic error. Likewise, learning vector quantization (LVQ) networks try to divide input values into disjoint categories similar to discriminant analysis and consequently have the same data distribution requirements as discriminant analysis. Research using resource allocation problems has indicated that LVQ
neural networks produced the second best allocation results, which yielded the previously unknown insight that the categories used for allocating resources were unique.

To summarize, backpropagation MLP networks are usually implemented for their robust and generalized problem-solving capabilities. General regression networks are implemented to emulate statistical regression models. Radial basis function networks are implemented to resolve domain problems having a partial sample or a training data set that is too small. Both counterpropagation and fuzzy ARTMAP networks are implemented to cope with extremely noisy training data. The combination of unsupervised (clustering and ART) learning techniques with supervised learning may improve the performance of neural networks in noisy domains. Finally, learning vector quantization networks are implemented to exploit the potential for unique decision criteria of disjoint sets.

The selection of a learning method is an open problem, and ANN designers must use the constraints of the training data set to determine the optimal learning method. If reasonably large quantities of relatively noise-free training examples are available, then backpropagation provides an effective learning method that is relatively easy to implement.
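The selection guidance above can be sketched as a small decision aid. This is a minimal Python sketch, not a method from the article: the function name, the 100-example cutoff for "limited training sets," and the returned labels are all illustrative assumptions.

```python
def candidate_learning_methods(n_examples, noisy, needs_supervision=True):
    """Suggest ANN learning methods from coarse training-data properties.

    A rough decision aid following this section's guidance; the threshold
    of 100 examples is an illustrative placeholder, not a value from the
    text.
    """
    if not needs_supervision:
        return ["ART / clustering (unsupervised)"]
    methods = []
    if n_examples < 100:             # limited training set
        methods.append("radial basis function (RBF)")
    if noisy and n_examples >= 100:  # sufficient but very noisy data
        methods.append("counterpropagation or fuzzy ARTMAP")
    # Per the section's advice, backpropagation should always be tried
    # in addition to any method indicated by the data tendencies.
    methods.append("backpropagation MLP")
    return methods
```

For instance, a small clean data set would suggest trying RBF alongside backpropagation, while a large noisy one would suggest counterpropagation or fuzzy ARTMAP alongside backpropagation.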

V. ARCHITECTURE DESIGN

The architecture of an ANN consists of the number of layers of processing elements or nodes, including input, output, and any hidden layers, and the quantity of nodes contained in each layer. Selection of input variables (i.e., the input vector) was discussed in Section III, and the output vector is normally predefined by the problem to be solved with the ANN. Design of hidden layers is dependent on the selected learning algorithm (discussed in Section IV). For example, unsupervised learning methods such as ART normally require a first hidden layer with a quantity of nodes equal to the size of the input layer. Supervised learning systems are generally more flexible in the design of hidden layers. The remaining discussion focuses on backpropagation ANN systems or other similar supervised learning ANNs.

The designer should determine the following aspects regarding the hidden layers of the ANN architecture: (1) the number of hidden layers and (2) the number of nodes in the hidden layer(s).

A. Number of Hidden Layers

It is possible to design an ANN with no hidden layers, but these types of ANNs can only classify input data that
is linearly separable, which severely limits their application. Artificial neural networks that contain hidden layers can deal robustly with nonlinear and complex problems and therefore can operate on more interesting problems.

The quantity of hidden layers is associated with the complexity of the domain problem to be solved. ANNs with a single hidden layer create a hyperplane. ANNs with two hidden layers combine hyperplanes to form convex decision areas, and ANNs with three hidden layers combine convex decision areas to form decision areas that contain concave regions. The convexity or concavity of a decision region corresponds roughly to the number of unique inferences or abstractions that are performed on the input variables to produce the desired output result.

Increasing the number of hidden layers enables a trade-off between smoothness and closeness-of-fit: a greater quantity of hidden layers enables an ANN to improve its closeness-of-fit, while a smaller quantity improves the smoothness or extrapolation capabilities of the ANN. Several researchers have indicated that a single hidden layer architecture, with an arbitrarily large quantity of hidden nodes in the single layer, is capable of modeling any categorization mapping. On the other hand, two hidden layer networks outperform their single hidden layer counterparts for specific problems.

A heuristic for determining the quantity of hidden layers required by an ANN is as follows: "As the dimensionality of the problem space increases (higher order problems), the number of hidden layers should increase correspondingly." The number of hidden layers is heuristically set by determining the number of intermediate steps, dependent on previous categorizations, required to translate the input variables into an output value. Therefore, domain problems that have a standard nonlinear equation solution are solvable by a single hidden layer ANN.

B. Number of Nodes per Hidden Layer

When choosing the number of nodes to be contained in a hidden layer, there is a trade-off between training time and the accuracy of training. A greater number of hidden nodes results in longer (slower) training, while fewer hidden nodes provide shorter (faster) training, but at the cost of having fewer feature detectors. Too many hidden nodes in an ANN enable it to memorize the training data set, which produces poor generalization performance. Some of the heuristics used for selecting the quantity of hidden nodes for an ANN are:

- 75 percent of the quantity of input nodes,
- 50 percent of the quantity of input and output nodes combined, or
- 2n + 1 hidden layer nodes, where n is the number of nodes in the input layer.

These algorithmic heuristics do not utilize domain knowledge for estimating the quantity of hidden nodes and may be counterproductive. As with the knowledge acquisition and correlated-variable elimination heuristic for defining the optimal input node set, the number of decision factors (DFs) heuristically determines the optimal number of hidden units for an ANN. Knowledge acquisition or existing knowledge bases may be used to determine the DFs for a particular domain and consequently the hidden layer architecture and optimal quantity of hidden nodes. Decision factors are the separable elements that help to form the unique categories of the input vector space. The DFs are comparable to the collection of heuristic production rules used in an expert system.

An example of the DF design principle is provided by the NETTalk neural network research project. NETTalk has 203 input nodes representing seven textual characters, and 33 output units representing the phonetic notation of the spoken text words. The number of hidden units was varied from 0 to 120. NETTalk's output accuracy improved as the number of hidden units was increased from 0 to 120, but only a minimal improvement was observed between 60 and 120 hidden units. This indicates that the ideal quantity of DFs for the NETTalk problem was around 60; adding hidden units beyond 60 increased the training time but did not provide any appreciable difference in the ANN's performance. Several researchers have found that ANNs perform poorly until a sufficient number of hidden units is available to represent the correlations between the input vector and the desired output values; increasing the number of hidden units beyond that sufficient number increases training time without a corresponding increase in output accuracy.

Knowledge acquisition is necessary to determine the optimal input variable set to be used in an ANN system.
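The three rule-of-thumb estimates listed above can be computed directly. This is a minimal Python sketch; the function and key names are illustrative, not from the article.

```python
import math

def hidden_node_heuristics(n_inputs, n_outputs):
    """Three rule-of-thumb estimates for hidden-layer size.

    These heuristics ignore domain knowledge, which is why the text
    recommends counting decision factors (DFs) instead.
    """
    return {
        "75% of inputs": math.ceil(0.75 * n_inputs),
        "50% of inputs+outputs": math.ceil(0.5 * (n_inputs + n_outputs)),
        "2n+1": 2 * n_inputs + 1,
    }
```

Applied to NETTalk's dimensions (203 inputs, 33 outputs), the heuristics suggest 153, 118, and 407 hidden nodes, all far above the roughly 60 decision factors that the experiments indicated were sufficient, illustrating why such formulas can be counterproductive.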
During the knowledge acquisition phase, additional knowledge engineering can be performed to determine the DFs and subsequently the minimum number of hidden units required by the ANN architecture. The ANN designer must acquire the heuristic rules or clustering methods used by domain experts, similar to the knowledge that must be acquired during the knowledge acquisition process for expert systems. The number of heuristic rules or clusters used by domain experts is equivalent to the DFs used in the domain. Researchers have explored and shown techniques for automatically producing an ANN architecture with the exact number of hidden units required to model the DFs for
the problem space. The approach used by these automatic methods consists of three steps:

1. Initially create a neural network architecture with a very small or very large number of hidden units.
2. Train the network for some predetermined number of epochs.
3. Evaluate the error of the output nodes. If the error exceeds a set threshold value, then a hidden unit is added (when starting small) or deleted (when starting large), and the process is repeated until the error term is less than the threshold value.

Another method to automatically determine the optimum architecture is to use genetic algorithms to generate multiple ANN architectures and select the architectures with the best performance. Determining the optimum number of hidden units for an ANN application is a very complex problem, and an accurate method for automatically determining the DF quantity of hidden units without performing the corresponding knowledge acquisition remains a current research topic.

In this section, the heuristic architecture design principle of acquiring decision factors to determine the quantity of hidden nodes and the configuration of hidden layers has been presented. A number of hidden nodes equal to the number of DFs is required by an ANN to perform robustly in a domain and produce accurate results. This concept is similar to the principle of a minimum size input vector determined through knowledge acquisition presented in Section III. The knowledge acquisition process for ANN designers must capture the heuristic decision rules or clustering methods of domain experts. The DFs for a domain are equivalent to the heuristic decision rules used by domain experts. Further analysis of the DFs to determine the dimensionality of the problem space enables the knowledge engineer to configure the hidden nodes into the optimal number of hidden layers for efficient modeling of the problem space.
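The constructive (start-small) variant of the three-step procedure described above can be sketched as a simple loop. This is a minimal Python sketch under stated assumptions: `train` and `evaluate` are hypothetical placeholders for a real ANN toolkit, and the default limits and threshold are illustrative.

```python
def grow_architecture(train, evaluate, n_hidden=1, max_hidden=128,
                      epochs=50, threshold=0.05):
    """Sketch of the constructive automatic-architecture procedure.

    `train(n_hidden, epochs)` builds and trains a network with the given
    hidden-layer size (step 2); `evaluate(model)` returns its output error
    (step 3). Starting small, one hidden unit is added per iteration until
    the error falls below the threshold.
    """
    model = None
    while n_hidden <= max_hidden:
        model = train(n_hidden, epochs)   # step 2: train for fixed epochs
        error = evaluate(model)           # step 3: measure output error
        if error <= threshold:            # acceptable error: stop growing
            return n_hidden, model
        n_hidden += 1                     # otherwise add a hidden unit
    return n_hidden - 1, model
```

The pruning (start-large) variant would instead delete a unit per iteration; a genetic-algorithm search, as the text notes, explores many architectures in parallel rather than one at a time.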

VI. TRAINING SAMPLES SELECTION

Acquisition of training data has direct costs associated with the data themselves, and indirect costs because larger training sets require a larger quantity of training epochs to optimize the neural network's learning. The common belief is that the generalization performance of a neural network will increase when larger quantities of training samples are used to train it, especially for time-series applications of neural networks. Based on this belief, the neural network designer would have to acquire as much data as possible to ensure the optimal learning of a neural network.


A “rule of thumb” lower bound on the number of training examples required to train a backpropagation ANN is four times the number of weighted connections contained in the network. Therefore, if a training database contains only 100 training examples, the maximum size of the ANN is 25 connections, or approximately 10 nodes, depending on the ANN architecture. While the general heuristic of four times the number of connections is applicable to most classification problems, time-series problems, including the prediction of financial time series (e.g., stock values), are more dependent on business cycles. Recent research has conclusively shown that a maximum of 1 or 2 years of data is all that is required to produce optimal forecasting results for ANNs performing financial time-series prediction.

Another issue to be considered during training sample selection is how well the samples in the training set model the real world. If the training samples are skewed such that they cover only a small portion of the possible real-world instances that a neural network will be asked to classify or predict, then the neural network can only learn how to classify or predict results for this subset of the domain. Therefore, developers should take care to ensure that their training set samples have a distribution similar to that of the domain in which the neural network must operate.

Artificial neural network training sets should be representative of the population at large. This indicates that categorization-based ANNs require at least one example of each category to be classified and that the distribution of training data should approximate the distribution of the population at large. A small quantity of additional examples from each category will help to improve the generalization performance of the ANN.
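The two sizing rules just described, the four-times-connections lower bound and the requirement of at least one example per category, can be sketched as follows. The function names are illustrative, not from the article.

```python
import math

def min_training_examples(n_connections):
    """Rule-of-thumb lower bound: four times the weighted connections."""
    return 4 * n_connections

def min_examples_for_coverage(category_fractions):
    """Smallest sample size expected to contain at least one example of
    every category, given the population's category distribution.

    Driven by the rarest category: a 5% category needs roughly 1/0.05 = 20
    samples before one example of it is expected.
    """
    return math.ceil(1 / min(category_fractions))
```

A network with 25 weighted connections would thus call for at least 100 training examples, and a rarest-category share of 5% implies at least 20 examples.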
Thus, a categorization ANN trying to classify items into one of seven categories with distributions of 5, 10, 10, 15, 15, 20, and 25% would need a minimum of 20 training examples, but would benefit from having 40–100 training examples. Time-series domain problems are dependent on the distribution of the time series, with the neural network normally requiring one complete cycle of data. Again, recent research in financial time series has demonstrated that 1- and 2-year cycle times are prevalent, and thus the minimum required training data for a financial time-series ANN would be from 1 to 2 years of training examples.

Based on these more recent findings, we suggest that neural network developers use an iterative approach to training: starting with a small quantity of training data, train the neural network, then increase the quantity of samples in the training data set and repeat training until a decrease in performance occurs.

Development of optimal neural networks is a difficult and complex task. Limiting both the set of input variables to those that are thought to be predictive and the training
set size increases the probability of developing robust and highly accurate neural network models.

Most neural network models of financial time series are homogeneous. Homogeneous models utilize data from the specific time series being forecast or directly obtainable from that time series (e.g., a k-day trend or moving average). Heterogeneous models utilize information from outside the time series in addition to the time series itself. Homogeneous models rely on the predictive capabilities of the time series itself, corresponding to a technical analysis as opposed to a fundamental analysis.

Most neural network forecasting in the capital markets produces an output value that is the future price or exchange rate. Measuring the mean standard error of these neural networks may produce misleading evaluations of the neural networks' capabilities, since even very small errors that are incorrect in the direction of change will result in a capital loss. Instead of measuring the mean standard error of a forecast, some researchers argue that a better method for measuring the performance of neural networks is to analyze the direction of change. The direction of change is calculated by subtracting today's price from the forecast price and determining the sign (positive or negative) of the result. The percentage of correct direction-of-change forecasts is equivalent to the percentage of profitable trades enabled by the ANN system.

The effect of the quantity of training data on the quality of the neural network model's forecasting outputs has been called the “time-series (TS) recency effect.” The TS recency effect states that for time-series data, model construction data that are closer in time to the values to be forecast produce better forecasting models.
This effect is similar to the concept of a random walk model, which assumes future values are affected only by the previous time period's value, but the TS recency effect is able to use a wider range of proximal data for formulating the forecasts.

Requirements for training or modeling knowledge were investigated when building nonlinear financial time-series forecasting models with neural networks. Homogeneous neural network forecasting models were developed for trading the U.S. dollar against various other foreign currencies (i.e., dollar/pound, dollar/mark, dollar/yen). Various training sets were used, ranging from 22 years to 1 year of historic training data. The differences between the neural network models for a specific currency existed only in the quantity of training data used to develop each time-series forecasting model.

The researchers critically examined the qualitative effect of training set size on neural network foreign exchange rate forecasting models. Training data sets of up to 22 years of data were used to predict 1-day future spot rates for several nominal exchange rates. Multiple neural network forecasting models for each exchange rate were trained on
incrementally larger quantities of training data. The resulting outputs were used to empirically evaluate whether neural network exchange rate forecasting models achieve optimal performance in the presence of a critical amount of data used to train the network. Once this critical quantity of data is obtained, the addition of more training data does not improve and may, in fact, hinder the forecasting performance of the neural network model. For most exchange rate predictions, a maximum of 2 years of training data produces the best neural network forecasting model performance. Hence, this finding leads to the induction of the empirical hypothesis of a time-series recency effect.

The TS recency effect can be summarized in the following statement: “The use of data that are closer in time to the data that are to be forecast by the model produces a higher quality model.” The TS recency effect provides several direct benefits for both neural network researchers and developers:

- A new paradigm for choosing training samples for producing a time-series model
- Higher quality models, by having better forecasting performance through the use of smaller quantities of data
- Lower development costs for neural network time-series models, because fewer training data are required
- Less development time, because smaller training set sizes typically require fewer training iterations to accurately model the training data.

The time-series recency effect refutes existing heuristics and is a call to revise previous claims of longevity effects in financial time series. The empirical method used to evaluate and determine the critical quantity of training data for exchange rate forecasting can be generalized to other financial time series, indicating the generality of the TS recency effect. The TS recency effect offers an explanation as to why previous research efforts using neural network models have not surpassed the 60% prediction accuracy demonstrated as a realistic threshold by researchers. The difficulty is that most prior neural network research, in attempting to build what was then perceived to be the best possible forecasting model, used too much training data (typically 4–6 years), thus violating the TS recency effect by introducing data into the model that are not representative of the current time-series behavior.

Training, test, and general use data represent an important and recurring cost for information systems in general and neural networks in particular. Thus, if the 2-year training set produces the best performance and
represents the minimal quantity of data required to achieve this level of performance, then this minimal amount of data is all that should be used, to minimize the costs of neural network development and maintenance. For example, the Chicago Mercantile Exchange (CME) sells historical data on commodities (including currency exchange rates) at a cost of $100 per year per commodity. At this rate, using 1–2 years of data instead of the full 22 years provides an immediate data cost savings of $2000 to $2100 for producing the neural network models.

The only variation in the ANN models above was the quantity of data used to build them. It may be argued that certain years of training data contain noise and would thus adversely affect the forecasting performance of the neural network model. In such a case, the addition of more (older) error-free training data should compensate for the noise effects in the middle data, creating a U-shaped performance curve: the most recent data provide high performance, and the largest quantity of available data also provides high performance by drowning out the noise in the middle-time-frame samples.

The TS recency effect has been demonstrated for the three most widely traded currencies against the U.S. dollar. These results contradict current approaches, which state that as the quantity of training data used in constructing neural network models increases, the forecasting performance of the neural networks correspondingly improves. The results were tested for robustness by extending the research method to other foreign currencies. Three additional currencies were selected: the French franc, the Swiss franc, and the Italian lira. These three currencies were chosen to approximate the set of nominal currencies used in the previous study.
Results for the six different ANN models for each of the three new currencies show that the full 22-year training data set continues to be outperformed by either the 1- or 2-year training sets, with the exception of the French franc, for which the most recent and the largest training data sets have equivalent performance. The result that the 22-year data set cannot outperform the smaller 1- or 2-year training data sets provides further empirical evidence that a critical amount of training data, less than the full 22 years for the foreign exchange time series, produces optimal performance for neural network financial time-series models. The French franc ANN models, like those for the Japanese yen, have identical performance between the largest (22-year) data set and the smallest (1-year) data set. Because no increase in performance is provided through the use of additional data, economics dictates that the smaller 1-year set be used as the training paradigm for the French franc, producing a possible $2100 savings in data costs.


Additionally, the TS recency effect is supported by all three currencies; however, the Swiss franc achieves its maximum performance with 4 years of training data. The quality of the ANN outputs for the Swiss franc model continually increases as training data years are added, through the fourth year, then drops precipitously as additional data are added to the training set. The Swiss franc results still support the research goal of determining a critical training set size and the discovered TS recency effect. However, they indicate that validation tests should be performed individually for each financial time series to determine the minimum quantity of data required for producing the best forecasting performance.

While a significant amount of evidence has been acquired to support the TS recency effect for ANN models of foreign exchange rates, can the TS recency effect be generalized to apply to other financial time series? The knowledge that only a few years of data are necessary to construct neural network models with maximum forecasting performance would save neural network developers significant development time, effort, and costs. On the other hand, the dollar/Swiss franc ANNs described above indicate that a cutoff of 2 years of training data may not always be appropriate.

A method for determining the optimal training set size for financial time-series ANN models has been proposed. This method consists of the following steps:

1. Create a 1-year training set using the most recent data; determine an appropriate test set.
2. Train with the 1-year set and test (baseline); record performance.
3. Add 1 year of training data, the closest to the current training set.
4. Train with the newest training set, test on the original test set, and record performance.
5. If the performance of the newest training set is at least as good as the previous performance, then go to step 3; otherwise, use the previous training data set, which produced the best performance.
This is an iterative approach that starts with a single year of training data and continues to add additional years of training data until the trained neural network’s performance begins to decrease. In other words, the process continues to search for better training set sizes as long as the performance increases or remains the same. The optimal training set size is then set to be the smallest quantity of training data to achieve the best forecasting performance.
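The iterative search just described can be sketched as a short loop. This is a minimal Python sketch: `train_and_test` is a hypothetical placeholder that trains an ANN on the most recent given number of years and returns its test-set performance (higher is better).

```python
def optimal_training_years(train_and_test, max_years=22):
    """Find the smallest training set size with the best performance.

    Implements the five-step method above: start with 1 year (baseline),
    add a year at a time, and stop at the first drop in performance. Ties
    keep the smaller set, so the result is the smallest quantity of data
    achieving the best observed performance.
    """
    best_years, best_perf = 1, train_and_test(1)   # steps 1-2: baseline
    for years in range(2, max_years + 1):          # step 3: add one year
        perf = train_and_test(years)               # step 4: retrain, retest
        if perf < best_perf:                       # step 5: stop on decline
            break
        if perf > best_perf:                       # strictly better: keep
            best_years, best_perf = years, perf
    return best_years
```

With a performance profile that rises, plateaus, then declines, the loop returns the smallest set on the plateau, matching the method's stated goal.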

Because the described method is a result of the empirical evidence acquired using foreign exchange rates, it stands to reason that testing it on additional neural network foreign exchange rate forecasting models would merely continue to validate it. Therefore, three new financial time series were used to demonstrate the robustness of the specified method: the DJIA stock index closing values, the closing price of the individual DIS (Walt Disney Co.) stock, and the CAC-40 French stock index closing values. Data samples from January 1977 to August 1999, to simulate the 22 years of data used in the foreign exchange neural network training, were used for the DJIA and DIS time series, and data values from August 1988 to May 1999 were used for the CAC-40 index.

Following the method discussed above, three backpropagation ANNs, one for each of the three time series, were trained on the 1998 data set and tested a single time on the 1999 data values (164 cases for the DJIA and DIS; 123 cases for the CAC-40). Then a single year was added to the training set, and a new ANN model was trained and tested a single time, with the process repeated until a decrease in forecasting performance occurred. An additional 3 years of training data, in 1-year increments, were added to the training sets and evaluated to strengthen the conclusion that the optimal training set size had been acquired. A final test of the usefulness of the generalized method for determining minimum optimal training set sizes was performed by training similar neural network models on the full 22-year training set for the DJIA index and DIS stock ANNs, and on the 10-year training set for all networks, which was the maximum data quantity available for the CAC-40. Each of the ANNs trained on the “largest” training sets was then tested on the 1999 test data set to evaluate its forecasting performance.
For both the DJIA and the DIS stock, the 1-year training data set was immediately identified as the best size for a training data set as soon as the ANN trained on the 2-year data set was tested. The CAC-40 ANN forecasting model, however, achieved its best performance with a 2-year training data set. While the forecasting accuracy for these three new financial time series did not reach the 60% achieved by many of the foreign exchange forecasting ANNs, it did support the generalized method for determining minimum necessary training data sets and consequently lends support to the time-series recency effect. Once the best performing minimum training set was identified by the generalized method, no other ANN model trained on a larger training set was able to outperform it.

The results for the DIS stock value are slightly better. Conclusions were that the ANN model, which used approximately 4 years of training data, emulated a simple efficient market. A random walk model of the DIS stock produced a 50% prediction accuracy, so the DIS artificial neural network forecasting model did outperform the random walk model, but not by a statistically significant amount. An improvement to the ANN model for predicting stock price changes may be achieved by following the generalized method for determining the best size training set and reducing the overall quantity of training data, thus limiting the effect of nonrelevant data.

Again, as an alternative evaluation mechanism, a simulation was run with the CAC-40 stock index data. A starting value of $10,000, with sufficient funds and/or credit to enable a position on 100 index options contracts, was assumed. Options are purchased or sold consistent with the ANN forecasts for the direction of change in the CAC-40 index, and all options contracts are sold at the end of the year-long simulation. The 2-year training data set model produces a net gain of $16,790, while the full 10-year training data set produces a net loss of $15,010. The simulation results yield a net difference between the TS recency effect model (2 years) and the greatest-quantity heuristic model (10 years) of $31,800, or roughly three times the size of the initial investment.
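The direction-of-change measure that drives evaluations like the simulation above can be computed directly from a forecast series. This is a minimal Python sketch; the function name and data layout are illustrative assumptions, not from the article.

```python
def direction_of_change_accuracy(prices, forecasts):
    """Fraction of forecasts whose sign of change matches the realized one.

    `prices[t]` is the price on day t and `forecasts[t]` is the forecast
    made on day t for day t+1. The direction of change is the sign of
    (forecast - today's price), compared against the realized sign of
    (tomorrow's price - today's price). Per the text, the fraction of
    correct direction-of-change forecasts equals the fraction of
    profitable trades the model enables.
    """
    n = len(prices) - 1
    hits = 0
    for t in range(n):
        forecast_dir = forecasts[t] - prices[t]
        actual_dir = prices[t + 1] - prices[t]
        if forecast_dir * actual_dir > 0:  # same sign: correct call
            hits += 1
    return hits / n
```

For example, four daily prices and three one-day-ahead forecasts, two of which call the direction correctly, give an accuracy of 2/3.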

VII. CONCLUSIONS

General guidelines for the development of artificial neural networks are few, so this article presents several heuristics for developing ANNs that produce optimal generalization performance. Extensive knowledge acquisition is the key to the design of ANNs. First, the correct input vector for the ANN must be determined by capturing all relevant decision criteria used by domain experts for solving the domain problem to be modeled by the ANN and by eliminating correlated variables. Second, the selection of a learning method is an open problem, and an appropriate learning method can be selected by examining the set of constraints imposed by the collection of training examples available for training the ANN. Third, the architecture of the hidden layers is determined by further analyzing a domain expert's clustering of the input variables or heuristic rules for producing an output value from the input variables. The collection of clustering/decision heuristics used by the domain expert has been called the set of decision factors (DFs). The quantity of DFs is equivalent to the minimum number of hidden units required by an ANN to correctly represent the problem space of the domain.

Use of the knowledge-based design heuristics enables an ANN designer to build a minimum size ANN that is
capable of robustly dealing with specific domain problems. The future may hold automatic methods for determining the optimum configuration of the hidden layers for ANNs. Minimum size ANN configurations guarantee optimal results with the minimum amount of training time.

Finally, a new time-series model effect, termed the time-series recency effect, has been described and demonstrated to work consistently across six different currency exchange time-series ANN models. The TS recency effect claims that model building data that are nearer in time to the out-of-sample values to be forecast produce more accurate forecasting models. The empirical results discussed in this article show that frequently a smaller quantity of training data will produce a better performing backpropagation neural network model of a financial time series. Research indicates that for financial time series, 2 years of training data are frequently all that is required to produce optimal forecasting accuracy. Results from the Swiss franc models alert the neural network researcher that the TS recency effect may extend beyond 2 years. A generalized method is presented for determining the minimum training set size that produces the best forecasting performance. Neural network researchers and developers using this generalized method will be able to implement artificial neural networks with the highest forecasting performance at the least cost.

Future research can continue to provide evidence for the TS recency effect by examining the effect of training set size for additional financial time series (e.g., other stocks, commodities, or index values). The TS recency effect may not be limited to financial time series; evidence from nonfinancial time-series neural network implementations already indicates that smaller quantities of more recent modeling data are capable of producing high-performance forecasting models.
Additionally, the TS recency effect has been demonstrated with neural network models trained using backpropagation. The effect is believed to hold for all supervised-learning neural network training algorithms (e.g., radial basis function, fuzzy ARTMAP, probabilistic) and is therefore a general principle for time-series modeling, not one restricted to backpropagation neural network models.

In conclusion, it has been noted that ANN systems incur a cost for training data, a cost that is not only financial but also affects development time and effort. Empirical evidence demonstrates that frequently only 1 or 2 years of training data will produce the "best" performing backpropagation-trained neural network forecasting models. The proposed method for identifying the minimum necessary training set size for optimal performance enables neural network researchers and implementers to

develop the highest quality financial time-series forecasting models in the shortest amount of time and at the lowest cost.

The set of general guidelines for designing ANNs can therefore be summarized as follows:

1. Perform extensive knowledge acquisition. Target the knowledge acquisition at identifying the domain information required for solving the problem and the decision factors used by domain experts in solving the type of problem to be modeled by the ANN.

2. Remove noise variables. Identify highly correlated variables via a Pearson correlation matrix or chi-square test, and keep only one variable from each correlated group. Identify and remove noncontributing variables, depending on data distribution and type, via discriminant/factor analysis or stepwise regression.

3. Select an ANN learning method based on the demographic features of the data and the decision problem. If supervised learning methods are applicable, implement backpropagation in addition to any other method indicated by the data demographics (e.g., radial basis function for small training sets, or counterpropagation for very noisy training data).

4. Determine the amount of training data. For time series, follow the methodology described in Section VI; for classification problems, use four times the number of weighted connections.

5. Determine the number of hidden layers. Analyze the complexity, and the number of unique steps, of the traditional expert decision-making solution. If in doubt, use a single hidden layer, but realize that additional nodes may be required to adequately model the domain problem.

6. Set the quantity of hidden nodes in the last hidden layer equal to the number of decision factors used by domain experts to solve the problem, drawing on the knowledge acquired during step 1.
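Guideline 2 can be sketched in a few lines: compute a Pearson correlation matrix over the candidate input variables and keep only one variable from each highly correlated pair. The 0.9 cutoff and the synthetic columns below are assumptions made for the example, not values prescribed by the article.

```python
import numpy as np

def drop_correlated(data, names, threshold=0.9):
    """Return the variable names that survive pairwise Pearson screening."""
    corr = np.corrcoef(data, rowvar=False)
    keep = []
    for j, name in enumerate(names):
        # Keep this column only if it is not highly correlated
        # with any column already kept.
        if all(abs(corr[j, names.index(k)]) < threshold for k in keep):
            keep.append(name)
    return keep

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 * 2.0 + rng.normal(scale=0.01, size=200)   # near-duplicate of x1
x3 = rng.normal(size=200)                          # independent variable
data = np.column_stack([x1, x2, x3])
print(drop_correlated(data, ["x1", "x2", "x3"]))   # → ['x1', 'x3']
```

Noncontributing variables (guideline 2's second step) would then be screened with discriminant/factor analysis or stepwise regression, which this sketch does not cover.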

SEE ALSO THE FOLLOWING ARTICLES

ARTIFICIAL INTELLIGENCE • COMPUTER NETWORKS • EVOLUTIONARY ALGORITHMS AND METAHEURISTICS

BIBLIOGRAPHY

Bansal, A., Kauffman, R. J., and Weitz, R. R. (1993). "Comparing the modeling performance of regression and neural networks as data quality varies: A business value approach," J. Management Infor. Syst. 10(1), 11–32.
Barnard, E., and Wessels, L. (1992). "Extrapolation and interpolation in neural network classifiers," IEEE Control Syst. 12(5), 50–53.
Carpenter, G. A., and Grossberg, S. (1988). "The ART of adaptive pattern recognition by a self-organizing neural network," Computer 21(3), 77–88.
Carpenter, G. A., Grossberg, S., Markuzon, N., and Reynolds, J. H. (1992). "Fuzzy ARTMAP: A neural network architecture for incremental learning of analog multidimensional maps," IEEE Trans. Neural Networks 3(5), 698–712.
Dayhoff, J. (1990). "Neural Network Architectures: An Introduction," Van Nostrand Reinhold, New York.
Fu, L. (1996). "Neural Networks in Computer Intelligence," McGraw-Hill, New York.
Gately, E. (1996). "Neural Networks for Financial Forecasting," Wiley, New York.
Hammerstrom, D. (1993). "Neural networks at work," IEEE Spectrum 30(6), 26–32.
Haykin, S. (1994). "Neural Networks: A Comprehensive Foundation," Macmillan, New York.
Hecht-Nielsen, R. (1988). "Applications of counterpropagation networks," Neural Networks 1, 131–139.
Hertz, J., Krogh, A., and Palmer, R. (1991). "Introduction to the Theory of Neural Computation," Addison-Wesley, Reading, MA.
Hopfield, J. J., and Tank, D. W. (1986). "Computing with neural circuits: A model," Science 233(4764), 625–633.
Hornik, K., Stinchcombe, M., and White, H. (1989). "Multilayer feedforward networks are universal approximators," Neural Networks 2(5), 359–366.
Kohonen, T. (1988). "Self-Organization and Associative Memory," Springer-Verlag, Berlin.
Li, E. Y. (1994). "Artificial neural networks and their business applications," Infor. Management 27(5), 303–313.
Medsker, L., and Liebowitz, J. (1994). "Design and Development of Expert Systems and Neural Networks," Macmillan, New York.
Mehra, P., and Wah, B. W. (19xx). "Artificial Neural Networks: Concepts and Theory," IEEE, New York.
Moody, J., and Darken, C. J. (1989). "Fast learning in networks of locally tuned processing elements," Neural Comput. 1(2), 281–294.
Smith, M. (1993). "Neural Networks for Statistical Modeling," Van Nostrand Reinhold, New York.
Specht, D. F. (1991). "A general regression neural network," IEEE Trans. Neural Networks 2(6), 568–576.
White, H. (1990). "Connectionist nonparametric regression: Multilayer feedforward networks can learn arbitrary mappings," Neural Networks 3(5), 535–549.
Widrow, B., Rumelhart, D. E., and Lehr, M. A. (1994). "Neural networks: Applications in industry, business and science," Commun. ACM 37(3), 93–105.