### 1 INTRODUCTION

It is well known that a combination of many different predictors can improve predictions. In the neural networks community, "ensembles" of neural networks have been investigated by several authors. Most often the networks in the ensemble are trained individually and then their predictions are combined. This combination is usually done by majority vote (in classification) or by simple averaging (in regression), but one can also use a weighted combination of the networks.

At the workshop after the last NIPS conference (December 1993), an entire session was devoted to ensembles of neural networks ("Putting it all together", chaired by Michael Perrone). Many interesting papers were given, and the session showed that this area is getting a lot of attention.

A combination of the outputs of several networks (or other predictors) is only useful if they disagree on some inputs. Clearly, there is no more information to be gained from a million identical networks than there is from just one of them (see also [2]). By quantifying the disagreement in the ensemble, it turns out to be possible to state this insight rigorously for an ensemble used for approximation of real-valued functions (regression). The simple and beautiful expression that relates the disagreement (called the ensemble ambiguity) and the generalization error is the basis for this paper, so we will derive it with no further delay.

### 2 THE BIAS-VARIANCE TRADEOFF

The first term on the right is the weighted average of the generalization errors of the individual networks ($\bar{E} = \sum_\alpha w_\alpha E^\alpha$), and the second is the weighted average of the ambiguities ($\bar{A} = \sum_\alpha w_\alpha A^\alpha$), which we refer to as the ensemble ambiguity. The beauty of this equation is that it separates the generalization error into a term that depends on the generalization errors of the individual networks and another term that contains all correlations between the networks. Furthermore, the correlation term $\bar{A}$ can be estimated entirely from unlabeled data, i.e., no knowledge is required of the real function to be approximated. The term "unlabeled example" is borrowed from classification problems, and in this context it means an input $x$ for which the value of the target function $f(x)$ is unknown.

Equation (10) expresses the tradeoff between bias and variance in the ensemble, but in a different way than the common bias-variance relation [4], in which the averages are over possible training sets instead of ensemble averages. If the ensemble is strongly biased, the ambiguity will be small, because the networks implement very similar functions and thus agree on inputs even outside the training set. Therefore the generalization error will be essentially equal to the weighted average of the generalization errors of the individual networks. If, on the other hand, there is a large variance, the ambiguity is high, and in this case the generalization error will be smaller than the average generalization error.
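The relation between the two terms can be checked numerically. The following is a minimal sketch, assuming that equation (10) has the form $E = \bar{E} - \bar{A}$ (as the description of its two terms suggests), that errors are squared errors averaged over a sample of inputs, and that the weights sum to one. The function names and the toy "networks" (a square-wave target plus Gaussian noise) are hypothetical and serve only as an illustration; they are not the networks used in the paper. Note that `ensemble_ambiguity` takes no targets, which reflects the statement that the ambiguity can be estimated from unlabeled inputs.

```python
import random


def ensemble_output(outputs, weights):
    """Weighted average of the individual outputs on one input."""
    return sum(w * v for w, v in zip(weights, outputs))


def ensemble_ambiguity(preds, weights):
    """Weighted ensemble ambiguity, averaged over inputs.

    preds[a][i] is the output of predictor a on input i.
    No target values are needed.
    """
    n_inputs = len(preds[0])
    total = 0.0
    for i in range(n_inputs):
        outputs = [p[i] for p in preds]
        v_bar = ensemble_output(outputs, weights)
        total += sum(w * (v - v_bar) ** 2 for w, v in zip(weights, outputs))
    return total / n_inputs


def generalization_errors(preds, weights, targets):
    """Return (E, E_bar): the squared error of the ensemble and the
    weighted average of the individual squared errors."""
    n_inputs = len(targets)
    e_ens = 0.0
    e_avg = 0.0
    for i, f_x in enumerate(targets):
        outputs = [p[i] for p in preds]
        v_bar = ensemble_output(outputs, weights)
        e_ens += (f_x - v_bar) ** 2
        e_avg += sum(w * (f_x - v) ** 2 for w, v in zip(weights, outputs))
    return e_ens / n_inputs, e_avg / n_inputs


# Toy stand-ins for trained networks (an assumption for illustration only):
# each "network" is the target function plus independent Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
targets = [1.0 if x > 0 else -1.0 for x in xs]
weights = [1.0 / 5] * 5
preds = [[t + rng.gauss(0.0, 0.3) for t in targets] for _ in range(5)]

A_bar = ensemble_ambiguity(preds, weights)            # uses inputs only
E, E_bar = generalization_errors(preds, weights, targets)
print(E, E_bar - A_bar)  # the two values agree up to rounding error
```

Because the ambiguity is a weighted sum of squares, it is never negative, so under this form of (10) the ensemble error cannot exceed the weighted average of the individual errors.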

### 3 THE CROSS-VALIDATION ENSEMBLE

From (10) it is obvious that increasing the ambiguity (while not increasing individual generalization errors) will improve the overall generalization. We want the networks to disagree! How can we increase the ambiguity of the ensemble? One way is to use different types of approximators, like a mixture of neural networks of different topologies or a mixture of completely different types of approximators.

Another obvious way is to train the networks on different training sets. Furthermore, to be able to estimate the first term in (10), it would be desirable to have some kind of cross-validation. This suggests the following strategy.

Choose a number $K \le p$. For each network in the ensemble, hold out $K$ examples for testing, where the $N$ test sets should have minimal overlap, i.e., the $N$ training sets should be as different as possible. If, for instance, $K \le p/N$, it is possible to choose the test sets with no overlap. This enables us to estimate the generalization error $E^\alpha$ of the individual members of the ensemble, and at the same time make sure that the ambiguity increases. When holding out examples, the generalization errors of the individual members of the ensemble, $E^\alpha$, will increase, but the conjecture is that for a good choice of the size of the ensemble ($N$) and the test set size ($K$), the ambiguity will increase more, and thus one will get a decrease in overall generalization error.

This conjecture has been tested experimentally on a simple square wave function of one variable, shown in Figure 1. Five identical feed-forward networks with one hidden layer of 20 units were trained independently by back-propagation using 200 random examples. For each network, a cross-validation set of $K$ examples was held out for testing as described above. The "true" generalization and the ambiguity were estimated from a set of 1000 random inputs. The weights were uniform, $w_\alpha = 1/5$ (non-uniform weights are addressed later).
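The hold-out strategy described above can be sketched as follows for the non-overlapping case $K \le p/N$. This is only an illustration under stated assumptions: the function name `holdout_splits`, the seed, and the choice $K = 40$ in the usage line are hypothetical, and the sketch does not cover the case $K > p/N$, where test sets must overlap and the overlap should be kept minimal.

```python
import random


def holdout_splits(p, n_networks, k, seed=0):
    """Return one (train_indices, test_indices) pair per network.

    The test sets are pairwise disjoint, which is only possible when
    k * n_networks <= p; otherwise a ValueError is raised.
    """
    if k * n_networks > p:
        raise ValueError("disjoint test sets require k <= p / n_networks")
    rng = random.Random(seed)
    indices = list(range(p))
    rng.shuffle(indices)  # random assignment of examples to test sets
    splits = []
    for a in range(n_networks):
        test = indices[a * k:(a + 1) * k]
        held_out = set(test)
        # Each network trains on everything outside its own test set.
        train = [i for i in indices if i not in held_out]
        splits.append((train, test))
    return splits


# Usage with the sizes from the experiment (p = 200 examples, N = 5
# networks); K = 40 is an assumed value chosen here for illustration.
splits = holdout_splits(p=200, n_networks=5, k=40)
for train, test in splits:
    print(len(train), len(test))
```

Each network would then be trained on its own `train` indices, and its individual generalization error $E^\alpha$ estimated on its own `test` indices.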