### 1. Introduction

The goal of machine learning is to create systems that can improve their performance at some task as they acquire experience or data. In many natural learning tasks, this experience or data is gained interactively, by taking actions, making queries, or doing experiments. Most machine learning research, however, treats the learner as a passive recipient of data to be processed. This \passive" approach ignores the fact that, in many situations, the learner's most powerful tool is its ability to act, to gather data, and to in uence the world it is trying to understand. Active learning is the study of how to use this ability evectively. Formally, active learning studies the closed-loop phenomenon of a learner selecting actions or making queries that in uence what data are added to its training set. Examples include selecting joint angles or torques to learn the kinematics or dynamics of a robot arm, selecting locations for sensor measurements to identify and locate buried hazardous wastes, or querying a human expert to classify an unknown word in a natural language understanding problem. When actions/queries are selected properly, the data requirements for some problems decrease drastically, and some NP-complete learning problems become polynomial in computation time (Angluin, 1988; Baum & Lang, 1991). In practice, active learning oers its greatest rewards in situations where data are expensive or dicult to obtain, or when the environment is complex or dangerous. In industrial settings each training point may take days to gather and cost thousands of dollars; a method for optimally selecting these points could oer enormous savings in time and money.

There are a number of different goals which one may wish to achieve using active learning. One is optimization, where the learner performs experiments to nd a set of inputs that maximize some response variable. An example of the optimization problem would be nding the operating parameters that maximize the output of a steel mill or candy factory. There is an extensive literature on optimization, examining both cases where the learner has some prior knowledge of the parameterized functional form and cases where the learner has no such knowledge; the latter case is generally of greater interest to machine learning practitioners. The favored technique for this kind of optimization is usually a form of response surface methodology (Box & Draper, 1987), which performs experiments that guide hill-climbing through the input space. A related problem exists in the eld of adaptive control, where one must learn a control policy by taking actions. In control problems, one faces the complication that the value of a specic action may not be known until many time steps after it is taken. Also, in control (as in optimization), one is usually concerned with the performing well during the learning task and must trade of exploitation of the current policy for exploration which may improve it. The subeld of dual control (Fe'ldbaum, 1965) is specically concerned with nding an optimal balance of exploration and control while learning.

In active learning situations, the learner itself is responsible for acquiring the training set. Here, we assume it can iteratively select a new input ~x (possibly from a constrained set), observe the resulting output ~y, and incorporate the new example (~x; y~) into its training set. This contrasts with related work by Plutowski and White (1993), which is concerned with ltering an existing data set. In our case, ~x may be thought of as a query, experiment, or action, depending on the research eld and problem domain. The question we will be concerned with is how to choose which ~x to try next. There are many heuristics for choosing ~x, including choosing places where we don't have data (Whitehead, 1991), where we perform poorly (Linden & Weber, 1993), where we have low condence (Thrun & Moller, 1992), where we expect it to change our model (Cohn, Atlas, & Ladner, 1990, 1994), and where we previously found data that resulted in learning (Schmidhuber & Storck, 1993). In this paper we will consider how one may select ~x in a statistically \optimal" manner for some classes of machine learning algorithms. We rst brie y review how the statistical approach can be applied to neural networks, as described in earlier work (MacKay, 1992; Cohn, 1994). Then, in Sections 3 and 4 we consider two alternative, statistically-based learning architectures: mixtures of Gaussians and locally weighted regression. Section 5 presents the empirical results of applying statistically-based active learning to these architectures. While optimal data selection for a neural network is computationally expensive and approximate, we nd that optimal data selection for the two statistical models is ecient and accurate.