
1 Introduction

Much recent work in machine learning has focused on learning good feature representations from unlabeled input data for higher-level tasks such as classification. Current solutions typically learn multi-level representations by greedily "pre-training" several layers of features, one layer at a time, using an unsupervised learning algorithm [11, 8, 18]. For each of these layers a number of design parameters must be chosen: the number of features to learn, the locations where these features will be computed, and how to encode the inputs and outputs of the system. In this paper we study the effect of these choices on single-layer networks trained by several feature learning methods. Our results demonstrate that several key ingredients, orthogonal to the learning algorithm itself, can have a large impact on performance: whitening, large numbers of features, and dense feature extraction can all be major advantages. Even with very simple algorithms and a single layer of features, it is possible to achieve state-of-the-art performance by focusing effort on these choices rather than on the learning system itself.

A major drawback of many feature learning systems is their complexity and expense. In addition, many algorithms require careful selection of multiple hyperparameters, such as learning rates, momentum, sparsity penalties, and weight decay, which must be chosen through cross-validation, increasing running times dramatically. Though it is true that recently introduced algorithms have consistently shown improvements on benchmark datasets like NORB [16] and CIFAR-10 [13], several other factors affect the final performance of a feature learning system. Specifically, there are many "meta-parameters" defining the network architecture, such as the receptive field size and the number of hidden nodes (features). In practice, these parameters are often determined by computational constraints; for instance, we might use the largest number of features possible given the running time of the algorithm. In this paper, however, we pursue an alternative strategy: we employ very simple learning algorithms and then more carefully choose the network parameters in search of higher performance. If (as is often the case) larger representations perform better, then we can leverage the speed and simplicity of these learning algorithms to use larger representations.

To this end, we begin in Section 3 by describing a simple feature learning framework that incorporates an unsupervised learning algorithm as a "black box" module. For this "black box", we have implemented several off-the-shelf unsupervised learning algorithms: sparse auto-encoders, sparse RBMs, K-means clustering, and Gaussian mixture models. We then analyze the performance impact of several different elements of the feature learning framework, including: (i) whitening, a common pre-processing step in deep learning work, (ii) the number of features trained, (iii) the step size (stride) between extracted features, and (iv) the receptive field size.
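To make the whitening ingredient concrete, the following is a minimal sketch of ZCA whitening applied to a matrix of flattened image patches. It is not code from the paper; the function name `zca_whiten` and the regularization constant `eps` are illustrative choices, and the exact constants used in the experiments may differ.

```python
import numpy as np

def zca_whiten(patches, eps=0.1):
    """Minimal ZCA whitening sketch for flattened image patches.

    patches : (num_patches, patch_dim) array, one flattened patch per row.
    eps     : small value added to the eigenvalues for numerical stability
              (an illustrative default, not the paper's exact setting).
    """
    # Center each input dimension.
    mean = patches.mean(axis=0)
    X = patches - mean

    # Eigendecomposition of the covariance matrix of the patches.
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)

    # ZCA transform: rotate into the eigenbasis, rescale, rotate back.
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return X @ W, mean, W
```

The returned mean and transform `W` would then be reused to whiten new patches at feature-extraction time with the same statistics learned from the training patches.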

2 Related work

Since the introduction of unsupervised pre-training [8], many new schemes for stacking layers of features to build "deep" representations have been proposed. Most have focused on creating new training algorithms to build single-layer models that are composed to build deeper structures. Among the algorithms considered in the literature are sparse coding [22, 17, 32], RBMs [8, 13], sparse RBMs [18], sparse auto-encoders [7, 25], denoising auto-encoders [30], "factored" [24] and mean-covariance [23] RBMs, as well as many others [19, 33]. Thus, amongst the many components of feature learning architectures, the unsupervised learning module appears to be the most heavily scrutinized.

Some work, however, has considered the impact of other choices in these feature learning systems, especially the choice of network architecture. Jarrett et al., for instance, have considered the impact of changes to the "pooling" strategies frequently employed between layers of features, as well as different forms of normalization and rectification between layers. Similarly, Boureau et al. have considered the impact of coding strategies and different types of pooling, both in practice [3] and in theory [4]. Our work follows in this vein, but considers instead the structure of single-layer networks, before pooling, and orthogonal to the choice of algorithm or coding scheme.

Many common threads from the computer vision literature also relate to our work and to feature learning more broadly. For instance, we will use the K-means clustering algorithm as an alternative unsupervised learning module. K-means has been used less widely in "deep learning" work but has enjoyed wide adoption in computer vision for building codebooks of "visual words", which are used to define higher-level image features. This method has also been applied recursively to build multiple layers of features.

3 Unsupervised feature learning framework

In this section, we describe a common framework used for feature learning. For concreteness, we will focus on the application of these algorithms to learning features from images, though our approach is applicable to other forms of data as well. The framework we use involves several stages and is similar to those employed in computer vision [5, 15, 31, 28, 1], as well as other feature learning work [16, 19, 3].

At a high level, our system performs the following steps to learn a feature representation (a code sketch of these steps follows the list):

1. Extract random patches from unlabeled training images.
2. Apply a pre-processing stage to the patches.
3. Learn a feature-mapping using an unsupervised learning algorithm.
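The sketch below illustrates the three steps with plain K-means as the "black box" learner, which is one of the modules studied in the paper. It is an illustrative reconstruction rather than the paper's code: function names, the patch width `w = 6`, the number of patches, and the number of centroids `k = 400` are assumed defaults, and the pre-processing shown is simple per-patch normalization (whitening, as sketched earlier, would typically follow it).

```python
import numpy as np

def extract_random_patches(images, num_patches=100000, w=6):
    """Step 1: sample random w-by-w patches from unlabeled images.

    images : (n, H, W, C) array of unlabeled training images.
    Returns a (num_patches, w*w*C) matrix, one flattened patch per row.
    """
    n, H, W, C = images.shape
    patches = np.empty((num_patches, w * w * C))
    for i in range(num_patches):
        img = images[np.random.randint(n)]
        r = np.random.randint(H - w + 1)
        c = np.random.randint(W - w + 1)
        patches[i] = img[r:r + w, c:c + w, :].ravel()
    return patches

def preprocess(patches, eps_norm=10.0):
    """Step 2: per-patch brightness and contrast normalization."""
    patches = patches - patches.mean(axis=1, keepdims=True)
    patches = patches / np.sqrt(patches.var(axis=1, keepdims=True) + eps_norm)
    return patches

def learn_kmeans_dictionary(patches, k=400, iters=10):
    """Step 3: learn a feature mapping; here the 'black box' is K-means.

    Returns k centroids; a patch can later be encoded by its assignment
    to these centroids.
    """
    # Initialize centroids from randomly chosen patches.
    centroids = patches[np.random.choice(len(patches), k, replace=False)]
    for _ in range(iters):
        # Nearest centroid per patch; the ||x||^2 term is constant per patch
        # and can be dropped for the argmin.
        d = (centroids ** 2).sum(axis=1) - 2.0 * patches @ centroids.T
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned patches.
        for j in range(k):
            members = patches[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids
```

A typical usage would be `centroids = learn_kmeans_dictionary(preprocess(extract_random_patches(images)))`, after which the learned dictionary defines the feature mapping applied to new images.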

4 Conclusion

In this paper we have conducted extensive experiments on the CIFAR-10 dataset using multiple unsupervised feature learning algorithms to characterize the effect of various parameters on classification performance. While confirming the basic finding that more features and dense extraction are useful, we have shown, more importantly, that these elements can be as important as the unsupervised learning algorithm itself. Surprisingly, we have shown that even the K-means clustering algorithm, an extremely simple learning algorithm with no parameters to tune, is able to achieve state-of-the-art performance on both the CIFAR-10 and NORB datasets when used with the network parameters identified in this work. We have also shown, more generally, that smaller strides and larger numbers of features yield monotonically improving performance, which suggests that while more complex algorithms may have greater representational power, simple but fast algorithms can be highly competitive.
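To illustrate what "dense extraction" and "stride" mean operationally, the following sketch slides a receptive field over an image and encodes each patch against a learned dictionary. It is an assumption-laden illustration, not the paper's extractor: the hard one-hot encoding shown here stands in for the encoders evaluated in the full paper, and in practice each patch would first be normalized and whitened with the same transforms used during dictionary learning.

```python
import numpy as np

def extract_features_dense(image, centroids, w=6, stride=1):
    """Encode every w-by-w window of the image, stepping by `stride`.

    image     : (H, W, C) array.
    centroids : (k, w*w*C) dictionary, e.g. from the earlier K-means sketch.
    Returns a (rows, cols, k) feature map.
    """
    H, W, C = image.shape
    k = centroids.shape[0]
    rows = (H - w) // stride + 1
    cols = (W - w) // stride + 1
    fmap = np.zeros((rows, cols, k))
    for i in range(rows):
        for j in range(cols):
            patch = image[i * stride:i * stride + w,
                          j * stride:j * stride + w, :].ravel()
            # Illustrative hard assignment: activate the nearest centroid.
            d = ((centroids - patch) ** 2).sum(axis=1)
            fmap[i, j, d.argmin()] = 1.0
    return fmap
```

A smaller `stride` yields more feature locations per image (denser extraction) at a higher computational cost, which is the trade-off behind the conclusion that smaller strides monotonically improve performance.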


To read the full article, visit: http://proceedings.mlr.press/v15/coates11a/coates11a.pdf