1. Introduction
With the rise of social media such as blogs and social networks, reviews, ratings and recommendations are rapidly proliferating; being able to automatically filter them is a key current challenge for businesses looking to sell their wares and identify new market opportunities. This has created a surge of research in sentiment classification (or sentiment analysis), which aims to determine the judgment of a writer with respect to a given topic based on a given textual comment. Sentiment analysis is now a mature machine learning research topic, as illustrated by the review of Pang and Lee (2008). Applications to many different domains have been presented, ranging from movie reviews (Pang et al., 2002) and congressional floor debates (Thomas et al., 2006) to product recommendations (Snyder and Barzilay, 2007; Blitzer et al., 2007).

This large variety of data sources makes it difficult and costly to design a robust sentiment classifier. Indeed, reviews deal with various kinds of products or services for which vocabularies differ. For instance, consider the simple case of training a system to analyze reviews about only two sorts of products: kitchen appliances and DVDs. One set of reviews would contain adjectives such as “malfunctioning”, “reliable” or “sturdy”, and the other “thrilling”, “horrific” or “hilarious”. Data distributions therefore differ across domains. One solution could be to learn a different system for each domain, but this would imply a huge cost to annotate training data for a large number of domains and would prevent us from exploiting the information shared across domains. An alternative strategy, evaluated here, consists in learning a single system from the set of domains for which labeled and unlabeled data are available, and then applying it to any target domain (labeled or unlabeled). This only makes sense if the system is able to discover intermediate abstractions that are shared and meaningful across domains. This problem of training and testing models on different distributions is known as domain adaptation (Daumé III and Marcu, 2006).

In this paper, we propose a Deep Learning approach to the problem of domain adaptation of sentiment classifiers. The promising new area of Deep Learning has emerged recently; see (Bengio, 2009) for a review. Deep Learning is based on algorithms for discovering intermediate representations built in a hierarchical manner, and it relies on the discovery that unsupervised learning can be used to set each level of a hierarchy of features, one level at a time, based on the features discovered at the previous level. These features have successfully been used to initialize deep neural networks (Hinton and Salakhutdinov, 2006; Hinton et al., 2006; Bengio et al., 2006). Imagine a probabilistic graphical model in which we introduce latent variables corresponding to the true explanatory factors of the observed data. Answering questions and learning dependencies in the space of these latent variables is likely to be easier than answering questions about the raw input: a simple linear classifier or non-parametric predictor trained from as few as one or a few examples might be able to do the job. The key to achieving this is learning better representations, mostly from unlabeled data; how this is done is what differentiates Deep Learning algorithms. The Deep Learning system we introduce in Section 3 is designed to use unlabeled data to extract high-level features from reviews.
We show in Section 4 that sentiment classifiers trained with these learnt features can: (i) surpass state-of-the-art performance on a benchmark of 4 kinds of products and (ii) successfully perform domain adaptation on a large-scale data set of 22 domains, beating all of the baselines we tried.
2. Domain Adaptation
Domain adaptation considers the setting in which the training and testing data are sampled from different distributions. Assume we have two sets of data: a source domain S providing labeled training instances and a target domain T providing instances on which the classifier is meant to be deployed. We do not assume that these are drawn from the same distribution, but rather that S is drawn from a distribution p_S and T from a distribution p_T. The learning problem consists in finding a function that realizes a good transfer from S to T, i.e., one that is trained on data drawn from p_S and generalizes well on data drawn from p_T.

Deep Learning algorithms learn intermediate concepts between the raw input and the target. Our intuition for using them in this setting is that these intermediate concepts could yield better transfer across domains. Suppose, for example, that these intermediate concepts indirectly capture things like product quality, product price, customer service, etc. Some of these concepts are general enough to make sense across a wide range of domains (corresponding to products or services, in the case of sentiment analysis). Because the same words or tuples of words may be used across domains to indicate the presence of these higher-level concepts, it should be possible to discover them. Furthermore, because Deep Learning exploits unsupervised learning to discover these concepts, one can exploit the large amounts of unlabeled data across all domains to learn these intermediate representations. Here, as in many other Deep Learning approaches, we do not engineer what these intermediate concepts should be, but instead use generic learning algorithms to discover them.
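A compact formal statement of this setup (our notation; the text above gives it only in prose) is that the classifier f is fitted on the labeled source sample but judged under the target distribution:

    \hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{|S|} \sum_{(x,y) \in S} \ell\big(f(x), y\big),
    \qquad \text{success measured by} \qquad
    \mathbb{E}_{(x,y) \sim p_T}\big[\ell\big(f(x), y\big)\big],

where \ell is a classification loss. The difficulty is precisely the mismatch between the distribution p_S under which f is trained and the distribution p_T under which it is evaluated.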
2.1. Related Work
Learning setups related to domain adaptation have been proposed before and published under different names. Daumé III and Marcu (2006) formalized the problem and proposed an approach based on a mixture model. A general way to address domain adaptation is through instance weighting, in which instance-dependent weights are added to the loss function (Jiang and Zhai, 2007). Another solution is to transform the data representations of the source and target domains so that they present the same joint distribution of observations and labels. Ben-David et al. (2007) formally analyze the effect of representation change for domain adaptation, while Blitzer et al. (2006) propose the Structural Correspondence Learning (SCL) algorithm, which makes use of the unlabeled data from the target domain to find a low-rank joint representation of the data. Finally, domain adaptation can simply be treated as a standard semi-supervised problem by ignoring the domain difference and considering the source instances as labeled data and the target ones as unlabeled data (Dai et al., 2007). In that case, the framework is very close to that of self-taught learning (Raina et al., 2007), in which one learns from labeled examples of some categories as well as unlabeled examples from a larger set of categories. Like the approach proposed here, that of Raina et al. (2007) relies crucially on the unsupervised learning of a representation.
3. Deep Learning Approach
3.1. Background
If Deep Learning algorithms are able to capture, to some extent, the underlying generative factors that explain the variations in the input data, what is really needed to exploit that ability is for the learned representations to help in disentangling the underlying factors of variation. The simplest and most useful way this could happen is if some of the features learned (the individual elements of the learned representation) are mostly related to only some of these factors, perhaps only one. Conversely, it would mean that such features would have invariant properties, i.e., they would be highly specific in their response to a subset (maybe only one) of these factors of variation and insensitive to the others. This hypothesis was tested by Goodfellow et al. (2009), for images and geometric invariances associated with movements of the camera.
It is interesting to evaluate Deep Learning algorithms on sentiment analysis for several reasons. First, if they can extract features that somewhat disentangle the underlying factors of variation, this would likely help to perform transfer across domains, since we expect that there exist generic concepts that characterize product reviews across many domains. Second, for our Amazon datasets, we know some of these factors (such as whether or not a review is about a particular product, or is a positive appraisal for that product), so we can use this knowledge to quantitatively check to what extent they are disentangled in the learned representation: domain adaptation for sentiment analysis becomes a medium for better understanding deep architectures. Finally, even though Deep Learning algorithms have not yet been evaluated for domain adaptation of sentiment classifiers, several very interesting results have been reported on other tasks involving textual data, beating the previous state-of-the-art in several cases (Salakhutdinov and Hinton, 2007; Collobert and Weston, 2008; Ranzato and Szummer, 2008).
3.2. Stacked Denoising Auto-encoders
The basic framework for our models is the Stacked Denoising Auto-encoder (Vincent et al., 2008). An auto-encoder is composed of an encoder function h(·) and a decoder function g(·), typically with the dimension of h(·) smaller than that of its argument. The reconstruction of input x is given by r(x) = g(h(x)), and auto-encoders are typically trained to minimize a form of reconstruction error loss(x, r(x)). Examples of reconstruction error include the squared error or, as here, when the elements of x or r(x) can be considered as probabilities of a discrete event, the Kullback-Leibler divergence between elements of x and elements of r(x). When the encoder and decoder are linear and the reconstruction error is quadratic, one recovers in h(x) the space of the principal components (PCA) of x. Once an auto-encoder has been trained, one can stack another auto-encoder on top of it by training a second one which sees the encoded output of the first one as its training data. Stacked auto-encoders were one of the first methods for building deep architectures (Bengio et al., 2006), along with Restricted Boltzmann Machines (RBMs) (Hinton et al., 2006). Once a stack of auto-encoders or RBMs has been trained, their parameters describe multiple levels of representation for x and can be used to initialize a supervised deep neural network (Bengio, 2009) or directly feed a classifier, as we do in this paper.
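As a concrete illustration, the following is a minimal PyTorch sketch of a two-layer stacked denoising auto-encoder with masking noise and cross-entropy reconstruction. This is our own simplification of the framework, not the exact model of this paper (which uses sparse rectifier units, as noted in the conclusion); the layer sizes, corruption level, and optimizer are illustrative choices.

    import torch
    import torch.nn as nn

    class DenoisingAutoencoder(nn.Module):
        """One layer of a stacked denoising auto-encoder (Vincent et al., 2008)."""
        def __init__(self, n_in, n_hid, corruption=0.3):
            super().__init__()
            self.encoder = nn.Linear(n_in, n_hid)
            self.decoder = nn.Linear(n_hid, n_in)
            self.corruption = corruption

        def encode(self, x):
            return torch.sigmoid(self.encoder(x))

        def forward(self, x):
            # Masking noise: zero out a random fraction of the input, then
            # reconstruct the *uncorrupted* x from the code of the corrupted one.
            mask = (torch.rand_like(x) > self.corruption).float()
            return torch.sigmoid(self.decoder(self.encode(x * mask)))

    def train_dae(dae, batches, epochs=10, lr=1e-3):
        opt = torch.optim.Adam(dae.parameters(), lr=lr)
        # Cross-entropy reconstruction: for a fixed target x in [0, 1] this differs
        # from the KL divergence mentioned above only by a constant (the entropy of x).
        loss_fn = nn.BCELoss()
        for _ in range(epochs):
            for x in batches:
                opt.zero_grad()
                loss_fn(dae(x), x).backward()
                opt.step()
        return dae

    # Stacking: the second auto-encoder is trained on the codes of the first.
    # x: (N, n_in) tensor of, e.g., binary bag-of-words review vectors in [0, 1].
    # dae1 = train_dae(DenoisingAutoencoder(n_in, 500), [x])
    # with torch.no_grad():
    #     h1 = dae1.encode(x)
    # dae2 = train_dae(DenoisingAutoencoder(500, 250), [h1])
    # The final features dae2.encode(dae1.encode(x)) then feed a linear classifier.

Note that each layer is trained greedily on the (fixed) output of the layer below, using unlabeled reviews only; the labels enter only when the final representation is fed to the supervised linear classifier.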
4. Conclusion
This paper has demonstrated that a Deep Learning system based on Stacked Denoising Auto-Encoders with sparse rectifier units can perform an unsupervised feature extraction which is highly beneficial for the domain adaptation of sentiment classifiers. Indeed, our experiments have shown that linear classifiers trained with this higher-level learnt feature representation of reviews outperform the current state-of-the-art. Furthermore, we have been able to successfully perform domain adaptation on an industrial-scale dataset of 22 domains, where we significantly improve generalization over the baseline and over a similarly structured but purely supervised alternative.