Introduction

The recent burst of activity in opinion mining and sentiment analysis, the field that deals with the computational treatment of opinion, sentiment, and subjectivity in text, has occurred at least in part as a direct response to the surge of interest in new systems that treat sentiment as a first-class problem. The task involves classifying opinions in text into categories such as "positive" or "negative"; it is also called opinion mining or voice of the customer. Sentiment analysis is used in several applications, particularly in business intelligence. A few common use cases include:

  • Analyzing social media content.
  • Evaluating movie/product reviews...

In this tutorial, we will focus on the last application. By the end of it, you will:

  • Understand how sentiment analysis works.
  • Clean and pre-process the dataset.
  • Leverage some machine learning/deep learning models to analyze the sentiment of texts.
  • Visualize the results...

Dataset

For this tutorial, we are going to use Python and a few libraries to analyze the sentiment of IMDB movie reviews. We will use a pre-constructed annotated dataset that contains 25,000 rows. The dataset can be downloaded from this link.

OK, let's take a look at our data by defining a function that loads the training data and extracts the two columns we need: Sentiment and SentimentText.

In [1]:
import pandas as pd
In [2]:
def ingest_train():
    # Load the raw training data
    data = pd.read_csv('~/dataset.csv')
    # Drop rows with a missing label or missing text
    data = data[data.Sentiment.notnull()]
    data['Sentiment'] = data['Sentiment'].map(int)
    data = data[data['SentimentText'].notnull()]
    # Rebuild a clean, contiguous index
    data.reset_index(drop=True, inplace=True)
    return data
In [3]:
train = ingest_train()
In [4]:
train.head()
Out[4]:
 
  SentimentText Sentiment
0 first think another Disney movie, might good, ... 1
1 Put aside Dr. House repeat missed, Desperate H... 0
2 big fan Stephen King's work, film made even gr... 1
3 watched horrid thing TV. Needless say one movie... 0
4 truly enjoyed a film. acting terrific plot. Jeff... 1
 

Data Preparation

Now, it looks like it’s time for some cleaning!

Let's first define the data-cleaning function, then apply it to the whole dataset. This function removes URLs and HTML tags, handles negation words that are split into two parts, converts words to lower case, and removes all non-letter characters. These elements are very common, and they do not carry enough semantic information for the task.

In [5]:
import re

# Patterns for @mentions/URLs and for hashtags
pat_1 = r"(?:\@|https?\://)\S+"
pat_2 = r'#\w+ ?'
combined_pat = r'|'.join((pat_1, pat_2))
www_pat = r'www\.[^ ]+'
html_tag = r'<[^>]+>'
# Map contracted negations to their two-word forms
negations_ = {"isn't":"is not", "can't":"can not", "couldn't":"could not",
              "hasn't":"has not", "hadn't":"had not", "won't":"will not",
              "wouldn't":"would not", "aren't":"are not", "haven't":"have not",
              "doesn't":"does not", "didn't":"did not", "don't":"do not",
              "shouldn't":"should not", "wasn't":"was not", "weren't":"were not",
              "mightn't":"might not", "mustn't":"must not"}
negation_pattern = re.compile(r'\b(' + '|'.join(negations_.keys()) + r')\b')
In [6]:
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
In [7]:
def data_cleaner(text):
    try:
        # Strip @mentions, URLs, and hashtags
        stripped = re.sub(combined_pat, '', text)
        stripped = re.sub(www_pat, '', stripped)
        # Strip HTML tags
        cleantags = re.sub(html_tag, '', stripped)
        lower_case = cleantags.lower()
        # Expand contracted negations ("isn't" -> "is not")
        neg_handled = negation_pattern.sub(lambda x: negations_[x.group()], lower_case)
        # Keep letters only
        letters_only = re.sub("[^a-zA-Z]", " ", neg_handled)
        tokens = tokenizer.tokenize(letters_only)
        return (" ".join(tokens)).strip()
    except:
        # 'NC' marks rows that could not be cleaned
        return 'NC'
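
As a quick sanity check, we can run data_cleaner on a made-up review (the sample sentence below is hypothetical, and the expected output is shown as a comment):

sample = "I didn't like it... <br /> see http://example.com #overrated"
print(data_cleaner(sample))
# expected output: i did not like it see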
 

This should give us a cleaned dataset; rows that fail to clean are marked 'NC' (not cleaned), so we can drop them afterwards (see the snippet after the cleaning run below).

Next, let's define a handy function to monitor DataFrame creations, then look at our cleaned data.

In [8]:
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")

def post_process(data, n=1000000):
    # Work on a copy to avoid pandas' SettingWithCopyWarning
    data = data.head(n).copy()
    # progress_map shows a tqdm progress bar while cleaning
    data['SentimentText'] = data['SentimentText'].progress_map(data_cleaner)
    data.reset_index(drop=True, inplace=True)
    return data
In [9]:
train = post_process(train)
 
progress-bar: 100%|██████████| 25000/25000 [00:05<00:00, 4699.59it/s]
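
Since data_cleaner returns 'NC' for rows it could not process, a minimal follow-up step (a sketch; on this dataset there may be no such rows) is to drop them:

# Drop any rows the cleaner marked as 'NC' (not cleaned)
train = train[train.SentimentText != 'NC']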
 

After that, we can save the cleaned data as a CSV file.

In [10]:
clean_data = pd.DataFrame(train,columns=['SentimentText'])
clean_data['Sentiment'] = train.Sentiment

clean_data.to_csv('~/clean_data.csv',encoding='utf-8')

csv = '~/clean_data.csv'
data = pd.read_csv(csv,index_col=0)
data.head()
Out[10]:
  SentimentText Sentiment
0 first think another disney movie might good it... 1
1 put aside dr house repeat missed desperate hou... 0
2 big fan stephen king s work film made even gre... 1
3 watched horrid thing tv needless say one movie... 0
4 truly enjoyed film acting terrific plot jeff c... 1
 

Data visualization

Before proceeding to the classification step, let's visualize our textual data. A word cloud is a good fit here: it is a visual representation of text data that displays a list of words, with the importance of each shown through font size or color. This format is useful for quickly perceiving the most prominent terms.

For this data viz, we use the Python library wordcloud.

Let's begin with the word cloud of negative terms.

In [11]:
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
from wordcloud import WordCloud, STOPWORDS


# Concatenate all negative reviews into one string
neg_tweets = train[train.Sentiment == 0]
neg_string = ' '.join(neg_tweets.SentimentText)

wordcloud = WordCloud(width=1600, height=800, max_font_size=200).generate(neg_string)
plt.figure(figsize=(12,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

[Word cloud of the most frequent negative terms]

The word cloud for the positive terms:

In [12]:
# Concatenate all positive reviews into one string
pos_tweets = train[train.Sentiment == 1]
pos_string = ' '.join(pos_tweets.SentimentText)

wordcloud = WordCloud(width=1600, height=800, max_font_size=200, colormap='magma').generate(pos_string)
plt.figure(figsize=(12,10)) 
plt.imshow(wordcloud, interpolation="bilinear") 
plt.axis("off") 
plt.show()

[Word cloud of the most frequent positive terms]

We can see some neutral words in large size, such as "film", "movie", and "character", but words like "good", "love", and "great" stand out as positive, while "bad" and "little" stand out as negative.
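
Since these neutral words dominate both clouds, one optional tweak, shown here as a sketch rather than as part of the original pipeline, is to add them to wordcloud's STOPWORDS set before generating:

# Sketch: treat the domain-specific neutral words as stopwords
stopwords = set(STOPWORDS) | {'film', 'movie', 'character'}
wordcloud = WordCloud(width=1600, height=800, max_font_size=200,
                      stopwords=stopwords).generate(neg_string)
# plotting is the same as above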

 

Building the models

Before proceeding to the training phase, let's split our data into training and validation sets.

In [10]:
# Splitting the data
from sklearn.model_selection import train_test_split
SEED = 2000

x_train, x_validation, y_train, y_validation = train_test_split(
    train.SentimentText, train.Sentiment, test_size=.2, random_state=SEED)

Feature Extraction

In this part, we will use a feature-extraction technique called TF-IDF vectorization, with 100,000 features including n-grams up to trigrams. This technique is a way to convert textual data to numeric form.
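
To make the idea concrete, here is a minimal sketch of TF-IDF on a toy two-document corpus (the sentences are made up):

from sklearn.feature_extraction.text import TfidfVectorizer

toy = ["great movie great acting", "bad movie"]
demo_vec = TfidfVectorizer(ngram_range=(1, 3))
matrix = demo_vec.fit_transform(toy)
# Each row is a document; each column a unigram, bigram, or trigram,
# weighted up by in-document frequency and down by corpus-wide frequency
print(matrix.shape)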

In the classifier_comparator function below, we use a custom function, acc_summary, which reports the validation accuracy of each classifier and the time it took to train and evaluate it.

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from time import time


def acc_summary(pipeline, x_train, y_train, x_test, y_test):
    # Fit the pipeline and time the whole train/predict cycle
    t0 = time()
    sentiment_fit = pipeline.fit(x_train, y_train)
    y_pred = sentiment_fit.predict(x_test)
    train_test_time = time() - t0
    accuracy = accuracy_score(y_test, y_pred)
    print("accuracy score: {0:.2f}%".format(accuracy*100))
    print("train and test time: {0:.2f}s".format(train_test_time))
    print("-"*80)
    return accuracy, train_test_time


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.linear_model import Perceptron
from sklearn.neighbors import NearestCentroid
from sklearn.feature_selection import SelectFromModel

names = ["Logistic Regression", "Linear SVC", "LinearSVC with L1-based feature selection",
         "Multinomial NB", "Bernoulli NB", "Ridge Classifier", "AdaBoost",
         "Perceptron", "Passive-Aggressive", "Nearest Centroid"]
classifiers = [
    LogisticRegression(),
    LinearSVC(),
    Pipeline([
        ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
        ('classification', LinearSVC(penalty="l2"))]),
    MultinomialNB(),
    BernoulliNB(),
    RidgeClassifier(),
    AdaBoostClassifier(),
    Perceptron(),
    PassiveAggressiveClassifier(),
    NearestCentroid()
    ]
zipped_clf = zip(names, classifiers)

tvec = TfidfVectorizer()

def classifier_comparator(vectorizer=tvec, n_features=10000, stop_words=None,
                          ngram_range=(1, 1), classifier=zipped_clf):
    # Run each classifier through the same vectorizer and report its accuracy
    result = []
    vectorizer.set_params(stop_words=stop_words, max_features=n_features,
                          ngram_range=ngram_range)
    for n, c in classifier:
        checker_pipeline = Pipeline([
            ('vectorizer', vectorizer),
            ('classifier', c)
        ])
        print("Validation result for {}".format(n))
        print(c)
        clf_acc, tt_time = acc_summary(checker_pipeline, x_train, y_train,
                                       x_validation, y_validation)
        result.append((n, clf_acc, tt_time))
    return result

trigram_result = classifier_comparator(n_features=100000, ngram_range=(1,3))
 
Validation result for Logistic Regression
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
accuracy score: 89.36%
train and test time: 43.08s
--------------------------------------------------------------------------------
Validation result for Linear SVC
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
accuracy score: 90.48%
train and test time: 104.01s
--------------------------------------------------------------------------------
Validation result for LinearSVC with L1-based feature selection
Pipeline(memory=None,
     steps=[('feature_selection', SelectFromModel(estimator=LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l1', random_state=None, tol=0.0001,
     verbose=0),
        norm_order=1, prefit...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])
accuracy score: 89.62%
train and test time: 49.88s
--------------------------------------------------------------------------------
Validation result for Multinomial NB
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
accuracy score: 87.86%
train and test time: 47.01s
--------------------------------------------------------------------------------
Validation result for Bernoulli NB
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
accuracy score: 87.82%
train and test time: 45.54s
--------------------------------------------------------------------------------
Validation result for Ridge Classifier
RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
        max_iter=None, normalize=False, random_state=None, solver='auto',
        tol=0.001)
accuracy score: 90.54%
train and test time: 45.28s
--------------------------------------------------------------------------------
Validation result for AdaBoost
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
accuracy score: 80.72%
train and test time: 63.34s
--------------------------------------------------------------------------------
Validation result for Perceptron
Perceptron(alpha=0.0001, class_weight=None, eta0=1.0, fit_intercept=True,
      max_iter=None, n_iter=None, n_jobs=1, penalty=None, random_state=0,
      shuffle=True, tol=None, verbose=0, warm_start=False)
 
/home/oumaima/anaconda2/envs/nlp/lib/python2.7/site-packages/sklearn/linear_model/stochastic_gradient.py:128: FutureWarning: max_iter and tol parameters have been added in <class 'sklearn.linear_model.perceptron.Perceptron'> in 0.19. If both are left unset, they default to max_iter=5 and tol=None. If tol is not None, max_iter defaults to max_iter=1000. From 0.21, default max_iter will be 1000, and default tol will be 1e-3.
  "and default tol will be 1e-3." % type(self), FutureWarning)
 
accuracy score: 89.02%
train and test time: 42.39s
--------------------------------------------------------------------------------
Validation result for Passive-Aggressive
PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None,
              fit_intercept=True, loss='hinge', max_iter=None, n_iter=None,
              n_jobs=1, random_state=None, shuffle=True, tol=None,
              verbose=0, warm_start=False)
 
/home/oumaima/anaconda2/envs/nlp/lib/python2.7/site-packages/sklearn/linear_model/stochastic_gradient.py:128: FutureWarning: max_iter and tol parameters have been added in <class 'sklearn.linear_model.passive_aggressive.PassiveAggressiveClassifier'> in 0.19. If both are left unset, they default to max_iter=5 and tol=None. If tol is not None, max_iter defaults to max_iter=1000. From 0.21, default max_iter will be 1000, and default tol will be 1e-3.
  "and default tol will be 1e-3." % type(self), FutureWarning)
 
accuracy score: 90.02%
train and test time: 42.19s
--------------------------------------------------------------------------------
Validation result for Nearest Centroid
NearestCentroid(metric='euclidean', shrink_threshold=None)
accuracy score: 81.50%
train and test time: 41.79s
--------------------------------------------------------------------------------
 

A summary of the results for comparison is given below.

Classifier                                  Accuracy   Train and test time
Logistic Regression                         89.36%     48.01s
Linear SVC                                  90.48%     130.99s
LinearSVC with L1-based feature selection   89.68%     68.35s
Multinomial NB                              87.86%     48.57s
Bernoulli NB                                87.82%     66.07s
Ridge Classifier                            90.54%     46.96s
AdaBoost                                    80.72%     66.22s
Perceptron                                  89.02%     44.78s
Passive-Aggressive                          90.12%     45.59s
Nearest Centroid                            81.50%     52.60s

Thus, it looks like the Ridge Classifier and Linear SVC are the best-performing classifiers in our case.
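
As a closing usage sketch (an illustration, not part of the original pipeline), we can refit the best-performing setup, the Ridge Classifier with the same trigram TF-IDF settings, and score a new made-up review:

final_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(max_features=100000, ngram_range=(1, 3))),
    ('classifier', RidgeClassifier())
])
final_pipeline.fit(x_train, y_train)

# New text must be cleaned the same way as the training data
new_review = data_cleaner("This movie wasn't good, the plot is horrid!")
print(final_pipeline.predict([new_review]))  # 1 = positive, 0 = negative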

That's it for this part. In the next post, we will try Word2vec and Doc2Vec to see whether they bring any improvement in performance.

Thank you for reading :)

 

Oumaima Hourrane

PhD Student at the Faculty of Science Ben M'Sik, Casablanca