Introduction
The recent burst of activity in the field of opinion mining and sentiment analysis, which deals with the computational treatment of opinion, sentiment, and subjectivity in text, has occurred at least in part as a direct response to the surge of interest in systems that treat opinions as a first-class object. The task involves classifying opinions in text into categories such as "positive" or "negative"; it is also referred to as opinion mining or voice of the customer. Sentiment analysis has many applications, particularly in business intelligence. A few common use cases include:
- Analysing social media content.
- Evaluating movie/product reviews...
In this tutorial, we will focus on the last application. By the end of it, you will:
- Understand how sentiment analysis works.
- Clean and pre-process the dataset.
- Leverage some machine learning/deep learning models to analyze the sentiment of texts.
- Visualize the results...
Dataset
For this tutorial, we are going to use Python and a few supporting libraries to analyze the sentiment of IMDB movie reviews. We will use a pre-constructed, annotated dataset containing 25,000 rows. The dataset can be downloaded from this link.
OK, let's take a look at our data by defining a function that loads the training data and extracts the two columns we need: Sentiment and SentimentText.
import pandas as pd

def ingest_train():
    # Load the raw training data
    data = pd.read_csv('~/dataset.csv')
    # Drop rows with a missing label and cast the label to an integer
    data = data[data.Sentiment.isnull() == False]
    data['Sentiment'] = data['Sentiment'].map(int)
    # Drop rows with missing review text
    data = data[data['SentimentText'].isnull() == False]
    # Rebuild a clean, contiguous index
    data.reset_index(inplace=True)
    data.drop('index', axis=1, inplace=True)
    return data

train = ingest_train()
train.head()
Data Preparation
Now it looks like it's time for some cleaning!
Let's first define the data-cleaning function, then apply it to the whole dataset. This function removes URLs and HTML tags, expands negation contractions (which would otherwise be split into two meaningless parts), converts the text to lower case, and strips all non-letter characters. These elements are very common, and they do not carry enough semantic information for this task.
import re
from nltk.tokenize import WordPunctTokenizer

# Patterns for @mentions/URLs and for hashtags
pat_1 = r"(?:\@|https?\://)\S+"
pat_2 = r'#\w+ ?'
combined_pat = r'|'.join((pat_1, pat_2))
www_pat = r'www\.[^ ]+'
html_tag = r'<[^>]+>'

# Map negation contractions to their expanded forms
negations_ = {"isn't": "is not", "can't": "can not", "couldn't": "could not",
              "hasn't": "has not", "hadn't": "had not", "won't": "will not",
              "wouldn't": "would not", "aren't": "are not", "haven't": "have not",
              "doesn't": "does not", "didn't": "did not", "don't": "do not",
              "shouldn't": "should not", "wasn't": "was not", "weren't": "were not",
              "mightn't": "might not", "mustn't": "must not"}
negation_pattern = re.compile(r'\b(' + '|'.join(negations_.keys()) + r')\b')

tokenizer = WordPunctTokenizer()

def data_cleaner(text):
    try:
        # Strip @mentions, URLs, hashtags, and www links
        stripped = re.sub(combined_pat, '', text)
        stripped = re.sub(www_pat, '', stripped)
        # Remove HTML tags and lower-case the text
        cleantags = re.sub(html_tag, '', stripped)
        lower_case = cleantags.lower()
        # Expand negation contractions, then keep letters only
        neg_handled = negation_pattern.sub(lambda x: negations_[x.group()], lower_case)
        letters_only = re.sub("[^a-zA-Z]", " ", neg_handled)
        # Tokenize and rejoin to normalize whitespace
        tokens = tokenizer.tokenize(letters_only)
        return (" ".join(tokens)).strip()
    except Exception:
        # Mark rows that could not be cleaned
        return 'NC'
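As a quick sanity check, here is what the cleaner does to an invented example review:

# The input string is made up for illustration
print(data_cleaner("I didn't like this <br /> movie!! http://example.com"))
# Output: i did not like this movie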
Running this over the dataset should give us cleaned text, with any row that fails marked as 'NC' so it can be filtered out afterwards (a short filter follows the next code block).
Next, let's define a handy function that applies the cleaner to the whole DataFrame with a progress bar, then look at our cleaned data.
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")

def post_process(data, n=1000000):
    # Clean at most n rows, with a tqdm progress bar
    data = data.head(n)
    data['SentimentText'] = data['SentimentText'].progress_map(data_cleaner)
    # Rebuild a contiguous index after cleaning
    data.reset_index(inplace=True)
    data.drop('index', inplace=True, axis=1)
    return data

train = post_process(train)
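Since the cleaner marks failed rows with 'NC' rather than dropping them, a small filter (a sketch; it is implied but not shown in the original walkthrough) removes those rows before we continue:

# Drop any rows the cleaner flagged as 'NC'
train = train[train.SentimentText != 'NC']
train.reset_index(drop=True, inplace=True)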
After that, we can save the cleaned data as a CSV file.
clean_data = pd.DataFrame(train, columns=['SentimentText'])
clean_data['Sentiment'] = train.Sentiment
clean_data.to_csv('~/clean_data.csv', encoding='utf-8')

csv = '~/clean_data.csv'
data = pd.read_csv(csv, index_col=0)
data.head()
Data visualization
Before proceeding to the classification step, let's do some visualization of our textual data. A word cloud is a good fit for this: it is a visual representation of text data that displays a list of words, with the importance of each shown through its font size or color. This format is useful for quickly perceiving the most prominent terms.
For this visualization, we use the Python library wordcloud.
Let's begin with the word cloud of negative terms.
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from wordcloud import WordCloud, STOPWORDS

# Concatenate all negative reviews into a single string
neg_reviews = train[train.Sentiment == 0]
neg_string = []
for t in neg_reviews.SentimentText:
    neg_string.append(t)
neg_string = pd.Series(neg_string).str.cat(sep=' ')

wordcloud = WordCloud(width=1600, height=800, max_font_size=200).generate(neg_string)
plt.figure(figsize=(12, 10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
Now the word cloud for the positive terms.
# Concatenate all positive reviews into a single string
pos_reviews = train[train.Sentiment == 1]
pos_string = []
for t in pos_reviews.SentimentText:
    pos_string.append(t)
pos_string = pd.Series(pos_string).str.cat(sep=' ')

wordcloud = WordCloud(width=1600, height=800, max_font_size=200, colormap='magma').generate(pos_string)
plt.figure(figsize=(12, 10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
We can see some neutral words in large type, such as "film", "movie", and "character", but words like "good", "love", and "great" stand out as relevant in the positive cloud, and "bad" and "little" in the negative one.
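Those neutral, domain-specific words dominate both clouds. One optional refinement, sketched below using the STOPWORDS set we already imported, is to add such terms to WordCloud's stop-word list so the sentiment-bearing words stand out more ("one" and "story" are illustrative additions):

# Extend the built-in stop words with neutral, domain-specific terms
custom_stopwords = set(STOPWORDS) | {"film", "movie", "character", "one", "story"}
wordcloud = WordCloud(width=1600, height=800, max_font_size=200,
                      stopwords=custom_stopwords).generate(neg_string)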
Building the models
Before proceeding to the training phase, let's split our data into training and validation sets.
# Splitting the data
from sklearn.model_selection import train_test_split

SEED = 2000
x_train, x_validation, y_train, y_validation = train_test_split(
    train.SentimentText, train.Sentiment, test_size=.2, random_state=SEED)
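As a quick sanity check: with 25,000 rows and test_size=.2, the split should leave 20,000 reviews for training and 5,000 for validation.

print(len(x_train), len(x_validation))
# Expected output: 20000 5000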
Features Extraction
In this part, we will extract features with a TF-IDF vectorizer limited to 100,000 features, including n-grams up to trigrams. This technique converts the textual data into a numeric form that the classifiers can work with.
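To make the idea concrete, here is a minimal sketch (on a made-up two-sentence corpus) of how TfidfVectorizer turns text into a sparse numeric matrix:

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus for illustration only
docs = ["a great movie", "a bad movie"]
vec = TfidfVectorizer(ngram_range=(1, 3))  # unigrams up to trigrams
X = vec.fit_transform(docs)
print(X.shape)          # (2, number_of_ngram_features) sparse matrix
print(vec.vocabulary_)  # maps each n-gram to its column index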
In the classifier_comparator function below, we rely on a custom helper, acc_summary, which reports the validation accuracy of each pipeline and the time it took to train and evaluate it.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np
from time import time

def acc_summary(pipeline, x_train, y_train, x_test, y_test):
    # Fit the pipeline, predict on the validation set, and time both steps
    t0 = time()
    sentiment_fit = pipeline.fit(x_train, y_train)
    y_pred = sentiment_fit.predict(x_test)
    train_test_time = time() - t0
    accuracy = accuracy_score(y_test, y_pred)
    print("accuracy score: {0:.2f}%".format(accuracy * 100))
    print("train and test time: {0:.2f}s".format(train_test_time))
    print("-" * 80)
    return accuracy, train_test_time
from sklearn.feature_extraction.text import TfidfVectorizer
tvec = TfidfVectorizer()
from sklearn.svm import LinearSVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.linear_model import Perceptron
from sklearn.neighbors import NearestCentroid
from sklearn.feature_selection import SelectFromModel
names = ["Logistic Regression", "Linear SVC", "LinearSVC with L1-based feature selection",
         "Multinomial NB", "Bernoulli NB", "Ridge Classifier", "AdaBoost", "Perceptron",
         "Passive-Aggressive", "Nearest Centroid"]
classifiers = [
    LogisticRegression(),
    LinearSVC(),
    Pipeline([
        ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
        ('classification', LinearSVC(penalty="l2"))]),
    MultinomialNB(),
    BernoulliNB(),
    RidgeClassifier(),
    AdaBoostClassifier(),
    Perceptron(),
    PassiveAggressiveClassifier(),
    NearestCentroid()
]
zipped_clf = zip(names, classifiers)
def classifier_comparator(vectorizer=tvec, n_features=10000, stop_words=None,
                          ngram_range=(1, 1), classifier=zipped_clf):
    result = []
    # Configure the shared vectorizer for this comparison run
    vectorizer.set_params(stop_words=stop_words, max_features=n_features, ngram_range=ngram_range)
    for n, c in classifier:
        # Chain the vectorizer and the classifier into a single pipeline
        checker_pipeline = Pipeline([
            ('vectorizer', vectorizer),
            ('classifier', c)
        ])
        print("Validation result for {}".format(n))
        print(c)
        clf_acc, tt_time = acc_summary(checker_pipeline, x_train, y_train, x_validation, y_validation)
        result.append((n, clf_acc, tt_time))
    return result

trigram_result = classifier_comparator(n_features=100000, ngram_range=(1, 3))
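Since classifier_comparator returns a list of (name, accuracy, time) tuples, a small sketch like the following can tabulate the run (the DataFrame and its column labels are our own, not part of the original pipeline):

# Tabulate the comparison results, best accuracy first
results_df = pd.DataFrame(trigram_result, columns=['Classifier', 'Accuracy', 'Train and test time'])
print(results_df.sort_values('Accuracy', ascending=False))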
A summary of the results for comparison is given below.
| Classifier | Accuracy | Train and test time |
|---|---|---|
| Logistic regression | 89.36% | 48.01s |
| Linear SVC | 90.48% | 130.99s |
| LinearSVC with L1-based feature selection | 89.68% | 68.35s |
| Multinomial NB | 87.86% | 48.57s |
| Bernoulli NB | 87.82% | 66.07s |
| Ridge Classifier | 90.54% | 46.96s |
| AdaBoost | 80.72% | 66.22s |
| Perceptron | 89.02% | 44.78s |
| Passive-Aggressive | 90.12% | 45.59s |
| Nearest Centroid | 81.50% | 52.60s |
Thus, it looks like the Ridge Classifier and Linear SVC are the best-performing classifiers in our case.
That's it for this part. In the next post, we will implement Word2Vec and Doc2Vec to see whether they bring any improvement in performance.
Thank you for reading :)
Oumaima Hourrane
PhD student, Faculty of Sciences Ben M'Sik, Casablanca