This tutorial is the second part of our sentiment analysis task. Here we compare the word2vec and doc2vec models, but before jumping in, let's give a brief introduction to the two techniques.
Word2vec
Word2vec computes estimations of word representations in vector space and was developed by Mikolov et al. It provides an efficient implementation of the continuous bag-of-words and skip-gram models for computing vector representations of words. These are the two main learning algorithms for distributed representations of words, and both aim to minimize computational complexity.
The Continuous Bag of Words (CBOW)
In this model the non-linear hidden layer is removed and the projection layer is shared by all words. The model predicts the current word from the N words before and after it; e.g. with N=2 the architecture is the one shown in figure 1. By ignoring the order of the words in the sequence, CBOW uses the average of the context word embeddings to predict the current word.
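To make the windowing concrete, here is a minimal, self-contained sketch (not part of the tutorial's pipeline) of how, with N=2, the context around a position is collected and its embeddings averaged; the embedding table is a made-up toy dictionary:

import numpy as np

# toy embedding table (made-up 3-dimensional vectors, for illustration only)
embeddings = {
    "the": np.array([0.1, 0.0, 0.2]),
    "cat": np.array([0.3, 0.5, 0.1]),
    "sat": np.array([0.2, 0.4, 0.4]),
    "on":  np.array([0.0, 0.1, 0.3]),
    "mat": np.array([0.4, 0.2, 0.0]),
}

def cbow_input(tokens, position, n=2):
    # average the embeddings of the n words before and after `position`;
    # this averaged context vector is what CBOW uses to predict the center word
    context = tokens[max(0, position - n):position] + tokens[position + 1:position + 1 + n]
    return np.mean([embeddings[w] for w in context], axis=0)

tokens = ["the", "cat", "sat", "on", "mat"]
print(cbow_input(tokens, position=2))  # context: the, cat, on, mat -> predict "sat"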
The Skip-gram model
The Skip-gram model is similar to CBOW, but instead of predicting the current word from its context, it tries to maximize the classification of a word based on another word in the same sentence. The Skip-gram architecture performs slightly worse than CBOW on the syntactic tasks, but much better than all the other models on the semantic part of the test.
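As a rough, self-contained illustration (again outside the tutorial's pipeline), the training pairs Skip-gram builds from a sentence look like this: for each center word it tries to predict every word in the surrounding window.

def skipgram_pairs(tokens, n=2):
    # generate (center, context) training pairs for a window of n words on each side;
    # Skip-gram learns to predict each context word from its center word
    pairs = []
    for pos, center in enumerate(tokens):
        context = tokens[max(0, pos - n):pos] + tokens[pos + 1:pos + 1 + n]
        pairs.extend((center, c) for c in context)
    return pairs

print(skipgram_pairs(["the", "cat", "sat", "on", "mat"]))
# e.g. ('sat', 'the'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'mat'), ...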
Doc2vec
Doc2vec is an extension of word2vec that goes beyond the word level to learn document-level representations. It builds on the techniques presented above in order to overcome a limitation of word vectors: the meaning of a document is not simply the composition of the meanings of its individual words.
Distributed Memory - DM
This model is analogous to the CBOW model in word2vec. The paragraph vectors are obtained by training a neural network on the task of inferring a center word from the context words and a context paragraph. Mikolov et al. implemented the DM model in two different ways: by averaging the context vectors or by concatenating them (DMM and DMC respectively).
Distributed Bag of Words - DBOW
This model is analogous to the Skip-gram model in word2vec. The paragraph vectors are obtained by training a neural network on the task of predicting a probability distribution of words in a paragraph, given a randomly sampled word from the paragraph.
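To summarize the two doc2vec flavours, here is a purely illustrative sketch of what a single training example looks like in each case (this is not how Gensim represents them internally):

import random

# a toy paragraph with an ID, for illustration only
paragraph_id = "DOC_0"
tokens = ["this", "movie", "was", "really", "good"]

# DM: the context words plus the paragraph ID are used to predict the center word
dm_example = (["was", "good", paragraph_id], "really")   # window of 1 around "really"

# DBOW: the paragraph ID alone is used to predict a randomly sampled word
dbow_example = (paragraph_id, random.choice(tokens))

print(dm_example)
print(dbow_example)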
In our tutorial, in order to get the vectors for each row, we will implement and compare, using the Python library Gensim, the models below:
- Word2vec
- DBOW (Distributed Bag of Words)
- DMC (Distributed Memory Concatenated)
- DMM (Distributed Memory Mean)
- DBOW + DMC
- DBOW + DMM
These representations take our dataset as input and produce word vectors as output. They first construct a vocabulary from the training text data and then learn vector representations of the words. The resulting vectors can be used as features in the next step for sentiment analysis, where we use a simple neural network for training and evaluate the result on the validation set.
Let's begin :)
Firstly we load the cleaned data (see the previous part here), then split our data into training and validation sets.
import os
import sys
import gensim
import pandas as pd
from gensim.models.doc2vec import LabeledSentence  # deprecated alias of TaggedDocument; removed in Gensim 4+
csv = '~/clean_data.csv'
data = pd.read_csv(csv,index_col=0)
data.head()
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn versions
SEED = 2000
# hold out 20% of the data as a validation set
x_train, x_validation, y_train, y_validation = train_test_split(data.SentimentText, data.Sentiment, test_size=.2, random_state=SEED)
Next, we label each text with a unique ID using Gensim's LabeledSentence function, as shown below, and then concatenate the training and validation sets for word representation. We can do this because word2vec and doc2vec training is completely unsupervised, so there is no need to hold out any data: the sentiment labels are never used.
def labelize_text(text, label):
    # give every document a unique tag such as 'ALL_42', keeping the original index
    result = []
    prefix = label
    for i, t in zip(text.index, text):
        result.append(LabeledSentence(t.split(), [prefix + '_%s' % i]))
    return result
all_x = pd.concat([x_train,x_validation])
all_x_w2v = labelize_text(all_x, 'ALL')
x_train = labelize_text(x_train, 'TRAIN')
x_validation = labelize_text(x_validation, 'TEST')
Now let's train our first model, word2vec, on our corpus; we set the size of the output vectors to 200.
from gensim.models.word2vec import Word2Vec
from tqdm import tqdm
from sklearn import utils
import numpy as np
model_w2v = Word2Vec(size=200, min_count=10)  # 'size' was renamed to 'vector_size' in Gensim 4+
model_w2v.build_vocab([x.words for x in tqdm(all_x_w2v)])  # build the vocabulary first
model_w2v.train([x.words for x in tqdm(all_x_w2v)], total_examples=len(all_x_w2v), epochs=1)
After training our model, we can now use it to convert words to vectors, as in the example below.
model_w2v['good']  # the 200-dimensional vector for 'good' (model_w2v.wv['good'] in Gensim 4+)
We can use the result of the training to extract the similarities of a given word as well.
model_w2v.most_similar('good')
Data Viz
We can also project our vocabulary into a vector space model, which embeds words in a continuous vector space where semantically similar words are mapped to nearby points. We visualize the learned vectors by projecting them down to 2 dimensions with the t-SNE dimensionality reduction technique, using the interactive visualization tool Bokeh.
import bokeh.plotting as bp
from bokeh.models import HoverTool, BoxSelectTool
from bokeh.plotting import figure, show, output_notebook
output_notebook()
plot_tfidf = bp.figure(plot_width=700, plot_height=600,
                       title="A map of 5000 word vectors",
                       tools="pan,wheel_zoom,box_zoom,reset,hover,save",
                       x_axis_type=None, y_axis_type=None, min_border=1)
# take the first 5000 vocabulary words (the dict keys must be wrapped in list() in Python 3)
top_words = list(model_w2v.wv.vocab.keys())[:5000]
word_vec = [model_w2v[w] for w in top_words]
from sklearn.manifold import TSNE
tsne_model = TSNE(n_components=2, verbose=1, random_state=0)
tsne_w2v = tsne_model.fit_transform(word_vec)
tsne_df = pd.DataFrame(tsne_w2v, columns=['x', 'y'])
tsne_df['words'] = top_words
plot_tfidf.scatter(x='x', y='y', source=tsne_df)
hover = plot_tfidf.select(dict(type=HoverTool))
hover.tooltips={"word": "@words"}
show(plot_tfidf)
When we inspect these visualizations (zooming in and out), it becomes apparent that the vectors capture some general, and in fact quite useful, semantic information about words and their relationships to one another. It is very interesting to discover that certain directions in the induced vector space specialize towards particular semantic relationships.
Sentiment classification
In order to classify the sentiments of the reviews in our data, we also have to turn the reviews into vectors. The simplest way to represent a sentence is to sum the vectors of all its words, ignoring word order. In this tutorial, however, we use a weighted average of the word vectors with TF-IDF weights, where each weight reflects the importance of the word with respect to the corpus and decreases the influence of the most common words. According to Kenter et al., averaging the word embeddings of all words in a text is a strong baseline or feature across a multitude of NLP tasks.
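Concretely, the review vector we build below is (roughly) the IDF-weighted average of the review's in-vocabulary word vectors:

$$ v(d) = \frac{1}{|\{w \in d : w \in V\}|} \sum_{w \in d,\; w \in V} \mathrm{idf}(w)\, v_w $$

where $V$ is the vocabulary shared by the word2vec model and the TF-IDF vectorizer, $v_w$ is the word vector of $w$, and $\mathrm{idf}(w)$ is its inverse document frequency weight.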
Let's build our TF-IDF matrix, then define the function that creates an averaged review vector.
from sklearn.feature_extraction.text import TfidfVectorizer
# the analyzer receives already tokenised lists, so we pass them through unchanged
vectorizer = TfidfVectorizer(analyzer=lambda x: x, min_df=10)
matrix = vectorizer.fit_transform([x.words for x in all_x_w2v])
# map each vocabulary word to its IDF weight
tfidf = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))
def build_Word_Vector(tokens, size):
    # TF-IDF weighted average of the word2vec vectors of a tokenised review
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += model_w2v[word].reshape((1, size)) * tfidf[word]
            count += 1.
        except KeyError:  # word absent from the word2vec or TF-IDF vocabulary
            continue
    if count != 0:
        vec /= count
    return vec
We can now convert our training and validation sets into lists of vectors using the previous function. We also scale each column to have zero mean and unit standard deviation. After that, we feed the resulting vectors to our neural network, which is composed of three hidden layers, each with 256 nodes. Then, after training, we evaluate it on the validation set.
from sklearn.preprocessing import scale
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
train_vecs_w2v = np.concatenate([build_Word_Vector(z, 200) for z in tqdm(map(lambda x: x.words, x_train))])
train_vecs_w2v = scale(train_vecs_w2v)
val_vecs_w2v = np.concatenate([build_Word_Vector(z, 200) for z in tqdm(map(lambda x: x.words, x_validation))])
val_vecs_w2v = scale(val_vecs_w2v)
model = Sequential()
model.add(Dense(256, activation='relu', input_dim=200))
model.add(Dense(256, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(train_vecs_w2v, y_train, epochs=100, batch_size=32, verbose=2)
score = model.evaluate(val_vecs_w2v, y_validation, batch_size=128, verbose=2)
print(score[1])
The accuracy of the model using word2vec is 72.42%, which is quite good. Let's compare it with the other Doc2vec models.
Distributed Bag Of Words - DBOW
In the Doc2vec model, a word vector W is generated for each word and a document vector D is generated for each document; the model also trains the weights of a softmax hidden layer. We will follow the same steps as for the previous word2vec model: training, averaging the vectors, feeding the neural network, and so on.
from gensim.models import Doc2Vec
import multiprocessing
cores = multiprocessing.cpu_count()
# dm=0 selects the DBOW architecture; with the default dbow_words=0, only the document
# vectors are trained, which keeps training fast and memory usage low
model_dbow = Doc2Vec(dm=0, size=100, negative=5, min_count=2, workers=cores, alpha=0.065, min_alpha=0.065)
model_dbow.build_vocab([x for x in tqdm(all_x_w2v)])
model_dbow.train(utils.shuffle([x for x in tqdm(all_x_w2v)]), total_examples=len(all_x_w2v), epochs=1)
def build_doc_Vector(tokens, size):
    # TF-IDF weighted average of the vectors held by the DBOW model
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += model_dbow[word].reshape((1, size)) * tfidf[word]
            count += 1.
        except KeyError:  # word absent from the model or the TF-IDF vocabulary
            continue
    if count != 0:
        vec /= count
    return vec
train_vecs_dbow = np.concatenate([build_doc_Vector(z, 100) for z in tqdm(map(lambda x: x.words, x_train))])
train_vecs_dbow = scale(train_vecs_dbow)
val_vecs_dbow = np.concatenate([build_doc_Vector(z, 100) for z in tqdm(map(lambda x: x.words, x_validation))])
val_vecs_dbow = scale(val_vecs_dbow)
model = Sequential()
model.add(Dense(256, activation='relu', input_dim=100))
model.add(Dense(256, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(train_vecs_dbow, y_train, epochs=100, batch_size=32, verbose=2)
score = model.evaluate(val_vecs_dbow, y_validation, batch_size=128, verbose=2)
print(score[1])
Unfortunately, the accuracy on the validation set turned out to be 61.72%, which is a bit disappointing. But this model is actually faster and consumes less memory, since there is no need to save the word vectors.
Distributed Memory Concatenation - DMC
Now we move on to the Distributed Memory model; we will first try the concatenation method for training.
cores = multiprocessing.cpu_count()
# dm=1 with dm_concat=1 selects the Distributed Memory architecture that concatenates the context vectors
model_dmc = Doc2Vec(dm=1, dm_concat=1, size=100, window=2, negative=5, min_count=2, workers=cores, alpha=0.065, min_alpha=0.065)
model_dmc.build_vocab([x for x in tqdm(all_x_w2v)])
model_dmc.train(utils.shuffle([x for x in tqdm(all_x_w2v)]), total_examples=len(all_x_w2v), epochs=1)
def build_doc_Vector(tokens, size):
    # same TF-IDF weighted averaging as before, this time over the DMC model's vectors
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += model_dmc[word].reshape((1, size)) * tfidf[word]
            count += 1.
        except KeyError:
            continue
    if count != 0:
        vec /= count
    return vec
train_vecs_dmc = np.concatenate([build_doc_Vector(z, 100) for z in tqdm(map(lambda x: x.words, x_train))])
train_vecs_dmc = scale(train_vecs_dmc)
val_vecs_dmc = np.concatenate([build_doc_Vector(z, 100) for z in tqdm(map(lambda x: x.words, x_validation))])
val_vecs_dmc = scale(val_vecs_dmc)
model = Sequential()
model.add(Dense(256, activation='relu', input_dim=100))
model.add(Dense(256, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(train_vecs_dmc, y_train, epochs=100, batch_size=32, verbose=2)
score = model.evaluate(val_vecs_dmc, y_validation, batch_size=128, verbose=2)
print(score[1])
The accuracy on the validation set with the three-layer neural network is 79.44%. It seems like it's doing its job.
Distributed Memory Mean - DMM
We can now try the other training method for the DM model.
cores = multiprocessing.cpu_count()
# dm=1 with dm_mean=1 selects the Distributed Memory architecture that averages the context vectors
model_dmm = Doc2Vec(dm=1, dm_mean=1, size=100, window=4, negative=5, min_count=2, workers=cores, alpha=0.065, min_alpha=0.065)
model_dmm.build_vocab([x for x in tqdm(all_x_w2v)])
model_dmm.train(utils.shuffle([x for x in tqdm(all_x_w2v)]), total_examples=len(all_x_w2v), epochs=1)
def build_doc_Vector(tokens, size):
    # same TF-IDF weighted averaging as before, this time over the DMM model's vectors
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += model_dmm[word].reshape((1, size)) * tfidf[word]
            count += 1.
        except KeyError:
            continue
    if count != 0:
        vec /= count
    return vec
train_vecs_dmm = np.concatenate([build_doc_Vector(z, 100) for z in tqdm(map(lambda x: x.words, x_train))])
train_vecs_dmm = scale(train_vecs_dmm)
val_vecs_dmm = np.concatenate([build_doc_Vector(z, 100) for z in tqdm(map(lambda x: x.words, x_validation))])
val_vecs_dmm = scale(val_vecs_dmm)
model = Sequential()
model.add(Dense(256, activation='relu', input_dim=100))
model.add(Dense(256, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(train_vecs_dmm, y_train, epochs=100, batch_size=32, verbose=2)
score = model.evaluate(val_vecs_dmm, y_validation, batch_size=128, verbose=2)
print(score[1])
The validation set accuracy is 80.24%, slightly better than the DMC model and much better than the DBOW model.
Combined Model DBOW + DMC
In this part, we concatenate the document vectors of the previous doc2vec models to see how this affects the performance. To do so, we define a simple function that concatenates the document vectors from two different models, as shown below.
def get_concat_vectors(model1, model2, corpus, size):
    # each labelled document carries its original DataFrame index in its tag
    # (e.g. 'TRAIN_42'); the doc2vec models were trained with the same indices
    # under the 'ALL' prefix, so we look the vectors up by 'ALL_<index>'
    vecs = np.zeros((len(corpus), size))
    for n, doc in enumerate(corpus):
        tag = 'ALL_' + doc.tags[0].split('_')[-1]
        vecs[n] = np.append(model1.docvecs[tag], model2.docvecs[tag])
    return vecs
train_vecs_dbow_dmc = get_concat_vectors(model_dbow,model_dmc, x_train, 200)
val_vecs_dbow_dmc = get_concat_vectors(model_dbow,model_dmc, x_validation, 200)
model = Sequential()
model.add(Dense(256, activation='relu', input_dim=200))
model.add(Dense(256, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(train_vecs_dbow_dmc, y_train, epochs=100, batch_size=32, verbose=2)
score = model.evaluate(val_vecs_dbow_dmc, y_validation, batch_size=128, verbose=2)
print(score[1])
The accuracy for the DBOW + DMC model is 79.56%, an improvement over both the pure DBOW model and the DMC model. Let's try the combination of DBOW and DMM.
train_vecs_dbow_dmm = get_concat_vectors(model_dbow, model_dmm, x_train, 200)
val_vecs_dbow_dmm = get_concat_vectors(model_dbow, model_dmm, x_validation, 200)
model = Sequential()
model.add(Dense(256, activation='relu', input_dim=200))
model.add(Dense(256, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(train_vecs_dbow_dmm, y_train, epochs=100, batch_size=32, verbose=2)
score = model.evaluate(val_vecs_dbow_dmm, y_validation, batch_size=128, verbose=2)
print(score[1])
The accuracy of this combination is 81.66%, which is better than all the previous models.
A summary of the results of this tutorial is given below.
Model | Accuracy |
---|---|
Word2vec | 72.42% |
DBOW | 61.72% |
DMC | 79.44% |
DMM | 80.24% |
DBOW + DMC | 79.56% |
DBOW + DMM | 81.66% |
In the next part, we will implement and compare Convolutional Neural Network and LSTM models for our sentiment analysis task.
Thanks for reading :)
Oumaima Hourrane
PhD Student at Faculty of Science Ben M'Sik Casablanca