This tutorial is the second part of our sentiment analysis task: we are going to compare the word2vec and doc2vec models. Before jumping in, let's give a brief introduction to these two techniques.

Word2vec

Word2vec is a method for estimating word representations in vector space, developed by Mikolov et al. It provides an efficient implementation of the Continuous Bag of Words (CBOW) and Skip-gram models for computing vector representations of words. These are the two main learning algorithms for distributed representations of words, and both are designed to minimize computational complexity.

 

The Continuous Bag of Words (CBOW) model

In this architecture the non-linear hidden layer is removed and the projection layer is shared across all words. The model predicts the current word from the N words before and after it; for example, with N=2 the model looks as shown in figure 1. By ignoring the order of the words in the sequence, CBOW uses the average of the context word embeddings to predict the current word.

The Skip-gram model

Skip-gram is similar to CBOW, but instead of predicting the current word from its context, it tries to maximize the classification of a word based on another word in the same sentence; in other words, it uses the current word to predict its surrounding context. The Skip-gram architecture performs slightly worse than CBOW on the syntactic task, but much better than all the other models on the semantic part of the test.
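As a quick illustration (a minimal sketch on a toy corpus, not part of this tutorial's pipeline), Gensim's Word2Vec exposes both architectures through its sg flag:

from gensim.models.word2vec import Word2Vec

toy_corpus = [['great', 'movie', 'loved', 'it'],
              ['terrible', 'movie', 'hated', 'it']]

# sg=0 (the default) trains the CBOW model: the context words predict the current word.
cbow_model = Word2Vec(toy_corpus, size=50, window=2, min_count=1, sg=0)

# sg=1 trains the Skip-gram model: the current word predicts its context words.
sg_model = Word2Vec(toy_corpus, size=50, window=2, min_count=1, sg=1)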

Doc2vec

Doc2vec is an extended model that goes beyond the word level to achieve document-level representations. It builds on the techniques presented above in order to overcome a limitation of word-level vector representations: the meaning of a document is more than the composition of the meanings of each of its individual words.

Distributed Memory - DM

This model is analogous to the CBOW model in word2vec. The paragraph vectors are obtained by training a neural network on the task of inferring a center word from the context words and a context paragraph vector. Mikolov et al. implemented the DM model in two different ways, either averaging the context vectors or concatenating them (DMM and DMC respectively).

Distributed Bag of Words - DBOW

This model is analogous to the Skip-gram model in word2vec. The paragraph vectors are obtained by training a neural network on the task of predicting a probability distribution of the words in a paragraph, given a randomly sampled word from that paragraph.
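As a quick illustration (again a minimal sketch on a toy corpus, assuming the same Gensim 3.x API used throughout this tutorial), the dm, dm_mean and dm_concat flags of Gensim's Doc2Vec select between these variants:

from gensim.models.doc2vec import Doc2Vec, LabeledSentence

toy_docs = [LabeledSentence(['great', 'movie', 'loved', 'it'], ['DOC_0']),
            LabeledSentence(['terrible', 'movie', 'hated', 'it'], ['DOC_1'])]

dbow = Doc2Vec(toy_docs, dm=0, size=50, min_count=1)               # DBOW
dmm  = Doc2Vec(toy_docs, dm=1, dm_mean=1, size=50, min_count=1)    # DM with averaged context (DMM)
dmc  = Doc2Vec(toy_docs, dm=1, dm_concat=1, size=50, min_count=1)  # DM with concatenated context (DMC)

doc_vec = dbow.docvecs['DOC_0']  # the learned 50-dimensional paragraph vector of the first toy document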

In this tutorial, in order to get the vectors of each row, we will implement and compare the models below using the Python library Gensim:

  • Word2vec
  • DBOW (Distributed Bag of Words)
  • DMC (Distributed Memory Concatenated)
  • DMM (Distributed Memory Mean)
  • DBOW + DMC
  • DBOW + DMM

These models take our dataset as input and produce word vectors as output. They first construct a vocabulary from the training text and then learn vector representations of the words. The resulting vectors are then used as features for the sentiment analysis step, where we train a simple neural network and evaluate the result on the validation set.

Let's begin :)

First, we load the cleaned data (see the previous part here), then split it into a training and a validation set.

In [1]:
import os
import sys
import gensim
import pandas as pd
from gensim.models.doc2vec import LabeledSentence
csv = '~/clean_data.csv'
data = pd.read_csv(csv,index_col=0)
data.head()
 
Out[1]:
  SentimentText Sentiment
0 first think another disney movie might good it... 1
1 put aside dr house repeat missed desperate hou... 0
2 big fan stephen king s work film made even gre... 1
3 watched horrid thing tv needless say one movie... 0
4 truly enjoyed film acting terrific plot jeff c... 1
In [2]:
from sklearn.cross_validation import train_test_split  # moved to sklearn.model_selection in newer scikit-learn versions
SEED = 2000

x_train, x_validation, y_train, y_validation = train_test_split(data.SentimentText, data.Sentiment, test_size=.2, random_state=SEED)
 
 

Next, we label each text with a unique ID using Gensim's LabeledSentence function, as shown below, and then concatenate the training and validation sets for the word-representation step. We can do this because word2vec and doc2vec training is completely unsupervised, so there is no need to hold out any data, as it is unlabelled.

In [3]:
def labelize_text(text,label):
    result = []
    prefix = label
    for i, t in zip(text.index, text):
        result.append(LabeledSentence(t.split(), [prefix + '_%s' % i]))
    return result
  
all_x = pd.concat([x_train,x_validation])

all_x_w2v = labelize_text(all_x, 'ALL')
x_train = labelize_text(x_train, 'TRAIN')
x_validation = labelize_text(x_validation, 'TEST')
 
 

Now let's train our first model, word2vec, on our corpus; we set the size of the output vectors to 200.

In [4]:
from gensim.models.word2vec import Word2Vec
from tqdm import tqdm
from sklearn import utils
import numpy as np

model_w2v = Word2Vec(size=200, min_count=10)
model_w2v.build_vocab([x.words for x in tqdm(all_x_w2v)])
model_w2v.train([x.words for x in tqdm(all_x_w2v)], total_examples=len(all_x_w2v), epochs=1)
 
100%|██████████| 25000/25000 [00:00<00:00, 1422318.68it/s]
100%|██████████| 25000/25000 [00:00<00:00, 1576094.99it/s]
Out[4]:
(3005282, 3344822)
 

After training our model, we can use it to convert words to vectors, as in the example below.

In [5]:
model_w2v['good']
 
Out[5]:
array([-1.20356634e-01,  2.01620758e-01, -1.49855584e-01, -1.40032262e-01,
        1.33756369e-01, -1.46120414e-01,  1.61332116e-02,  3.75851810e-01,
       -3.81415725e-01, -1.04741082e-01,  1.77621603e-01,  2.08926678e-01,
       -2.01129168e-01, -5.88219702e-01, -2.75223494e-01,  2.28985444e-01,
        4.71457034e-01, -2.32448816e-01, -7.47165024e-01,  1.13800988e-01,
       -1.55947626e-01, -1.87482730e-01,  7.93265164e-01, -3.64849031e-01,
        3.63337100e-01,  3.60409796e-01,  3.95269319e-02, -1.90919176e-01,
        1.82287887e-01, -9.35876667e-01, -7.91697577e-02,  1.46627193e-02,
        5.75117290e-01,  5.12551188e-01,  3.68642509e-01,  3.23042035e-01,
       -1.96927309e-01,  9.48887691e-02, -2.07873210e-01, -1.22796237e+00,
       -1.13961753e-03, -6.05956241e-02, -4.02263738e-02,  2.65230179e-01,
        1.41605094e-01,  8.53969678e-02, -1.47209808e-01,  2.22812429e-01,
        6.34986639e-01, -7.47983009e-02, -2.07454100e-01, -6.92057669e-01,
       -1.21512577e-01, -2.98163258e-02, -1.18842252e-01,  1.08654588e-01,
       -1.17953229e+00, -1.78013314e-02, -3.49032879e-01,  2.47875765e-01,
        9.19300437e-01, -2.51706932e-02,  2.98253387e-01, -2.06692412e-01,
        2.39098355e-01,  4.14660394e-01,  1.49376290e-02,  1.99161638e-02,
        9.46666896e-02, -2.44295865e-01,  4.05640274e-01, -1.66489393e-01,
       -1.00734532e+00,  2.12357864e-01,  3.01627457e-01, -7.34570444e-01,
       -3.35306287e-01, -5.84213883e-02,  8.94598007e-01,  6.90925896e-01,
       -7.03696236e-02,  4.94563639e-01, -5.79845190e-01,  3.29751730e-01,
       -4.77250427e-01, -4.38333005e-01,  1.44253075e-01,  2.41361588e-01,
       -4.03879434e-01, -3.68624032e-02, -5.89625299e-01, -7.27992535e-01,
        1.09105863e-01, -6.84686601e-02, -1.13235354e+00,  1.41926037e-04,
        7.32670188e-01, -2.91110814e-01,  7.00395584e-01, -3.52751583e-01,
       -7.85172265e-03,  2.35953540e-01, -4.25253481e-01, -2.87627041e-01,
        1.21491313e+00, -2.05827113e-02, -8.51197958e-01, -7.27003098e-01,
       -5.19853354e-01, -2.25569651e-01,  1.99490264e-01, -7.70654142e-01,
        2.05346152e-01, -3.52195680e-01, -8.41931224e-01,  6.93836868e-01,
       -4.64059502e-01, -3.12139273e-01,  5.14370739e-01,  9.65043545e-01,
        9.32081223e-01,  3.62321973e-01,  7.20789209e-02,  5.82463026e-01,
       -1.88510045e-01, -2.20712021e-01,  2.16216430e-01,  4.91175771e-01,
       -2.81141102e-01, -1.34467721e-01,  1.77699819e-01, -2.13121384e-01,
        5.36062837e-01, -8.24282050e-01, -3.53436202e-01, -4.02124405e-01,
       -4.52920198e-01, -1.71579748e-01, -3.71240944e-01, -4.74526808e-02,
        2.99569786e-01,  1.67528182e-01,  1.84060633e-01,  3.26612294e-01,
        1.27409828e+00,  6.78280830e-01,  1.27044320e-02,  3.81802231e-01,
       -3.38319182e-01,  4.06377822e-01,  6.00214541e-01, -9.16632637e-02,
       -4.34338987e-01, -6.57399833e-01,  8.68327916e-03,  7.25186586e-01,
        3.40244025e-01, -6.44017696e-01,  2.31222928e-01, -7.11220503e-01,
       -6.44924343e-01, -1.98620349e-01, -4.03501421e-01, -6.47172809e-01,
       -6.04047656e-01,  1.18784327e-03, -4.11087126e-01,  4.85154063e-01,
        3.43143463e-01, -1.29660204e-01, -1.14719510e+00,  1.46819621e-01,
       -7.13538647e-01,  4.67051044e-02,  4.81717139e-01,  6.79492712e-01,
        4.21328932e-01,  1.94650739e-01,  3.53093892e-01, -2.64767915e-01,
       -5.88247180e-01,  7.31449008e-01,  5.23238719e-01,  3.84705141e-02,
        9.12800312e-01, -8.49276602e-01,  2.57581752e-02, -5.04338920e-01,
       -2.58153491e-02, -1.44150034e-01,  3.19843054e-01, -5.57117574e-02,
       -1.17708695e+00, -1.28654316e-01,  2.20315352e-01,  1.35768160e-01,
       -5.59179068e-01,  1.27888527e-02,  7.46238708e-01, -1.50875628e-01],
      dtype=float32)
 

We can also use the trained model to retrieve the words most similar to a given word.

In [6]:
model_w2v.most_similar('good')
 
/home/oumaima/anaconda2/envs/nlp/lib/python2.7/site-packages/ipykernel_launcher.py:1: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).
  """Entry point for launching an IPython kernel.
Out[6]:
[('pretty', 0.9664363861083984),
 ('bad', 0.9530471563339233),
 ('overall', 0.9463253617286682),
 ('funny', 0.9402731657028198),
 ('decent', 0.9402391910552979),
 ('terrible', 0.9394140243530273),
 ('awful', 0.9358768463134766),
 ('lulls', 0.9338871240615845),
 ('cheesy', 0.9223247170448303),
 ('horrible', 0.9191421866416931)]
 

Data Viz

We can also project our vocabulary into a vector space model, which embeds words in a continuous vector space where semantically similar words are mapped to nearby points. Below, we visualize the learned vectors by projecting them down to 2 dimensions with the t-SNE dimensionality reduction technique and plotting them with the interactive visualization tool Bokeh.

In [7]:
import bokeh.plotting as bp
from bokeh.models import HoverTool, BoxSelectTool
from bokeh.plotting import figure, show, output_notebook

output_notebook()
plot_tfidf = bp.figure(plot_width=700, plot_height=600, title="A map of 5000 word vectors",
    tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
    x_axis_type=None, y_axis_type=None, min_border=1)

word_vec = [model_w2v[w] for w in model_w2v.wv.vocab.keys()[:5000]]

from sklearn.manifold import TSNE
tsne_model = TSNE(n_components=2, verbose=1, random_state=0)
tsne_w2v = tsne_model.fit_transform(word_vec)

tsne_df = pd.DataFrame(tsne_w2v, columns=['x', 'y'])
tsne_df['words'] = model_w2v.wv.vocab.keys()[:5000]

plot_tfidf.scatter(x='x', y='y', source=tsne_df)
hover = plot_tfidf.select(dict(type=HoverTool))
hover.tooltips={"word": "@words"}
show(plot_tfidf)
 
Loading BokehJS ...
 
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 5000 samples in 0.050s...
[t-SNE] Computed neighbors for 5000 samples in 4.111s...
[t-SNE] Computed conditional probabilities for sample 1000 / 5000
[t-SNE] Computed conditional probabilities for sample 2000 / 5000
[t-SNE] Computed conditional probabilities for sample 3000 / 5000
[t-SNE] Computed conditional probabilities for sample 4000 / 5000
[t-SNE] Computed conditional probabilities for sample 5000 / 5000
[t-SNE] Mean sigma: 0.027020
[t-SNE] KL divergence after 250 iterations with early exaggeration: 71.628098
[t-SNE] Error after 1000 iterations: 1.393853
 
 
 

When we inspect these visualizations (zooming in and out), it becomes apparent that the vectors capture some general, and in fact quite useful, semantic information about words and their relationships to one another. It is very interesting that certain directions in the induced vector space specialize towards certain semantic relationships.
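We can probe those directions directly with Gensim's similarity queries. The snippet below is only a sketch: the exact neighbours it returns depend on this particular training run, and the query words must have survived the min_count=10 filter.

model_w2v.wv.most_similar(positive=['good', 'worst'], negative=['bad'], topn=5)  # vector arithmetic: good - bad + worst
model_w2v.wv.similarity('good', 'great')                                         # cosine similarity between two words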

Sentiment classification

In order to classify the sentiment of the reviews in our data, we have to turn them into vectors as well. The simplest way to represent a sentence is as the sum of its word vectors, ignoring word order. In this tutorial, however, we use a weighted average of the word vectors with their TF-IDF weights, where each weight gives the importance of the word with respect to the corpus and decreases the influence of the most common words. According to Kenter et al., averaging the word embeddings of all words in a text has proven to be a strong baseline or feature across a multitude of NLP tasks.
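Concretely, the build_Word_Vector function defined below computes, for a review with n words found in both vocabularies,

    vec(review) = (1/n) * sum over words w of tfidf[w] * w2v[w]

where tfidf[w] is the word's IDF weight taken from the fitted TfidfVectorizer and w2v[w] is its word2vec vector; words missing from either dictionary are simply skipped.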

Let's build our TF-IDF matrix, then define the function that creates an averaged review vector.

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(analyzer=lambda x: x, min_df=10)
matrix = vectorizer.fit_transform([x.words for x in all_x_w2v])
tfidf = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))

def build_Word_Vector(tokens, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += model_w2v[word].reshape((1, size)) * tfidf[word]
            count += 1.
        except KeyError:  # word missing from the word2vec vocabulary or the tf-idf dictionary
            continue
    if count != 0:
        vec /= count
    return vec
 

We can now convert our training and validation sets into lists of vectors using the previous function. We also scale each column to have zero mean and unit standard deviation. After that, we feed the resulting vectors into our neural network, which consists of three hidden layers with 256 nodes each. Finally, after training, we evaluate it on the validation set.

In [9]:
from sklearn.preprocessing import scale
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

train_vecs_w2v = np.concatenate([build_Word_Vector(z, 200) for z in tqdm(map(lambda x: x.words, x_train))])
train_vecs_w2v = scale(train_vecs_w2v)
val_vecs_w2v = np.concatenate([build_Word_Vector(z, 200) for z in tqdm(map(lambda x: x.words, x_validation))])
val_vecs_w2v = scale(val_vecs_w2v)


model = Sequential()
model.add(Dense(256, activation='relu', input_dim=200))
model.add(Dense(256, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(train_vecs_w2v, y_train, epochs=100, batch_size=32, verbose=2)

score = model.evaluate(val_vecs_w2v, y_validation, batch_size=128, verbose=2)
print score[1]
 
Using TensorFlow backend.
  0%|          | 0/20000 [00:00<?, ?it/s]/home/oumaima/anaconda2/envs/nlp/lib/python2.7/site-packages/ipykernel_launcher.py:12: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
  if sys.path[0] == '':
100%|██████████| 20000/20000 [00:58<00:00, 343.65it/s]
100%|██████████| 5000/5000 [00:14<00:00, 355.32it/s]
 
Epoch 1/100
 - 4s - loss: 0.5844 - acc: 0.6947
Epoch 2/100
 - 3s - loss: 0.5455 - acc: 0.7248
Epoch 3/100
 - 3s - loss: 0.5271 - acc: 0.7378
Epoch 4/100
 - 4s - loss: 0.5204 - acc: 0.7440
Epoch 5/100
 - 4s - loss: 0.5108 - acc: 0.7539
Epoch 6/100
 - 3s - loss: 0.5046 - acc: 0.7583
Epoch 7/100
 - 4s - loss: 0.5021 - acc: 0.7594
Epoch 8/100
 - 4s - loss: 0.4980 - acc: 0.7613
Epoch 9/100
 - 4s - loss: 0.4962 - acc: 0.7618
Epoch 10/100
 - 4s - loss: 0.4905 - acc: 0.7630
Epoch 11/100
 - 4s - loss: 0.4899 - acc: 0.7644
Epoch 12/100
 - 4s - loss: 0.4879 - acc: 0.7661
Epoch 13/100
 - 4s - loss: 0.4845 - acc: 0.7697
Epoch 14/100
 - 4s - loss: 0.4814 - acc: 0.7729
Epoch 15/100
 - 4s - loss: 0.4801 - acc: 0.7740
Epoch 16/100
 - 4s - loss: 0.4753 - acc: 0.7744
Epoch 17/100
 - 4s - loss: 0.4746 - acc: 0.7767
Epoch 18/100
 - 4s - loss: 0.4711 - acc: 0.7784
Epoch 19/100
 - 4s - loss: 0.4686 - acc: 0.7774
Epoch 20/100
 - 4s - loss: 0.4656 - acc: 0.7803
Epoch 21/100
 - 4s - loss: 0.4631 - acc: 0.7807
Epoch 22/100
 - 4s - loss: 0.4619 - acc: 0.7795
Epoch 23/100
 - 4s - loss: 0.4555 - acc: 0.7833
Epoch 24/100
 - 4s - loss: 0.4534 - acc: 0.7844
Epoch 25/100
 - 4s - loss: 0.4524 - acc: 0.7875
Epoch 26/100
 - 4s - loss: 0.4483 - acc: 0.7881
Epoch 27/100
 - 4s - loss: 0.4427 - acc: 0.7896
Epoch 28/100
 - 4s - loss: 0.4359 - acc: 0.7954
Epoch 29/100
 - 4s - loss: 0.4343 - acc: 0.7926
Epoch 30/100
 - 4s - loss: 0.4305 - acc: 0.7964
Epoch 31/100
 - 4s - loss: 0.4260 - acc: 0.7960
Epoch 32/100
 - 4s - loss: 0.4234 - acc: 0.7983
Epoch 33/100
 - 4s - loss: 0.4142 - acc: 0.8048
Epoch 34/100
 - 4s - loss: 0.4110 - acc: 0.8045
Epoch 35/100
 - 4s - loss: 0.4074 - acc: 0.8079
Epoch 36/100
 - 4s - loss: 0.4021 - acc: 0.8091
Epoch 37/100
 - 4s - loss: 0.3996 - acc: 0.8123
Epoch 38/100
 - 4s - loss: 0.3924 - acc: 0.8127
Epoch 39/100
 - 4s - loss: 0.3819 - acc: 0.8208
Epoch 40/100
 - 4s - loss: 0.3804 - acc: 0.8208
Epoch 41/100
 - 4s - loss: 0.3747 - acc: 0.8235
Epoch 42/100
 - 4s - loss: 0.3701 - acc: 0.8274
Epoch 43/100
 - 4s - loss: 0.3637 - acc: 0.8274
Epoch 44/100
 - 4s - loss: 0.3554 - acc: 0.8325
Epoch 45/100
 - 4s - loss: 0.3520 - acc: 0.8348
Epoch 46/100
 - 4s - loss: 0.3467 - acc: 0.8366
Epoch 47/100
 - 4s - loss: 0.3415 - acc: 0.8380
Epoch 48/100
 - 4s - loss: 0.3357 - acc: 0.8424
Epoch 49/100
 - 5s - loss: 0.3289 - acc: 0.8468
Epoch 50/100
 - 4s - loss: 0.3243 - acc: 0.8478
Epoch 51/100
 - 4s - loss: 0.3244 - acc: 0.8468
Epoch 52/100
 - 4s - loss: 0.3185 - acc: 0.8519
Epoch 53/100
 - 4s - loss: 0.3103 - acc: 0.8572
Epoch 54/100
 - 4s - loss: 0.3060 - acc: 0.8550
Epoch 55/100
 - 4s - loss: 0.3067 - acc: 0.8583
Epoch 56/100
 - 4s - loss: 0.2971 - acc: 0.8634
Epoch 57/100
 - 4s - loss: 0.2953 - acc: 0.8635
Epoch 58/100
 - 4s - loss: 0.2799 - acc: 0.8705
Epoch 59/100
 - 4s - loss: 0.2816 - acc: 0.8697
Epoch 60/100
 - 4s - loss: 0.2740 - acc: 0.8732
Epoch 61/100
 - 4s - loss: 0.2880 - acc: 0.8665
Epoch 62/100
 - 4s - loss: 0.2705 - acc: 0.8781
Epoch 63/100
 - 4s - loss: 0.2657 - acc: 0.8793
Epoch 64/100
 - 4s - loss: 0.2768 - acc: 0.8749
Epoch 65/100
 - 4s - loss: 0.2648 - acc: 0.8801
Epoch 66/100
 - 4s - loss: 0.2525 - acc: 0.8840
Epoch 67/100
 - 4s - loss: 0.2554 - acc: 0.8849
Epoch 68/100
 - 4s - loss: 0.2526 - acc: 0.8877
Epoch 69/100
 - 4s - loss: 0.2432 - acc: 0.8896
Epoch 70/100
 - 4s - loss: 0.2483 - acc: 0.8885
Epoch 71/100
 - 5s - loss: 0.2479 - acc: 0.8903
Epoch 72/100
 - 4s - loss: 0.2375 - acc: 0.8935
Epoch 73/100
 - 4s - loss: 0.2365 - acc: 0.8945
Epoch 74/100
 - 4s - loss: 0.2393 - acc: 0.8935
Epoch 75/100
 - 4s - loss: 0.2369 - acc: 0.8962
Epoch 76/100
 - 4s - loss: 0.2226 - acc: 0.9004
Epoch 77/100
 - 4s - loss: 0.2310 - acc: 0.8970
Epoch 78/100
 - 4s - loss: 0.2319 - acc: 0.9000
Epoch 79/100
 - 4s - loss: 0.2223 - acc: 0.9002
Epoch 80/100
 - 4s - loss: 0.2217 - acc: 0.9017
Epoch 81/100
 - 4s - loss: 0.2222 - acc: 0.9026
Epoch 82/100
 - 4s - loss: 0.2155 - acc: 0.9040
Epoch 83/100
 - 4s - loss: 0.1978 - acc: 0.9123
Epoch 84/100
 - 4s - loss: 0.2076 - acc: 0.9091
Epoch 85/100
 - 4s - loss: 0.2231 - acc: 0.9022
Epoch 86/100
 - 4s - loss: 0.2107 - acc: 0.9081
Epoch 87/100
 - 4s - loss: 0.1855 - acc: 0.9167
Epoch 88/100
 - 4s - loss: 0.2082 - acc: 0.9066
Epoch 89/100
 - 4s - loss: 0.1956 - acc: 0.9140
Epoch 90/100
 - 4s - loss: 0.2093 - acc: 0.9103
Epoch 91/100
 - 4s - loss: 0.2057 - acc: 0.9121
Epoch 92/100
 - 4s - loss: 0.1929 - acc: 0.9167
Epoch 93/100
 - 4s - loss: 0.1792 - acc: 0.9222
Epoch 94/100
 - 4s - loss: 0.1978 - acc: 0.9166
Epoch 95/100
 - 4s - loss: 0.1889 - acc: 0.9178
Epoch 96/100
 - 4s - loss: 0.1916 - acc: 0.9167
Epoch 97/100
 - 4s - loss: 0.1930 - acc: 0.9172
Epoch 98/100
 - 4s - loss: 0.1705 - acc: 0.9245
Epoch 99/100
 - 4s - loss: 0.1754 - acc: 0.9234
Epoch 100/100
 - 4s - loss: 0.1958 - acc: 0.9167
0.7242
 

The accuracy of the model using word2vec is 72.42%, which is quite good. Let's compare it with the doc2vec models.

Distributed Bag Of Words - DBOW

In the doc2vec model, a word vector W is generated for each word and a document vector D is generated for each document; the model also trains the weights of a softmax hidden layer. We next follow the same steps as for the previous word2vec model: training, averaging the vectors, and feeding them to the neural network.

In [ ]:
from gensim.models import Doc2Vec
import multiprocessing

cores = multiprocessing.cpu_count()
model_dbow = Doc2Vec(dm=0, size=100, negative=5, min_count=2, workers=cores, alpha=0.065, min_alpha=0.065)
model_dbow.build_vocab([x for x in tqdm(all_x_w2v)])
model_dbow.train(utils.shuffle([x for x in tqdm(all_x_w2v)]), total_examples=len(all_x_w2v), epochs=1)

def build_doc_Vector(tokens, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += model_dbow[word].reshape((1, size)) * tfidf[word]
            count += 1.
        except KeyError: 
            continue
    if count != 0:
        vec /= count
    return vec

train_vecs_dbow = np.concatenate([build_doc_Vector(z, 100) for z in tqdm(map(lambda x: x.words, x_train))])
train_vecs_dbow = scale(train_vecs_dbow)
val_vecs_dbow = np.concatenate([build_doc_Vector(z, 100) for z in tqdm(map(lambda x: x.words, x_validation))])
val_vecs_dbow = scale(val_vecs_dbow)

model = Sequential()
model.add(Dense(256, activation='relu', input_dim=100))
model.add(Dense(256, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(train_vecs_dbow, y_train, epochs=100, batch_size=32, verbose=2)
score = model.evaluate(val_vecs_dbow, y_validation, batch_size=128, verbose=2)

print score[1]
 

Unfortunately, the accuracy on the validation set turned out to be 61.72%, which is a bit disappointing. This model is actually faster and consumes less memory, though, since there is no need to save the word vectors.
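As a side note, since doc2vec learns one paragraph vector per labelled document, an alternative to the TF-IDF weighted word averaging used above would be to read those paragraph vectors directly, or infer one for unseen text. A minimal sketch, assuming the 'ALL_<row index>' tags we assigned with labelize_text:

doc_vec = model_dbow.docvecs['ALL_0']                                 # learned paragraph vector of the document tagged 'ALL_0'
new_vec = model_dbow.infer_vector(['great', 'movie', 'loved', 'it'])  # vector inferred for unseen, pre-tokenized text

Also note that in pure DBOW mode (dm=0), Gensim only updates the word vectors if dbow_words=1 is set, which may partly explain why averaging word vectors from this model underperforms.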

Distributed Memory Concatenation - DMC

Now we move on to the Distributed Memory model; we will first try the concatenation method for training.

In [13]:
cores = multiprocessing.cpu_count()
model_dmc = Doc2Vec(dm=1, dm_concat=1, size=100, window=2, negative=5, min_count=2, workers=cores, alpha=0.065, min_alpha=0.065)
model_dmc.build_vocab([x for x in tqdm(all_x_w2v)])
model_dmc.train(utils.shuffle([x for x in tqdm(all_x_w2v)]), total_examples=len(all_x_w2v), epochs=1)

def build_doc_Vector(tokens, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += model_dmc[word].reshape((1, size)) * tfidf[word]
            count += 1.
        except KeyError: 
            continue
    if count != 0:
        vec /= count
    return vec

  
train_vecs_dmc = np.concatenate([build_doc_Vector(z, 100) for z in tqdm(map(lambda x: x.words, x_train))])
train_vecs_dmc = scale(train_vecs_dmc)


val_vecs_dmc = np.concatenate([build_doc_Vector(z, 100) for z in tqdm(map(lambda x: x.words, x_validation))])
val_vecs_dmc = scale(val_vecs_dmc)

model = Sequential()
model.add(Dense(256, activation='relu', input_dim=100))
model.add(Dense(256, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(train_vecs_dmc, y_train, epochs=100, batch_size=32, verbose=2)
score = model.evaluate(val_vecs_dmc, y_validation, batch_size=128, verbose=2)

print score[1]
 
100%|██████████| 25000/25000 [00:00<00:00, 1489264.16it/s]
100%|██████████| 25000/25000 [00:00<00:00, 1181587.28it/s]
100%|██████████| 20000/20000 [00:40<00:00, 494.09it/s]
100%|██████████| 5000/5000 [00:09<00:00, 503.43it/s]
 
Epoch 1/100
 - 3s - loss: 0.4308 - acc: 0.8052
Epoch 2/100
 - 3s - loss: 0.3864 - acc: 0.8292
Epoch 3/100
 - 3s - loss: 0.3649 - acc: 0.8363
Epoch 4/100
 - 3s - loss: 0.3425 - acc: 0.8456
Epoch 5/100
 - 4s - loss: 0.3095 - acc: 0.8631
Epoch 6/100
 - 3s - loss: 0.2734 - acc: 0.8800
Epoch 7/100
 - 3s - loss: 0.2316 - acc: 0.8982
Epoch 8/100
 - 3s - loss: 0.1847 - acc: 0.9234
Epoch 9/100
 - 3s - loss: 0.1475 - acc: 0.9419
Epoch 10/100
 - 3s - loss: 0.1149 - acc: 0.9530
Epoch 11/100
 - 3s - loss: 0.0953 - acc: 0.9643
Epoch 12/100
 - 3s - loss: 0.0785 - acc: 0.9707
Epoch 13/100
 - 3s - loss: 0.0665 - acc: 0.9752
Epoch 14/100
 - 3s - loss: 0.0628 - acc: 0.9777
Epoch 15/100
 - 3s - loss: 0.0513 - acc: 0.9809
Epoch 16/100
 - 3s - loss: 0.0531 - acc: 0.9806
Epoch 17/100
 - 3s - loss: 0.0390 - acc: 0.9856
Epoch 18/100
 - 3s - loss: 0.0473 - acc: 0.9830
Epoch 19/100
 - 3s - loss: 0.0372 - acc: 0.9875
Epoch 20/100
 - 3s - loss: 0.0446 - acc: 0.9843
Epoch 21/100
 - 3s - loss: 0.0314 - acc: 0.9899
Epoch 22/100
 - 3s - loss: 0.0395 - acc: 0.9855
Epoch 23/100
 - 3s - loss: 0.0333 - acc: 0.9884
Epoch 24/100
 - 3s - loss: 0.0358 - acc: 0.9874
Epoch 25/100
 - 3s - loss: 0.0287 - acc: 0.9900
Epoch 26/100
 - 3s - loss: 0.0298 - acc: 0.9903
Epoch 27/100
 - 3s - loss: 0.0410 - acc: 0.9863
Epoch 28/100
 - 3s - loss: 0.0255 - acc: 0.9914
Epoch 29/100
 - 3s - loss: 0.0213 - acc: 0.9928
Epoch 30/100
 - 3s - loss: 0.0306 - acc: 0.9897
Epoch 31/100
 - 3s - loss: 0.0296 - acc: 0.9902
Epoch 32/100
 - 3s - loss: 0.0197 - acc: 0.9933
Epoch 33/100
 - 3s - loss: 0.0304 - acc: 0.9907
Epoch 34/100
 - 3s - loss: 0.0231 - acc: 0.9928
Epoch 35/100
 - 3s - loss: 0.0192 - acc: 0.9941
Epoch 36/100
 - 3s - loss: 0.0275 - acc: 0.9905
Epoch 37/100
 - 3s - loss: 0.0270 - acc: 0.9907
Epoch 38/100
 - 3s - loss: 0.0193 - acc: 0.9938
Epoch 39/100
 - 3s - loss: 0.0245 - acc: 0.9928
Epoch 40/100
 - 3s - loss: 0.0229 - acc: 0.9924
Epoch 41/100
 - 3s - loss: 0.0181 - acc: 0.9937
Epoch 42/100
 - 3s - loss: 0.0196 - acc: 0.9934
Epoch 43/100
 - 3s - loss: 0.0219 - acc: 0.9932
Epoch 44/100
 - 3s - loss: 0.0283 - acc: 0.9904
Epoch 45/100
 - 3s - loss: 0.0170 - acc: 0.9952
Epoch 46/100
 - 3s - loss: 0.0156 - acc: 0.9949
Epoch 47/100
 - 3s - loss: 0.0228 - acc: 0.9923
Epoch 48/100
 - 3s - loss: 0.0168 - acc: 0.9942
Epoch 49/100
 - 3s - loss: 0.0207 - acc: 0.9935
Epoch 50/100
 - 3s - loss: 0.0202 - acc: 0.9935
Epoch 51/100
 - 3s - loss: 0.0163 - acc: 0.9942
Epoch 52/100
 - 3s - loss: 0.0194 - acc: 0.9938
Epoch 53/100
 - 3s - loss: 0.0193 - acc: 0.9933
Epoch 54/100
 - 3s - loss: 0.0095 - acc: 0.9969
Epoch 55/100
 - 3s - loss: 0.0140 - acc: 0.9951
Epoch 56/100
 - 3s - loss: 0.0233 - acc: 0.9929
Epoch 57/100
 - 3s - loss: 0.0176 - acc: 0.9949
Epoch 58/100
 - 3s - loss: 0.0088 - acc: 0.9970
Epoch 59/100
 - 3s - loss: 0.0162 - acc: 0.9946
Epoch 60/100
 - 3s - loss: 0.0193 - acc: 0.9944
Epoch 61/100
 - 3s - loss: 0.0260 - acc: 0.9918
Epoch 62/100
 - 3s - loss: 0.0102 - acc: 0.9972
Epoch 63/100
 - 3s - loss: 0.0167 - acc: 0.9952
Epoch 64/100
 - 3s - loss: 0.0179 - acc: 0.9944
Epoch 65/100
 - 3s - loss: 0.0109 - acc: 0.9964
Epoch 66/100
 - 3s - loss: 0.0068 - acc: 0.9981
Epoch 67/100
 - 3s - loss: 0.0198 - acc: 0.9936
Epoch 68/100
 - 3s - loss: 0.0235 - acc: 0.9929
Epoch 69/100
 - 3s - loss: 0.0168 - acc: 0.9944
Epoch 70/100
 - 3s - loss: 0.0078 - acc: 0.9977
Epoch 71/100
 - 3s - loss: 0.0113 - acc: 0.9963
Epoch 72/100
 - 3s - loss: 0.0265 - acc: 0.9911
Epoch 73/100
 - 3s - loss: 0.0097 - acc: 0.9970
Epoch 74/100
 - 3s - loss: 0.0108 - acc: 0.9970
Epoch 75/100
 - 3s - loss: 0.0109 - acc: 0.9968
Epoch 76/100
 - 3s - loss: 0.0128 - acc: 0.9959
Epoch 77/100
 - 3s - loss: 0.0244 - acc: 0.9931
Epoch 78/100
 - 3s - loss: 0.0124 - acc: 0.9958
Epoch 79/100
 - 3s - loss: 0.0036 - acc: 0.9988
Epoch 80/100
 - 3s - loss: 0.0121 - acc: 0.9965
Epoch 81/100
 - 3s - loss: 0.0204 - acc: 0.9931
Epoch 82/100
 - 3s - loss: 0.0119 - acc: 0.9964
Epoch 83/100
 - 3s - loss: 0.0112 - acc: 0.9970
Epoch 84/100
 - 3s - loss: 0.0107 - acc: 0.9968
Epoch 85/100
 - 3s - loss: 0.0189 - acc: 0.9945
Epoch 86/100
 - 3s - loss: 0.0119 - acc: 0.9967
Epoch 87/100
 - 3s - loss: 0.0115 - acc: 0.9965
Epoch 88/100
 - 3s - loss: 0.0044 - acc: 0.9990
Epoch 89/100
 - 3s - loss: 0.0088 - acc: 0.9974
Epoch 90/100
 - 3s - loss: 0.0185 - acc: 0.9941
Epoch 91/100
 - 3s - loss: 0.0123 - acc: 0.9963
Epoch 92/100
 - 3s - loss: 0.0092 - acc: 0.9970
Epoch 93/100
 - 3s - loss: 0.0097 - acc: 0.9970
Epoch 94/100
 - 3s - loss: 0.0178 - acc: 0.9950
Epoch 95/100
 - 3s - loss: 0.0116 - acc: 0.9964
Epoch 96/100
 - 3s - loss: 0.0114 - acc: 0.9958
Epoch 97/100
 - 3s - loss: 0.0114 - acc: 0.9968
Epoch 98/100
 - 3s - loss: 0.0089 - acc: 0.9974
Epoch 99/100
 - 3s - loss: 0.0084 - acc: 0.9973
Epoch 100/100
 - 3s - loss: 0.0104 - acc: 0.9974
0.7944
 

The accuracy on the validation set with the 3-layer neural network is 79.44%. It seems like it's doing its job.

Distributed Memory Mean - DMM

We can now try the other method of training the DM model, which averages the context vectors instead of concatenating them.

In [15]:
cores = multiprocessing.cpu_count()
model_dmm = Doc2Vec(dm=1, dm_mean=1, size=100, window=4, negative=5, min_count=2, workers=cores, alpha=0.065, min_alpha=0.065)
model_dmm.build_vocab([x for x in tqdm(all_x_w2v)])
model_dmm.train(utils.shuffle([x for x in tqdm(all_x_w2v)]), total_examples=len(all_x_w2v), epochs=1)

def build_doc_Vector(tokens, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += model_dmm[word].reshape((1, size)) * tfidf[word]
            count += 1.
        except KeyError: 
            continue
    if count != 0:
        vec /= count
    return vec

  
train_vecs_dmm = np.concatenate([build_doc_Vector(z, 100) for z in tqdm(map(lambda x: x.words, x_train))])
train_vecs_dmm = scale(train_vecs_dmm)

val_vecs_dmm = np.concatenate([build_doc_Vector(z, 100) for z in tqdm(map(lambda x: x.words, x_validation))])
val_vecs_dmm = scale(val_vecs_dmm)

model = Sequential()
model.add(Dense(256, activation='relu', input_dim=100))
model.add(Dense(256, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(train_vecs_dmm, y_train, epochs=100, batch_size=32, verbose=2)
score = model.evaluate(val_vecs_dmm, y_validation, batch_size=128, verbose=2)
print score[1]
 
100%|██████████| 25000/25000 [00:00<00:00, 965326.26it/s]
100%|██████████| 25000/25000 [00:00<00:00, 1285019.61it/s]
100%|██████████| 20000/20000 [00:39<00:00, 503.88it/s]
100%|██████████| 5000/5000 [00:09<00:00, 504.96it/s]
 
Epoch 1/100
 - 3s - loss: 0.4003 - acc: 0.8224
Epoch 2/100
 - 3s - loss: 0.3660 - acc: 0.8414
Epoch 3/100
 - 3s - loss: 0.3533 - acc: 0.8469
Epoch 4/100
 - 3s - loss: 0.3348 - acc: 0.8536
Epoch 5/100
 - 3s - loss: 0.3192 - acc: 0.8629
Epoch 6/100
 - 3s - loss: 0.2980 - acc: 0.8705
Epoch 7/100
 - 3s - loss: 0.2745 - acc: 0.8829
Epoch 8/100
 - 3s - loss: 0.2495 - acc: 0.8959
Epoch 9/100
 - 3s - loss: 0.2179 - acc: 0.9085
Epoch 10/100
 - 3s - loss: 0.1896 - acc: 0.9213
Epoch 11/100
 - 3s - loss: 0.1637 - acc: 0.9329
Epoch 12/100
 - 3s - loss: 0.1425 - acc: 0.9421
Epoch 13/100
 - 3s - loss: 0.1188 - acc: 0.9530
Epoch 14/100
 - 3s - loss: 0.1068 - acc: 0.9582
Epoch 15/100
 - 3s - loss: 0.0884 - acc: 0.9659
Epoch 16/100
 - 3s - loss: 0.0827 - acc: 0.9682
Epoch 17/100
 - 3s - loss: 0.0682 - acc: 0.9738
Epoch 18/100
 - 3s - loss: 0.0666 - acc: 0.9756
Epoch 19/100
 - 3s - loss: 0.0628 - acc: 0.9759
Epoch 20/100
 - 3s - loss: 0.0600 - acc: 0.9788
Epoch 21/100
 - 3s - loss: 0.0402 - acc: 0.9857
Epoch 22/100
 - 3s - loss: 0.0603 - acc: 0.9791
Epoch 23/100
 - 3s - loss: 0.0527 - acc: 0.9820
Epoch 24/100
 - 3s - loss: 0.0461 - acc: 0.9841
Epoch 25/100
 - 3s - loss: 0.0436 - acc: 0.9852
Epoch 26/100
 - 3s - loss: 0.0376 - acc: 0.9881
Epoch 27/100
 - 3s - loss: 0.0401 - acc: 0.9869
Epoch 28/100
 - 3s - loss: 0.0464 - acc: 0.9829
Epoch 29/100
 - 3s - loss: 0.0329 - acc: 0.9893
Epoch 30/100
 - 3s - loss: 0.0361 - acc: 0.9882
Epoch 31/100
 - 3s - loss: 0.0324 - acc: 0.9890
Epoch 32/100
 - 3s - loss: 0.0365 - acc: 0.9879
Epoch 33/100
 - 3s - loss: 0.0352 - acc: 0.9883
Epoch 34/100
 - 3s - loss: 0.0310 - acc: 0.9905
Epoch 35/100
 - 3s - loss: 0.0286 - acc: 0.9908
Epoch 36/100
 - 3s - loss: 0.0302 - acc: 0.9896
Epoch 37/100
 - 3s - loss: 0.0318 - acc: 0.9900
Epoch 38/100
 - 3s - loss: 0.0281 - acc: 0.9901
Epoch 39/100
 - 3s - loss: 0.0202 - acc: 0.9931
Epoch 40/100
 - 3s - loss: 0.0440 - acc: 0.9846
Epoch 41/100
 - 3s - loss: 0.0212 - acc: 0.9931
Epoch 42/100
 - 3s - loss: 0.0265 - acc: 0.9909
Epoch 43/100
 - 3s - loss: 0.0235 - acc: 0.9922
Epoch 44/100
 - 3s - loss: 0.0225 - acc: 0.9926
Epoch 45/100
 - 3s - loss: 0.0275 - acc: 0.9909
Epoch 46/100
 - 3s - loss: 0.0263 - acc: 0.9917
Epoch 47/100
 - 3s - loss: 0.0239 - acc: 0.9914
Epoch 48/100
 - 3s - loss: 0.0153 - acc: 0.9954
Epoch 49/100
 - 3s - loss: 0.0347 - acc: 0.9896
Epoch 50/100
 - 3s - loss: 0.0284 - acc: 0.9904
Epoch 51/100
 - 3s - loss: 0.0146 - acc: 0.9951
Epoch 52/100
 - 3s - loss: 0.0173 - acc: 0.9946
Epoch 53/100
 - 3s - loss: 0.0140 - acc: 0.9955
Epoch 54/100
 - 3s - loss: 0.0339 - acc: 0.9892
Epoch 55/100
 - 3s - loss: 0.0241 - acc: 0.9928
Epoch 56/100
 - 3s - loss: 0.0183 - acc: 0.9944
Epoch 57/100
 - 3s - loss: 0.0227 - acc: 0.9926
Epoch 58/100
 - 3s - loss: 0.0250 - acc: 0.9922
Epoch 59/100
 - 3s - loss: 0.0145 - acc: 0.9959
Epoch 60/100
 - 3s - loss: 0.0169 - acc: 0.9949
Epoch 61/100
 - 3s - loss: 0.0198 - acc: 0.9935
Epoch 62/100
 - 3s - loss: 0.0229 - acc: 0.9928
Epoch 63/100
 - 3s - loss: 0.0205 - acc: 0.9936
Epoch 64/100
 - 3s - loss: 0.0116 - acc: 0.9967
Epoch 65/100
 - 3s - loss: 0.0067 - acc: 0.9981
Epoch 66/100
 - 3s - loss: 0.0456 - acc: 0.9865
Epoch 67/100
 - 3s - loss: 0.0109 - acc: 0.9960
Epoch 68/100
 - 3s - loss: 0.0079 - acc: 0.9974
Epoch 69/100
 - 3s - loss: 0.0210 - acc: 0.9938
Epoch 70/100
 - 3s - loss: 0.0225 - acc: 0.9929
Epoch 71/100
 - 3s - loss: 0.0196 - acc: 0.9934
Epoch 72/100
 - 3s - loss: 0.0121 - acc: 0.9960
Epoch 73/100
 - 3s - loss: 0.0109 - acc: 0.9967
Epoch 74/100
 - 3s - loss: 0.0223 - acc: 0.9931
Epoch 75/100
 - 3s - loss: 0.0136 - acc: 0.9947
Epoch 76/100
 - 3s - loss: 0.0126 - acc: 0.9966
Epoch 77/100
 - 3s - loss: 0.0231 - acc: 0.9929
Epoch 78/100
 - 3s - loss: 0.0164 - acc: 0.9958
Epoch 79/100
 - 3s - loss: 0.0106 - acc: 0.9964
Epoch 80/100
 - 3s - loss: 0.0181 - acc: 0.9948
Epoch 81/100
 - 3s - loss: 0.0184 - acc: 0.9946
Epoch 82/100
 - 3s - loss: 0.0157 - acc: 0.9953
Epoch 83/100
 - 3s - loss: 0.0102 - acc: 0.9972
Epoch 84/100
 - 3s - loss: 0.0141 - acc: 0.9960
Epoch 85/100
 - 3s - loss: 0.0234 - acc: 0.9919
Epoch 86/100
 - 3s - loss: 0.0106 - acc: 0.9963
Epoch 87/100
 - 3s - loss: 0.0114 - acc: 0.9968
Epoch 88/100
 - 3s - loss: 0.0271 - acc: 0.9930
Epoch 89/100
 - 3s - loss: 0.0064 - acc: 0.9981
Epoch 90/100
 - 3s - loss: 0.0103 - acc: 0.9971
Epoch 91/100
 - 3s - loss: 0.0223 - acc: 0.9937
Epoch 92/100
 - 3s - loss: 0.0152 - acc: 0.9951
Epoch 93/100
 - 3s - loss: 0.0076 - acc: 0.9979
Epoch 94/100
 - 3s - loss: 0.0101 - acc: 0.9968
Epoch 95/100
 - 3s - loss: 0.0188 - acc: 0.9943
Epoch 96/100
 - 3s - loss: 0.0188 - acc: 0.9948
Epoch 97/100
 - 3s - loss: 0.0106 - acc: 0.9972
Epoch 98/100
 - 3s - loss: 0.0092 - acc: 0.9972
Epoch 99/100
 - 3s - loss: 0.0141 - acc: 0.9956
Epoch 100/100
 - 3s - loss: 0.0184 - acc: 0.9947
0.8024
 

The validation set accuracy is 80.24%, which is better than the DMC model and much better than the DBOW model.

Combined Model DBOW + DMC

In this part, we concatenate the vectors from the previous doc2vec models to see how it affects the performance. We define a simple function that concatenates the document vectors from two different models, as shown below.

In [17]:
def get_concat_vectors(model1,model2, corpus, size):
    vecs = np.zeros((len(corpus), size))
    n = 0
    for i in corpus.index:
        prefix = 'ALL_' + str(i)  # the tags were created with the 'ALL' prefix in labelize_text
        vecs[n] = np.append(model1.docvecs[prefix],model2.docvecs[prefix])
        n += 1
    return vecs
  
train_vecs_dbow_dmc = get_concat_vectors(model_dbow,model_dmc, x_train, 200)
val_vecs_dbow_dmc = get_concat_vectors(model_dbow,model_dmc, x_validation, 200)

model = Sequential()
model.add(Dense(256, activation='relu', input_dim=200))
model.add(Dense(256, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(train_vecs_dbow_dmc, y_train, epochs=100, batch_size=32, verbose=2)
score = model.evaluate(val_vecs_dbow_dmc, y_validation, batch_size=128, verbose=2)

print score[1]
 
100%|██████████| 20000/20000 [01:01<00:00, 327.42it/s]
100%|██████████| 5000/5000 [00:15<00:00, 326.66it/s]
 
Epoch 1/100
 - 3s - loss: 0.4344 - acc: 0.8019
Epoch 2/100
 - 3s - loss: 0.3676 - acc: 0.8389
Epoch 3/100
 - 3s - loss: 0.3224 - acc: 0.8610
Epoch 4/100
 - 3s - loss: 0.2691 - acc: 0.8861
Epoch 5/100
 - 3s - loss: 0.2035 - acc: 0.9159
Epoch 6/100
 - 3s - loss: 0.1435 - acc: 0.9422
Epoch 7/100
 - 3s - loss: 0.1033 - acc: 0.9601
Epoch 8/100
 - 3s - loss: 0.0806 - acc: 0.9693
Epoch 9/100
 - 3s - loss: 0.0632 - acc: 0.9772
Epoch 10/100
 - 3s - loss: 0.0549 - acc: 0.9800
Epoch 11/100
 - 3s - loss: 0.0461 - acc: 0.9829
Epoch 12/100
 - 3s - loss: 0.0397 - acc: 0.9856
Epoch 13/100
 - 3s - loss: 0.0390 - acc: 0.9869
Epoch 14/100
 - 3s - loss: 0.0328 - acc: 0.9890
Epoch 15/100
 - 3s - loss: 0.0346 - acc: 0.9881
Epoch 16/100
 - 3s - loss: 0.0382 - acc: 0.9872
Epoch 17/100
 - 3s - loss: 0.0341 - acc: 0.9874
Epoch 18/100
 - 3s - loss: 0.0336 - acc: 0.9879
Epoch 19/100
 - 3s - loss: 0.0283 - acc: 0.9901
Epoch 20/100
 - 3s - loss: 0.0278 - acc: 0.9902
Epoch 21/100
 - 3s - loss: 0.0246 - acc: 0.9912
Epoch 22/100
 - 3s - loss: 0.0247 - acc: 0.9914
Epoch 23/100
 - 3s - loss: 0.0210 - acc: 0.9931
Epoch 24/100
 - 3s - loss: 0.0231 - acc: 0.9919
Epoch 25/100
 - 3s - loss: 0.0252 - acc: 0.9917
Epoch 26/100
 - 3s - loss: 0.0247 - acc: 0.9922
Epoch 27/100
 - 3s - loss: 0.0221 - acc: 0.9932
Epoch 28/100
 - 3s - loss: 0.0154 - acc: 0.9952
Epoch 29/100
 - 3s - loss: 0.0237 - acc: 0.9925
Epoch 30/100
 - 3s - loss: 0.0202 - acc: 0.9931
Epoch 31/100
 - 3s - loss: 0.0166 - acc: 0.9946
Epoch 32/100
 - 3s - loss: 0.0204 - acc: 0.9931
Epoch 33/100
 - 3s - loss: 0.0214 - acc: 0.9929
Epoch 34/100
 - 3s - loss: 0.0155 - acc: 0.9948
Epoch 35/100
 - 3s - loss: 0.0143 - acc: 0.9949
Epoch 36/100
 - 3s - loss: 0.0181 - acc: 0.9940
Epoch 37/100
 - 3s - loss: 0.0142 - acc: 0.9951
Epoch 38/100
 - 3s - loss: 0.0194 - acc: 0.9936
Epoch 39/100
 - 3s - loss: 0.0218 - acc: 0.9927
Epoch 40/100
 - 3s - loss: 0.0109 - acc: 0.9964
Epoch 41/100
 - 3s - loss: 0.0138 - acc: 0.9955
Epoch 42/100
 - 3s - loss: 0.0113 - acc: 0.9961
Epoch 43/100
 - 3s - loss: 0.0180 - acc: 0.9940
Epoch 44/100
 - 3s - loss: 0.0126 - acc: 0.9955
Epoch 45/100
 - 3s - loss: 0.0182 - acc: 0.9938
Epoch 46/100
 - 3s - loss: 0.0134 - acc: 0.9957
Epoch 47/100
 - 3s - loss: 0.0094 - acc: 0.9970
Epoch 48/100
 - 3s - loss: 0.0147 - acc: 0.9954
Epoch 49/100
 - 3s - loss: 0.0111 - acc: 0.9958
Epoch 50/100
 - 3s - loss: 0.0131 - acc: 0.9963
Epoch 51/100
 - 3s - loss: 0.0158 - acc: 0.9946
Epoch 52/100
 - 3s - loss: 0.0092 - acc: 0.9973
Epoch 53/100
 - 3s - loss: 0.0145 - acc: 0.9956
Epoch 54/100
 - 3s - loss: 0.0127 - acc: 0.9955
Epoch 55/100
 - 3s - loss: 0.0122 - acc: 0.9965
Epoch 56/100
 - 3s - loss: 0.0083 - acc: 0.9978
Epoch 57/100
 - 3s - loss: 0.0172 - acc: 0.9947
Epoch 58/100
 - 3s - loss: 0.0109 - acc: 0.9961
Epoch 59/100
 - 3s - loss: 0.0099 - acc: 0.9969
Epoch 60/100
 - 3s - loss: 0.0122 - acc: 0.9963
Epoch 61/100
 - 3s - loss: 0.0130 - acc: 0.9963
Epoch 62/100
 - 3s - loss: 0.0064 - acc: 0.9978
Epoch 63/100
 - 3s - loss: 0.0124 - acc: 0.9959
Epoch 64/100
 - 3s - loss: 0.0126 - acc: 0.9964
Epoch 65/100
 - 3s - loss: 0.0113 - acc: 0.9961
Epoch 66/100
 - 3s - loss: 0.0066 - acc: 0.9981
Epoch 67/100
 - 3s - loss: 0.0145 - acc: 0.9955
Epoch 68/100
 - 3s - loss: 0.0099 - acc: 0.9973
Epoch 69/100
 - 3s - loss: 0.0068 - acc: 0.9978
Epoch 70/100
 - 3s - loss: 0.0114 - acc: 0.9956
Epoch 71/100
 - 3s - loss: 0.0124 - acc: 0.9962
Epoch 72/100
 - 3s - loss: 0.0108 - acc: 0.9963
Epoch 73/100
 - 3s - loss: 0.0083 - acc: 0.9974
Epoch 74/100
 - 3s - loss: 0.0095 - acc: 0.9971
Epoch 75/100
 - 3s - loss: 0.0073 - acc: 0.9973
Epoch 76/100
 - 3s - loss: 0.0113 - acc: 0.9967
Epoch 77/100
 - 3s - loss: 0.0093 - acc: 0.9977
Epoch 78/100
 - 3s - loss: 0.0105 - acc: 0.9969
Epoch 79/100
 - 3s - loss: 0.0101 - acc: 0.9968
Epoch 80/100
 - 3s - loss: 0.0068 - acc: 0.9983
Epoch 81/100
 - 3s - loss: 0.0043 - acc: 0.9990
Epoch 82/100
 - 3s - loss: 0.0151 - acc: 0.9957
Epoch 83/100
 - 3s - loss: 0.0060 - acc: 0.9977
Epoch 84/100
 - 3s - loss: 0.0082 - acc: 0.9972
Epoch 85/100
 - 3s - loss: 0.0077 - acc: 0.9975
Epoch 86/100
 - 3s - loss: 0.0067 - acc: 0.9978
Epoch 87/100
 - 3s - loss: 0.0100 - acc: 0.9973
Epoch 88/100
 - 3s - loss: 0.0111 - acc: 0.9965
Epoch 89/100
 - 3s - loss: 0.0089 - acc: 0.9974
Epoch 90/100
 - 3s - loss: 0.0074 - acc: 0.9977
Epoch 91/100
 - 3s - loss: 0.0063 - acc: 0.9983
Epoch 92/100
 - 3s - loss: 0.0087 - acc: 0.9969
Epoch 93/100
 - 3s - loss: 0.0050 - acc: 0.9985
Epoch 94/100
 - 3s - loss: 0.0102 - acc: 0.9972
Epoch 95/100
 - 3s - loss: 0.0096 - acc: 0.9970
Epoch 96/100
 - 3s - loss: 0.0054 - acc: 0.9985
Epoch 97/100
 - 3s - loss: 0.0065 - acc: 0.9981
Epoch 98/100
 - 3s - loss: 0.0119 - acc: 0.9964
Epoch 99/100
 - 3s - loss: 0.0058 - acc: 0.9986
Epoch 100/100
 - 3s - loss: 0.0043 - acc: 0.9988
0.7956
 

The accuracy for the DBOW + DMC model is 79.56%, which improves on both the pure DBOW model and the DMC model. Let's try the combination of DBOW and DMM.

In [18]:
train_vecs_dbow_dmm = get_concat_vectors(model_dbow,model_dmm, x_train, 200)
val_vecs_dbow_dmm = get_concat_vectors(model_dbow,model_dmm, x_validation, 200)

model = Sequential()
model.add(Dense(256, activation='relu', input_dim=200))
model.add(Dense(256, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(train_vecs_dbow_dmm, y_train, epochs=100, batch_size=32, verbose=2)
score = model.evaluate(val_vecs_dbow_dmm, y_validation, batch_size=128, verbose=2)

print score[1]
 
100%|██████████| 20000/20000 [01:01<00:00, 325.24it/s]
100%|██████████| 5000/5000 [00:15<00:00, 330.20it/s]
 
Epoch 1/100
 - 3s - loss: 0.4085 - acc: 0.8184
Epoch 2/100
 - 3s - loss: 0.3508 - acc: 0.8492
Epoch 3/100
 - 3s - loss: 0.3138 - acc: 0.8659
Epoch 4/100
 - 3s - loss: 0.2711 - acc: 0.8836
Epoch 5/100
 - 3s - loss: 0.2203 - acc: 0.9073
Epoch 6/100
 - 3s - loss: 0.1682 - acc: 0.9324
Epoch 7/100
 - 3s - loss: 0.1246 - acc: 0.9509
Epoch 8/100
 - 3s - loss: 0.0931 - acc: 0.9634
Epoch 9/100
 - 3s - loss: 0.0759 - acc: 0.9709
Epoch 10/100
 - 3s - loss: 0.0601 - acc: 0.9779
Epoch 11/100
 - 3s - loss: 0.0483 - acc: 0.9823
Epoch 12/100
 - 3s - loss: 0.0475 - acc: 0.9829
Epoch 13/100
 - 3s - loss: 0.0452 - acc: 0.9842
Epoch 14/100
 - 3s - loss: 0.0422 - acc: 0.9852
Epoch 15/100
 - 3s - loss: 0.0295 - acc: 0.9905
Epoch 16/100
 - 3s - loss: 0.0352 - acc: 0.9869
Epoch 17/100
 - 3s - loss: 0.0347 - acc: 0.9879
Epoch 18/100
 - 3s - loss: 0.0345 - acc: 0.9873
Epoch 19/100
 - 3s - loss: 0.0267 - acc: 0.9915
Epoch 20/100
 - 3s - loss: 0.0273 - acc: 0.9907
Epoch 21/100
 - 3s - loss: 0.0282 - acc: 0.9905
Epoch 22/100
 - 3s - loss: 0.0246 - acc: 0.9919
Epoch 23/100
 - 3s - loss: 0.0304 - acc: 0.9899
Epoch 24/100
 - 3s - loss: 0.0199 - acc: 0.9932
Epoch 25/100
 - 3s - loss: 0.0282 - acc: 0.9906
Epoch 26/100
 - 3s - loss: 0.0236 - acc: 0.9930
Epoch 27/100
 - 3s - loss: 0.0178 - acc: 0.9944
Epoch 28/100
 - 3s - loss: 0.0275 - acc: 0.9909
Epoch 29/100
 - 3s - loss: 0.0224 - acc: 0.9926
Epoch 30/100
 - 3s - loss: 0.0151 - acc: 0.9945
Epoch 31/100
 - 3s - loss: 0.0219 - acc: 0.9929
Epoch 32/100
 - 3s - loss: 0.0201 - acc: 0.9936
Epoch 33/100
 - 3s - loss: 0.0227 - acc: 0.9929
Epoch 34/100
 - 3s - loss: 0.0178 - acc: 0.9937
Epoch 35/100
 - 3s - loss: 0.0186 - acc: 0.9931
Epoch 36/100
 - 3s - loss: 0.0191 - acc: 0.9932
Epoch 37/100
 - 3s - loss: 0.0211 - acc: 0.9931
Epoch 38/100
 - 3s - loss: 0.0114 - acc: 0.9959
Epoch 39/100
 - 3s - loss: 0.0164 - acc: 0.9951
Epoch 40/100
 - 3s - loss: 0.0162 - acc: 0.9942
Epoch 41/100
 - 3s - loss: 0.0087 - acc: 0.9972
Epoch 42/100
 - 3s - loss: 0.0224 - acc: 0.9931
Epoch 43/100
 - 3s - loss: 0.0131 - acc: 0.9958
Epoch 44/100
 - 3s - loss: 0.0164 - acc: 0.9944
Epoch 45/100
 - 3s - loss: 0.0143 - acc: 0.9950
Epoch 46/100
 - 4s - loss: 0.0144 - acc: 0.9952
Epoch 47/100
 - 30s - loss: 0.0193 - acc: 0.9940
Epoch 48/100
 - 3s - loss: 0.0136 - acc: 0.9954
Epoch 49/100
 - 3s - loss: 0.0103 - acc: 0.9965
Epoch 50/100
 - 3s - loss: 0.0163 - acc: 0.9949
Epoch 51/100
 - 3s - loss: 0.0171 - acc: 0.9940
Epoch 52/100
 - 3s - loss: 0.0123 - acc: 0.9970
Epoch 53/100
 - 3s - loss: 0.0083 - acc: 0.9972
Epoch 54/100
 - 3s - loss: 0.0147 - acc: 0.9944
Epoch 55/100
 - 3s - loss: 0.0162 - acc: 0.9950
Epoch 56/100
 - 3s - loss: 0.0082 - acc: 0.9970
Epoch 57/100
 - 3s - loss: 0.0102 - acc: 0.9966
Epoch 58/100
 - 3s - loss: 0.0166 - acc: 0.9942
Epoch 59/100
 - 3s - loss: 0.0101 - acc: 0.9970
Epoch 60/100
 - 3s - loss: 0.0110 - acc: 0.9969
Epoch 61/100
 - 3s - loss: 0.0091 - acc: 0.9972
Epoch 62/100
 - 3s - loss: 0.0086 - acc: 0.9974
Epoch 63/100
 - 3s - loss: 0.0144 - acc: 0.9949
Epoch 64/100
 - 3s - loss: 0.0094 - acc: 0.9967
Epoch 65/100
 - 3s - loss: 0.0131 - acc: 0.9955
Epoch 66/100
 - 3s - loss: 0.0096 - acc: 0.9972
Epoch 67/100
 - 3s - loss: 0.0097 - acc: 0.9970
Epoch 68/100
 - 3s - loss: 0.0129 - acc: 0.9965
Epoch 69/100
 - 3s - loss: 0.0081 - acc: 0.9977
Epoch 70/100
 - 3s - loss: 0.0112 - acc: 0.9972
Epoch 71/100
 - 3s - loss: 0.0131 - acc: 0.9960
Epoch 72/100
 - 3s - loss: 0.0116 - acc: 0.9961
Epoch 73/100
 - 3s - loss: 0.0104 - acc: 0.9970
Epoch 74/100
 - 3s - loss: 0.0129 - acc: 0.9963
Epoch 75/100
 - 3s - loss: 0.0120 - acc: 0.9967
Epoch 76/100
 - 3s - loss: 0.0044 - acc: 0.9985
Epoch 77/100
 - 3s - loss: 0.0068 - acc: 0.9981
Epoch 78/100
 - 3s - loss: 0.0131 - acc: 0.9959
Epoch 79/100
 - 3s - loss: 0.0135 - acc: 0.9961
Epoch 80/100
 - 3s - loss: 0.0047 - acc: 0.9986
Epoch 81/100
 - 3s - loss: 0.0085 - acc: 0.9974
Epoch 82/100
 - 3s - loss: 0.0116 - acc: 0.9967
Epoch 83/100
 - 3s - loss: 0.0080 - acc: 0.9976
Epoch 84/100
 - 3s - loss: 0.0087 - acc: 0.9977
Epoch 85/100
 - 3s - loss: 0.0086 - acc: 0.9970
Epoch 86/100
 - 3s - loss: 0.0083 - acc: 0.9973
Epoch 87/100
 - 3s - loss: 0.0086 - acc: 0.9972
Epoch 88/100
 - 3s - loss: 0.0102 - acc: 0.9967
Epoch 89/100
 - 3s - loss: 0.0055 - acc: 0.9982
Epoch 90/100
 - 3s - loss: 0.0063 - acc: 0.9981
Epoch 91/100
 - 3s - loss: 0.0065 - acc: 0.9981
Epoch 92/100
 - 3s - loss: 0.0144 - acc: 0.9956
Epoch 93/100
 - 3s - loss: 0.0053 - acc: 0.9985
Epoch 94/100
 - 3s - loss: 0.0072 - acc: 0.9977
Epoch 95/100
 - 3s - loss: 0.0083 - acc: 0.9976
Epoch 96/100
 - 3s - loss: 0.0070 - acc: 0.9980
Epoch 97/100
 - 3s - loss: 0.0115 - acc: 0.9967
Epoch 98/100
 - 3s - loss: 0.0052 - acc: 0.9985
Epoch 99/100
 - 3s - loss: 0.0090 - acc: 0.9978
Epoch 100/100
 - 3s - loss: 0.0089 - acc: 0.9973
0.8166
 

The accuracy of this combination is 81.66%, which is better than all the previous models.

A summary of the results of this tutorial is given below.

Model         Accuracy
Word2vec      72.42%
DBOW          61.72%
DMC           79.44%
DMM           80.24%
DBOW + DMC    79.56%
DBOW + DMM    81.66%

In the next part, we will implement and compare Convolutional Neural Network and LSTM models for our sentiment analysis task.

Thanks for reading :)

 

Oumaima Hourrane

PhD Student at Faculty of Science Ben M'Sik Casablanca