This is part of a tutorial series on classifying the sentiment of IMDB movie reviews using machine learning and deep learning techniques. In the previous part (link) of this series, I showed how to obtain word embeddings and classify the sentiment of our corpus using Word2vec and Doc2vec. In this part, I use a one-layer convolutional neural network and compare it with an LSTM at the end of the tutorial.

Convolutional Neural Network

In the previous kernel (link), we aggregated the word vectors of each text using TF-IDF weighting to get a single vector representation per review, which we then fed to a simple neural network. This is not the case for a CNN, where we have to feed the word vectors to the model as a sequence. Keep in mind that a neural network expects all of its inputs to have the same dimensions, while different sentences have different lengths; this is handled below with sequence padding.
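To see concretely what padding does, here is a minimal, self-contained sketch on toy sequences (the numbers are made-up word indices, not taken from our data): pad_sequences left-pads shorter sequences with zeros so that every row ends up with the same length.

from keras.preprocessing.sequence import pad_sequences

# Three toy "sentences" already converted to word indices, with different lengths.
toy_sequences = [[4, 12, 7], [9, 3], [5, 1, 8, 2, 6]]

# Pad (or truncate) every sequence to length 6; shorter ones are left-padded with zeros.
print(pad_sequences(toy_sequences, maxlen=6))
# roughly:
# [[ 0  0  0  4 12  7]
#  [ 0  0  0  0  9  3]
#  [ 0  5  1  8  2  6]]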

Let's begin by loading our cleaned data (see the data pre-processing in this post) and splitting it into training and validation sets.

In [1]:
import os
import sys
import pandas as pd

csv = '~/clean_data.csv'
data = pd.read_csv(csv,index_col=0, encoding='latin-1')
data.head()
Out[1]:
  SentimentText Sentiment
0 first think another disney movie might good it... 1
1 put aside dr house repeat missed desperate hou... 0
2 big fan stephen king s work film made even gre... 1
3 watched horrid thing tv needless say one movie... 0
4 truly enjoyed film acting terrific plot jeff c... 1
In [2]:
from sklearn.model_selection import train_test_split

SEED = 2000

x_train, x_validation, y_train, y_validation = train_test_split(data.SentimentText, data.Sentiment, test_size=.2, random_state=SEED)
 

Next, we use the Keras Tokenizer to split each sentence into words. Then, to get a sequential representation of each row, we use the texts_to_sequences method. After that, we check the first five entries and their sequential representations, where each word is represented by a number.

In [3]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=100000)
tokenizer.fit_on_texts(x_train)
sequences = tokenizer.texts_to_sequences(x_train)

for x in x_train[:5]:
    print(x)
sequences[:5]
 
Using TensorFlow backend.
 
ballad django meandering mess movie spaghetti western simply collection scenes and much better films supposedly tied together django telling brought different outlaws hunt powers john cameron brings nothing role django skip one unless every django movie made even may good enough excuse see one
film really terrible terrible waste minutes life special effects terrible acting was not convincing its crocodile attack view tourists filming documentary blood surfing blood surfing surf around sharks turns terrible wrong foot crocodile interrupts holiday sharks do not look real crocodile even worse gets even pathetic running away form creature crocodile gets stuck females flash it deaths fake pirates fill time a pointless terrible film thats worth seeing
like many americans first introduced works hayao miyazaki saw spirited away fell love film seen many times search see every film miyazaki one earlier works castle sky although it s still enjoyable it s good spirited away though comparing film masterpiece perhaps unfair a young boy named pazu james van der beek working mine late one night sees girl fall slowly sky wakes next morning introduces sheeta anna paquin sheeta secret knows it pazu pulled adventure lead danger pirates army lost floating city going film hayao miyazaki means expect one thing sense wonder magic many filmmakers tried one create sense magic awe like miyazaki watching film miyazaki like experiencing fantastic dream childhood because film animated dubbing film pose much problem next impossible determine whether lip movements match words also helps translated dialogue well written voiced talented actors voice acting varied james van der beek fares best brings irresistible enthusiasm excitement role pazu perfect character anna paquin nearly good sheeta she s frightened events going around her knows do mark hamill unrecognizable evil muska he s dangerous wants something sheeta anything get it voices bad cloris leachman awful dola leachman may oscar the last picture show she s annoying pirate leader leachman gives character obnoxious squawk that s nearly always monotonous it s bad nearly ruins film jim cummings effective voice over actor he s miscast general i would definitely recommend seeing castle sky i ll probably end buying myself even though it s good spirited away it s still pretty good
remember loving show kid thought helicopter coolest thing i ve seen ultra high tech it s time could repel enemy fire sorts acrobatics air take nearly anything it s way go back watch today surprised lousy show really is casts members hardly compelling lot cheesy moments fight scenes incredibly fake looking nearly every ending helicopter fighting crap obvious reuse grainy low quality stock footage lot footages appear date vietnam war era airwolf basically theme knight rider except crime fighting vehicle choice helicopter instead car watching episodes found utterly bored do however love theme music
far cinematography goes film pretty good mid s times lighting way hot shots generally frame stayed focus acting average low budget stinker direction horrible several scenes dragged way long attempt suspense effects non existent attack skull pond completely removed final cut every attempt bring life skull obvious stick pokes strings also could not help think budget did not allow furnish house kept making references movers things storage coming soon honestly it would entertaining worse movie was not bad enough good bad movie was not good enough good either get mst k version it s fun
Out[3]:
[[14360,
  15040,
  6080,
  842,
  2,
  6358,
  925,
  249,
  1513,
  61,
  56,
  18,
  54,
  36,
  1415,
  3022,
  208,
  15040,
  909,
  762,
  193,
  10742,
  2115,
  1674,
  220,
  3327,
  871,
  83,
  129,
  15040,
  1659,
  6,
  833,
  92,
  15040,
  2,
  27,
  13,
  118,
  9,
  113,
  1265,
  16,
  6],
 [3,
  15,
  312,
  312,
  368,
  154,
  41,
  226,
  212,
  312,
  45,
  106,
  4,
  1001,
  2200,
  4629,
  1184,
  595,
  8722,
  1322,
  564,
  455,
  4441,
  455,
  4441,
  7552,
  104,
  7825,
  427,
  312,
  276,
  1985,
  4629,
  16671,
  3043,
  7825,
  19,
  4,
  85,
  69,
  4629,
  13,
  356,
  131,
  13,
  1181,
  571,
  160,
  751,
  1514,
  4629,
  131,
  1504,
  5203,
  3065,
  5,
  2348,
  1147,
  6587,
  2074,
  12,
  124,
  1106,
  312,
  3,
  1468,
  213,
  237],
 [7,
  40,
  1408,
  25,
  1660,
  422,
  15826,
  4835,
  135,
  3399,
  160,
  1485,
  43,
  3,
  38,
  40,
  127,
  1686,
  16,
  92,
  3,
  4835,
  6,
  825,
  422,
  1609,
  1602,
  179,
  5,
  1,
  58,
  643,
  5,
  1,
  9,
  3399,
  160,
  73,
  4347,
  3,
  902,
  306,
  5398,
  124,
  98,
  321,
  682,
  10437,
  507,
  1112,
  4787,
  16672,
  683,
  1824,
  447,
  6,
  225,
  1017,
  155,
  731,
  1245,
  1602,
  5092,
  288,
  2048,
  4392,
  10438,
  1870,
  12236,
  10438,
  917,
  617,
  5,
  10437,
  1871,
  1110,
  406,
  2337,
  6587,
  1171,
  334,
  4442,
  461,
  88,
  3,
  15826,
  4835,
  733,
  443,
  6,
  76,
  203,
  501,
  1094,
  40,
  955,
  702,
  6,
  892,
  203,
  1094,
  3862,
  7,
  4835,
  70,
  3,
  4835,
  7,
  6470,
  684,
  809,
  1469,
  1654,
  3,
  1042,
  3006,
  3,
  6806,
  18,
  359,
  288,
  1062,
  7987,
  655,
  5204,
  4229,
  941,
  629,
  23,
  1481,
  5093,
  333,
  17,
  319,
  4275,
  948,
  75,
  451,
  45,
  7052,
  507,
  1112,
  4787,
  16672,
  15041,
  44,
  871,
  8723,
  4577,
  2252,
  129,
  10437,
  320,
  33,
  1870,
  12236,
  676,
  9,
  10438,
  248,
  1,
  5014,
  621,
  88,
  104,
  252,
  617,
  19,
  850,
  8927,
  12237,
  365,
  18738,
  91,
  1,
  1703,
  402,
  65,
  10438,
  144,
  22,
  5,
  2116,
  24,
  12238,
  11421,
  292,
  17647,
  11421,
  118,
  647,
  10,
  145,
  357,
  49,
  248,
  1,
  548,
  6251,
  2014,
  11421,
  326,
  33,
  2844,
  42404,
  34,
  1,
  676,
  130,
  8724,
  5,
  1,
  24,
  676,
  5015,
  3,
  1135,
  16673,
  1076,
  451,
  412,
  188,
  91,
  1,
  3125,
  753,
  8,
  11,
  322,
  296,
  237,
  1609,
  1602,
  8,
  151,
  156,
  57,
  2595,
  1729,
  13,
  73,
  5,
  1,
  9,
  3399,
  160,
  5,
  1,
  58,
  100,
  9],
 [307,
  1631,
  49,
  450,
  115,
  4393,
  7420,
  76,
  8,
  64,
  38,
  3261,
  231,
  5205,
  5,
  1,
  12,
  20,
  29209,
  2553,
  866,
  2508,
  16674,
  841,
  110,
  676,
  144,
  5,
  1,
  28,
  62,
  68,
  35,
  431,
  719,
  2216,
  49,
  15,
  48,
  6684,
  1010,
  897,
  1379,
  90,
  862,
  311,
  470,
  61,
  881,
  1147,
  177,
  676,
  92,
  196,
  4393,
  930,
  508,
  499,
  25930,
  5541,
  282,
  414,
  1986,
  836,
  90,
  23509,
  867,
  1246,
  2802,
  250,
  860,
  25931,
  610,
  664,
  5468,
  6081,
  467,
  725,
  930,
  2059,
  1043,
  4393,
  221,
  448,
  70,
  576,
  172,
  1195,
  1012,
  19,
  108,
  43,
  664,
  141],
 [146,
  532,
  184,
  3,
  100,
  9,
  1632,
  1,
  127,
  1490,
  28,
  787,
  566,
  1130,
  2088,
  2554,
  1073,
  45,
  763,
  282,
  275,
  4048,
  393,
  442,
  377,
  61,
  2985,
  28,
  114,
  492,
  736,
  212,
  613,
  2926,
  1184,
  5337,
  8152,
  260,
  3720,
  396,
  518,
  92,
  492,
  665,
  41,
  5337,
  499,
  1156,
  9362,
  5542,
  23,
  20,
  4,
  257,
  31,
  275,
  63,
  4,
  1643,
  29210,
  229,
  712,
  149,
  1987,
  29211,
  102,
  13186,
  519,
  454,
  1177,
  5,
  11,
  371,
  356,
  2,
  106,
  4,
  24,
  113,
  9,
  24,
  2,
  106,
  4,
  9,
  113,
  9,
  268,
  22,
  3044,
  1409,
  233,
  5,
  1,
  169]]
 

Now we can check the maximum length of the reviews in our corpus in order to choose a padding length.

In [4]:
length = []
for x in x_train:
    length.append(len(x.split()))
max(length)
Out[4]:
974
 

The longest review in the training set has 974 words, so we round up and set the padding length to 1,000.
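If you want a bit more detail than the single maximum, here is a small optional sketch (assuming NumPy is available and reusing the length list computed above) that looks at the distribution of review lengths; it confirms that a cap of 1,000 comfortably covers the whole training set:

import numpy as np

lengths = np.array(length)  # review lengths computed in the previous cell
print("mean length: %.1f" % lengths.mean())
print("95th percentile: %.0f" % np.percentile(lengths, 95))
print("longest review: %d" % lengths.max())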

In [5]:
x_train_seq = pad_sequences(sequences, maxlen=1000)
x_train_seq[:5]
Out[5]:
array([[   0,    0,    0, ..., 1265,   16,    6],
       [   0,    0,    0, ..., 1468,  213,  237],
       [   0,    0,    0, ...,   58,  100,    9],
       [   0,    0,    0, ...,   43,  664,  141],
       [   0,    0,    0, ...,    5,    1,  169]], dtype=int32)
 

After padding, all the training data has been transformed to the same length of 1,000.
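A quick way to confirm this (a one-line check that is not part of the original notebook output) is to inspect the array shape:

print(x_train_seq.shape)  # expect (20000, 1000): 20,000 training reviews, each padded to 1,000 tokens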

We do the same thing for the validation set.

In [6]:
sequences_val = tokenizer.texts_to_sequences(x_validation)
x_val_seq = pad_sequences(sequences_val, maxlen=1000)
 

Next, we define the CNN. The input goes through an embedding layer with a vocabulary of 100,000 words (the max features), an embedding dimension of 100, and an input length of 1,000, so each review becomes a 1000x100 matrix. This feeds a 1D convolutional layer with 100 filters of kernel size 2, followed by a global max pooling layer that extracts the maximum value from each filter. The pooling output is therefore a one-dimensional vector whose length equals the number of filters; it is then passed through a dense layer and a sigmoid output unit.

In [9]:
from keras.models import Sequential
from keras.layers import Dense, Conv1D, GlobalMaxPooling1D
from keras.layers.embeddings import Embedding

model_cnn = Sequential()

e = Embedding(100000, 100, input_length=1000)
model_cnn.add(e)
model_cnn.add(Conv1D(filters=100, kernel_size=2, padding='valid', activation='relu', strides=1))
model_cnn.add(GlobalMaxPooling1D())
model_cnn.add(Dense(256, activation='relu'))
model_cnn.add(Dense(1, activation='sigmoid'))
model_cnn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_cnn.fit(x_train_seq, y_train, validation_data=(x_val_seq, y_validation), epochs=5, batch_size=32, verbose=2)
score,acc = model_cnn.evaluate(x_val_seq, y_validation, verbose = 2, batch_size = 32)
print("score: %.2f" % (score))
print("acc: %.2f" % (acc))
 
Train on 20000 samples, validate on 5000 samples
Epoch 1/5
 - 348s - loss: 0.3824 - acc: 0.8200 - val_loss: 0.3258 - val_acc: 0.8558
Epoch 2/5
 - 348s - loss: 0.1290 - acc: 0.9525 - val_loss: 0.3160 - val_acc: 0.8774
Epoch 3/5
 - 344s - loss: 0.0186 - acc: 0.9961 - val_loss: 0.4061 - val_acc: 0.8758
Epoch 4/5
 - 349s - loss: 0.0015 - acc: 1.0000 - val_loss: 0.4259 - val_acc: 0.8834
Epoch 5/5
 - 345s - loss: 2.4182e-04 - acc: 1.0000 - val_loss: 0.4442 - val_acc: 0.8846
score: 0.44
acc: 0.88
 

After roughly half an hour of training (five epochs at about 350 seconds each), we reach about 88% validation accuracy, which is a better result than all the models I ran in the previous kernels.
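As a sanity check, the trained CNN can be used to score a new review. The snippet below is a minimal sketch, not part of the original notebook: the review text is my own, and in practice new text should first go through the same cleaning pipeline as the training data.

# Tokenize and pad the new review exactly like the training data, then predict.
new_review = ["one best movies seen year truly wonderful acting"]
new_seq = pad_sequences(tokenizer.texts_to_sequences(new_review), maxlen=1000)

prob = model_cnn.predict(new_seq)[0][0]  # sigmoid output: probability of a positive review
print("positive probability: %.3f" % prob)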

Long Short-Term Memory

Let's now try another model, an LSTM, and compare it with the previous CNN. We use a single LSTM layer preceded by an embedding layer with a vocabulary of 100,000 words (the max features) and 128 dimensions per word in the sequence, followed by a single-unit dense layer with a softmax activation.

In [17]:
#LSTM

from keras.layers import SpatialDropout1D, LSTM, Dropout

model_lstm = Sequential()

model_lstm.add(Embedding(100000, 128))
model_lstm.add(SpatialDropout1D(0.4))
model_lstm.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model_lstm.add(Dense(1,activation='softmax'))
model_lstm.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model_lstm.summary())
model_lstm.fit(x_train_seq, y_train, epochs = 7, batch_size=32, verbose = 2)
score,acc = model_lstm.evaluate(x_val_seq, y_validation, verbose = 2, batch_size = 32)
print("score: %.2f" % (score))
print("acc: %.2f" % (acc))
 
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_9 (Embedding)      (None, None, 128)         12800000  
_________________________________________________________________
spatial_dropout1d_7 (Spatial (None, None, 128)         0         
_________________________________________________________________
lstm_7 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dense_9 (Dense)              (None, 1)                 129       
=================================================================
Total params: 12,931,713
Trainable params: 12,931,713
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/7
 - 1265s - loss: 7.9592 - acc: 0.5007
Epoch 2/7
 - 1216s - loss: 7.9592 - acc: 0.5007
Epoch 3/7
 - 1210s - loss: 7.9592 - acc: 0.5007
Epoch 4/7
 - 1215s - loss: 7.9592 - acc: 0.5007
Epoch 5/7
 - 1313s - loss: 7.9592 - acc: 0.5007
Epoch 6/7
 - 1435s - loss: 7.9592 - acc: 0.5007
Epoch 7/7
 - 1279s - loss: 7.9592 - acc: 0.5007
score: 8.02
acc: 0.50
 

The training and validation results give us poor accuracy compared to the CNN, and the model was much slower as well. The accuracy stuck at 0.5007 across all epochs is a strong hint that the network is not learning at all: a softmax over a single output unit always produces 1, so a one-unit output trained with binary cross-entropy needs a sigmoid activation instead, as in the CNN above. Beyond that fix, the results can be improved by tuning the hyperparameters, and training can be made faster by combining the LSTM with other architectures, including the CNN itself.
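For reference, here is a minimal corrected sketch of the same LSTM with the output activation swapped to sigmoid (everything else is kept as in the cell above; the number of epochs is my own choice, and no results are claimed here):

# Same single-layer LSTM, but with a sigmoid output, which is the appropriate
# activation for a single-unit binary classifier trained with binary cross-entropy.
model_lstm_fixed = Sequential()
model_lstm_fixed.add(Embedding(100000, 128, input_length=1000))
model_lstm_fixed.add(SpatialDropout1D(0.4))
model_lstm_fixed.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model_lstm_fixed.add(Dense(1, activation='sigmoid'))
model_lstm_fixed.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_lstm_fixed.fit(x_train_seq, y_train, validation_data=(x_val_seq, y_validation), epochs=3, batch_size=32, verbose=2)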

Thanks for reading :)

 

Oumaima Hourrane

Ph.D. Student at Faculty of Science Ben M'Sik Casablanca