This is part of a tutorial series on classifying the sentiment of IMDB movie reviews using machine learning and deep learning techniques. In the last part (link) of this series, I showed how we can obtain word embeddings and classify the sentiment of our corpus using Word2vec and Doc2vec. In this part, I use a one-layer convolutional neural network and compare it with an LSTM at the end of this tutorial.
Convolutional Neural Network
In the previous kernel (link), we aggregated the word vectors of each text using TF-IDF weighting to get one vector representation per text, in order to feed a simple neural network. This is not the case for a CNN, where we have to feed the word vectors to the model as a sequence. We should also keep in mind that a neural network expects all of its inputs to have the same dimension, while different sentences have different lengths; this is handled below with sequence padding.
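As a quick illustration, here is a minimal sketch (on made-up toy sequences, not our data) of how the Keras pad_sequences function turns variable-length sequences into one fixed-length matrix:
from keras.preprocessing.sequence import pad_sequences
# Three toy "sentences" already mapped to word indices, each with a different length.
toy_sequences = [[4, 10, 2], [7, 1], [3, 5, 8, 9, 6]]
# Every sequence is padded (or truncated) to length 5; shorter ones are pre-padded with zeros by default.
print(pad_sequences(toy_sequences, maxlen=5))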
Let's begin by loading our cleaned data (see the data pre-processing in this post) and splitting it into a training and a validation set.
import os
import sys
import pandas as pd
csv = '~/clean_data.csv'
data = pd.read_csv(csv,index_col=0, encoding='latin-1')
data.head()
from sklearn.model_selection import train_test_split
SEED = 2000
x_train, x_validation, y_train, y_validation = train_test_split(data.SentimentText, data.Sentiment, test_size=.2, random_state=SEED)
Next, we will use the Keras Tokenizer to split each sentence into words. Then, in order to get a sequential representation of each row, we use the texts_to_sequences method. After that, we check the first five entries and their sequential representations, where each word is represented by a number.
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer(num_words=100000)
tokenizer.fit_on_texts(x_train)
sequences = tokenizer.texts_to_sequences(x_train)
for x in x_train[:5]:
    print(x)
sequences[:5]
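If you are curious which number was assigned to which word, the fitted tokenizer exposes a word_index dictionary, where index 1 is the most frequent word. A small sketch using the tokenizer defined above:
# Word -> index mapping built by fit_on_texts, ordered by word frequency.
word_index = tokenizer.word_index
print('Vocabulary size:', len(word_index))
# Show the ten most frequent words and their indices.
for word, index in sorted(word_index.items(), key=lambda item: item[1])[:10]:
    print(index, word)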
Now, we can check the maximum length of the rows in our corpus, which we need in order to choose the padding length.
length = []
for x in x_train:
    length.append(len(x.split()))
max(length)
Based on this, we take the maximum sequence length to be 1000.
x_train_seq = pad_sequences(sequences, maxlen=1000)
x_train_seq[:5]
After checking, we can see that all the data has been transformed to have the same length of 1000.
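A quick way to confirm this is to check the shape of the padded array; this sketch simply prints it for the variables defined above:
# Every review is now a row of exactly 1000 word indices (zero-padded at the front).
print(x_train_seq.shape)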
We do the same thing for the validation set.
sequences_val = tokenizer.texts_to_sequences(x_validation)
x_val_seq = pad_sequences(sequences_val, maxlen=1000)
Next, we will define our CNN, using an embedding layer as the input: it maps our vocabulary of 100,000 words (the tokenizer's maximum number of features) to 100-dimensional vectors over an input length of 1000. We then add a 1D convolutional layer with 100 filters and a kernel size of 2, followed by a global max pooling layer, which extracts the maximum value from each filter, so its output is a one-dimensional vector with length equal to the number of filters. Finally, a fully connected layer of 256 units with ReLU activation feeds a single sigmoid output unit for the binary prediction.
from keras.models import Sequential
from keras.layers import Dense, Conv1D, GlobalMaxPooling1D
from keras.layers import Embedding
model_cnn = Sequential()
e = Embedding(100000, 100, input_length=1000)
model_cnn.add(e)
model_cnn.add(Conv1D(filters=100, kernel_size=2, padding='valid', activation='relu', strides=1))
model_cnn.add(GlobalMaxPooling1D())
model_cnn.add(Dense(256, activation='relu'))
model_cnn.add(Dense(1, activation='sigmoid'))
model_cnn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_cnn.fit(x_train_seq, y_train, validation_data=(x_val_seq, y_validation), epochs=5, batch_size=32, verbose=2)
score,acc = model_cnn.evaluate(x_val_seq, y_validation, verbose = 2, batch_size = 32)
print("score: %.2f" % (score))
print("acc: %.2f" % (acc))
After 15 short minutes of training, we get the accuracy shown above, which is better than that of all the models I have run in the previous kernels.
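Once the CNN is trained, we can reuse the same tokenizer and padding to score a new review. Here is a small sketch; the review text below is made up purely for illustration:
# A hypothetical new review: it must go through the same tokenizer and padding as the training data.
new_review = ["this movie was surprisingly good with great acting and a clever plot"]
new_seq = pad_sequences(tokenizer.texts_to_sequences(new_review), maxlen=1000)
# The sigmoid output is the predicted probability of a positive sentiment.
print(model_cnn.predict(new_seq))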
Long Short-Term Memory
Let's now try another model, an LSTM, and compare it with the previous CNN. We will use a single LSTM layer preceded by an embedding layer with a maximum of 100,000 features and 128 dimensions per word (plus some dropout for regularization), and followed by a dense output layer with a sigmoid activation.
#LSTM
from keras.layers import SpatialDropout1D, LSTM, Dropout
model_lstm = Sequential()
model_lstm.add(Embedding(100000, 128))
model_lstm.add(SpatialDropout1D(0.4))
model_lstm.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model_lstm.add(Dense(1, activation='sigmoid'))  # a single-unit binary output needs sigmoid, not softmax
model_lstm.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model_lstm.summary())
model_lstm.fit(x_train_seq, y_train, epochs = 7, batch_size=32, verbose = 2)
score,acc = model_lstm.evaluate(x_val_seq, y_validation, verbose = 2, batch_size = 32)
print("score: %.2f" % (score))
print("acc: %.2f" % (acc))
The result after training and validation gives us a poor accuracy compared to the CNN, and the model was even slower. This could be improved by tuning the hyperparameters, and it could also be made faster by combining the LSTM with another model, including the CNN itself, as sketched below.
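As a rough illustration of that last idea, here is a minimal sketch of a combined model (hyperparameters are illustrative, not tuned): a convolution and max pooling layer placed in front of the LSTM, so the recurrent layer receives a shorter sequence of local features.
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, LSTM, Dense
# CNN + LSTM: the convolution extracts local n-gram features,
# max pooling shortens the sequence, and the LSTM models the remaining order.
model_cnn_lstm = Sequential()
model_cnn_lstm.add(Embedding(100000, 128, input_length=1000))
model_cnn_lstm.add(Conv1D(filters=64, kernel_size=5, padding='valid', activation='relu'))
model_cnn_lstm.add(MaxPooling1D(pool_size=4))
model_cnn_lstm.add(LSTM(100))
model_cnn_lstm.add(Dense(1, activation='sigmoid'))
model_cnn_lstm.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_cnn_lstm.fit(x_train_seq, y_train, validation_data=(x_val_seq, y_validation), epochs=3, batch_size=32, verbose=2)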
Thanks for reading :)
Oumaima Hourrane
Ph.D. Student at Faculty of Science Ben M'Sik Casablanca