Word Vectors and NLP Modeling from BoW to BERT

Since the advent of word2vec, neural word embeddings have become a go-to method for encapsulating distributional semantics in text applications. This series will review the strengths and weaknesses of using pre-trained word embeddings and demonstrate how to incorporate more complex semantic representation schemes, such as Semantic Role Labeling, Abstract Meaning Representation and Semantic Dependency Parsing, into your applications.


The last post in this series reviewed some of the recent milestones in neural natural language processing. In this post we will review some of the advancements in text representation.

Computers are unable to understand the concepts behind words. In order to process natural language, a mechanism for representing text is required. The standard mechanism for text representation is the word vector, where words or phrases from a given language vocabulary are mapped to vectors of real numbers.

Traditional Word Vectors

Before diving directly into Word2Vec, it is worthwhile to briefly review some of the traditional methods that pre-date neural embeddings.

Bag of Words (BoW) vector representations are the most commonly used traditional vector representation. Each word or n-gram is linked to a vector index and marked as 0 or 1 depending on whether it occurs in a given document.
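As a rough illustration of the indexing described above, here is a minimal sketch in plain Python. The helper names (build_vocab, bow_vector) and the toy documents are made up for this example, and whitespace tokenization is a simplifying assumption.

```python
def build_vocab(docs):
    """Map each distinct token to a vector index, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def bow_vector(doc, vocab):
    """Binary BoW: 1 if the vocabulary word occurs in the document, else 0."""
    vec = [0] * len(vocab)
    for token in doc.lower().split():
        if token in vocab:
            vec[vocab[token]] = 1
    return vec


# Toy documents (hypothetical data, whitespace-tokenized).
docs = ["the cat sat", "the dog sat down"]
vocab = build_vocab(docs)
vectors = [bow_vector(d, vocab) for d in docs]
```

Each position in a vector corresponds to one vocabulary entry, so the two documents share a 1 only at the indices for "the" and "sat".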

An example of a one-hot bag-of-words representation for documents with one word.

BoW representations are often used in document classification methods, where the frequency of each word, bigram or trigram is a useful feature for training classifiers. One challenge with bag-of-words representations is that they don’t encode any information about the meaning of a given word.

In BoW, word occurrences are weighted equally, regardless of how frequently or in what context they occur. However, in most NLP tasks some words are more relevant than others.

TF-IDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word or n-gram is to a document in a collection or corpus. It weights a given word based on the documents in which it occurs. The TF-IDF value increases proportionally to the number of times a word appears in a document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently than others.
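To make the weighting concrete, here is a minimal sketch of one common TF-IDF variant: relative term frequency multiplied by log(N / df). This is an assumption made for illustration. Libraries typically apply smoothed or normalized versions, and the function name and toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus (hypothetical data).
docs = ["the cat sat on the mat", "the dog sat", "the bird flew"]
weights = tf_idf(docs)
```

Because "the" appears in every document, its inverse document frequency is log(1) = 0 and its weight vanishes, while a rarer word such as "cat" receives a higher weight than "sat", which occurs in two of the three documents.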


However, even though TF-IDF BoW representations assign different weights to different words, they are unable to capture word meaning.

As the famous linguist J. R. Firth said in 1935, “The complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously.”

Distributional embeddings enable word vectors to encapsulate contextual information. Each embedding vector is represented based on the mutual information it has with other words in a given corpus. Mutual information can be represented as a global co-occurrence frequency or restricted to a given window, either sequentially or based on dependency edges.
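The windowed variant can be sketched with raw co-occurrence counts, as below. This is only an illustration: the function name, window size and toy sentences are assumptions, and in practice the counts are usually reweighted (for example with pointwise mutual information) rather than used directly.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count how often each word appears within `window` tokens of another."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


# Toy corpus (hypothetical data).
sentences = ["i like deep learning", "i like nlp", "i enjoy flying"]
counts = cooccurrence_counts(sentences, window=1)

# Each row of the matrix is the distributional vector for one word.
vocab = sorted(counts)
matrix = [[counts[w][c] for c in vocab] for w in vocab]
```

With a symmetric window the resulting matrix is symmetric, and words that occur in similar contexts end up with similar rows.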

An example distributional embedding matrix: each row encodes distributional context based on the counts of the words it co-occurs with.

Distributional vectors predate neural methods for word embeddings, and the techniques surrounding them are still relevant, as they provide insight into interpreting what neural embeddings learn. For more information, see the work of Goldberg and Levy.

Neural Embeddings


Predictive models learn their vectors by minimizing a loss, such as the loss of predicting a target word from the vectors of the surrounding context words.

Word2Vec is a predictive embedding model. There are two main Word2Vec architectures that are used to produce a distributed representation of words:

  • Continuous bag-of-words (CBOW): the model predicts the current word from a window of surrounding context words. The order of context words does not influence prediction (bag-of-words assumption).
  • Continuous skip-gram: the model uses the current word to predict the surrounding window of context words, weighing nearby context words more heavily than more distant ones. While order still is not captured, each context vector is weighed and compared independently, whereas CBOW compares against the averaged context.
CBOW and Skip-Gram Architectures

CBOW is faster while skip-gram is slower but does a better job for infrequent words.
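The difference between the two architectures is easiest to see in the training examples each one consumes. The sketch below only builds those examples and does not train a model. The function names are hypothetical, and the fixed window ignores the distance-based weighting that skip-gram applies in practice.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (center, context) pair per context word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: the whole context window jointly predicts the center word."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, center))
    return examples


tokens = "the quick brown fox".split()
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
```

Skip-gram produces one example per (word, neighbor) pair, while CBOW produces one example per position, which is one intuition for why CBOW trains faster.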


Both CBOW and skip-gram are “predictive” models that only take local contexts into account; Word2Vec does not take advantage of global context. GloVe embeddings, by contrast, leverage the same intuition behind the co-occurrence matrix used in distributional embeddings, but fit a regression model to the global co-occurrence statistics (in effect factorizing the co-occurrence matrix) to produce more expressive, dense word vectors. While GloVe vectors are faster to train, neither GloVe nor Word2Vec has been shown to provide definitively better results; rather, both should be evaluated for a given dataset.


FastText builds on Word2Vec by learning vector representations for each word and for the character n-grams found within each word. The values of the representations are then averaged into one vector at each training step. While this adds a lot of additional computation to training, it enables word embeddings to encode sub-word information. FastText vectors have been shown to be more accurate than Word2Vec vectors by a number of different measures.
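To give a feel for the sub-word units involved, here is a minimal sketch of character n-gram extraction. It assumes the "<" and ">" word-boundary markers and the 3 to 6 character range described in the FastText paper. The function name is hypothetical and this is not the library's own code.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in boundary markers '<' and '>'."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


# Restrict to trigrams to keep the output short.
grams = char_ngrams("where", n_min=3, n_max=3)
```

Because a rare or unseen word still shares n-grams such as "whe" or "ere" with known words, its vector can be composed from sub-word vectors that were seen during training.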

A 10,000 foot overview of Neural NLP Architectures

In addition to better word vector representations, the advent of neural methods has led to advances in machine learning architectures that have enabled the advances listed in the previous post.

This section will highlight some of the key developments in neural architecture that enabled some of the NLP advances seen thus far. This is not meant to be an exhaustive review of deep learning and machine learning NLP architectures; rather, the goal is to demonstrate the changes that are driving NLP forward.

Deep Feed Forward Networks

The advent of deep feed forward networks, also known as multi-layer perceptrons (MLPs), in NLP introduced the potential for non-linear modeling. This development helps with NLP because there are cases where the classes are not linearly separable in the embedding space. Take the following example of documents whose embedding space is non-linear, meaning there is no way to linearly divide the two document groups.

No matter how you fit a line, there is no linear way to split the spam and ham documents.

A non-linear MLP network provides the ability to properly model such non-linearities.
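The classic small example of a problem that no single line can separate is XOR. The sketch below is an illustration rather than a trained model: the weights are picked by hand and a step activation is assumed for readability. It shows how one hidden layer is enough to model it.

```python
def step(x):
    """Threshold activation: 1 if the input is positive, else 0."""
    return 1 if x > 0 else 0


def mlp_xor(x1, x2):
    """Two-layer network with hand-picked weights that computes XOR."""
    h_or = step(x1 + x2 - 0.5)       # fires if at least one input is on
    h_and = step(x1 + x2 - 1.5)      # fires only if both inputs are on
    return step(h_or - h_and - 0.5)  # on for OR but not AND


outputs = [mlp_xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

The hidden units carve the input space with two lines, and the output unit combines them, which is something a single linear classifier cannot do.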


This development by itself, however, did not bring about a significant revolution in NLP, since MLPs are unable to model word ordering. While MLPs open the door for marginal improvements in tasks such as language classification, where decisions can be made by modeling independent character frequencies, standalone MLPs fall short on more complex or ambiguous tasks.


Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification

Prior to their application in NLP, Convolutional Neural Networks (CNNs) provided groundbreaking results in computer vision with the advent of AlexNet. In NLP, instead of convolving over pixels, convolution filters are applied and pooled sequentially over individual word vectors or groups of word vectors.

In NLP, CNNs are able to model local ordering by acting as n-gram feature extractors for embeddings. CNN models have contributed to state-of-the-art results in classification and a variety of other NLP tasks.
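The n-gram feature extractor idea can be sketched with a single filter in plain Python. The embeddings and filter weights below are made-up toy values, and ReLU followed by global max-pooling is one common configuration assumed here for illustration.

```python
def conv1d_max_pool(embeddings, kernel):
    """Slide one filter over consecutive word vectors, apply ReLU, then max-pool."""
    width = len(kernel)
    activations = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        # Dot product between the filter and the window of word vectors.
        score = sum(
            w * x
            for kernel_row, vec in zip(kernel, window)
            for w, x in zip(kernel_row, vec)
        )
        activations.append(max(0.0, score))
    return max(activations), activations


# Hypothetical 2-dimensional embeddings for a 4-word sentence.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# A bigram filter: one weight row per position in the window.
kernel = [[1.0, 0.0], [0.0, 1.0]]
pooled, activations = conv1d_max_pool(sentence, kernel)
```

Each activation scores one bigram, and max-pooling keeps only the strongest match, regardless of where in the sentence it occurred.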

More recently, the work of Jacovi, Sar Shalom and Goldberg has contributed to a deeper understanding of what convolutional filters learn by demonstrating that filters are able to model rich semantic classes of n-grams by using different activation patterns, and that global max-pooling induces behavior which filters out less relevant n-grams from the model’s decision process.



Building on the local ordering provided by CNNs, Recurrent Neural Networks (RNNs) and their gated cell variants, such as Long Short-Term Memory cells (LSTMs) and Gated Recurrent Units (GRUs), provide mechanisms for modeling sequential ordering and mid-range dependencies in text, such as the effect of a word at the beginning of a sentence on the end of the sentence.
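A minimal sketch of the recurrence behind this, assuming a plain (ungated) Elman-style cell with a scalar hidden state and hand-picked weights, looks like this. The gating in LSTMs and GRUs is deliberately left out, and the function name and weights are hypothetical.

```python
import math


def rnn_encode(inputs, w_in=0.5, w_rec=0.9):
    """Run h_t = tanh(w_in * x_t + w_rec * h_{t-1}) and return the final state."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
    return h


# The same values in a different order yield different final states.
early = rnn_encode([1.0, 0.0, 0.0])
late = rnn_encode([0.0, 0.0, 1.0])
```

Unlike a bag-of-words model, the final state depends on where in the sequence the signal appeared, and the signal from an early input fades as it is passed through later steps.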

Additional variations of RNNs, such as bidirectional RNNs, which process text both left to right and right to left, and character-level RNNs for enhancing underrepresented or out-of-vocabulary word embeddings, led to many state-of-the-art neural NLP breakthroughs.

A sample of some different RNN architectures coupled with example use cases.

Attention and Copy Mechanisms

While standard RNN architectures have led to incredible breakthroughs in NLP, they suffer from a variety of challenges. While in theory they can capture long-term dependencies, they tend to struggle with modeling longer sequences; this is still an open problem.

One cause of sub-optimal performance in standard RNN encoder-decoder models for sequence-to-sequence tasks such as NER or translation is that they weight the impact of each input vector evenly on each output vector, when in reality specific words in the input sequence may carry more importance at different time steps.

Attention mechanisms provide a means of weighting the contextual impact of each input vector on each output prediction of the RNN. These mechanisms are responsible for much of the current or near-current state of the art in natural language processing.
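As a rough sketch of the weighting step, the example below scores each encoder state against the current decoder state with a dot product and normalizes the scores with a softmax. Dot-product scoring is one choice among several and is assumed here for simplicity. The vectors and function names are toy values made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    """Dot-product attention: weights over inputs and the weighted context vector."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[k] for w, state in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context


# Hypothetical encoder states for three input words and one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]
weights, context = attend(decoder_state, encoder_states)
```

The weights sum to one, and the inputs that align best with the decoder state dominate the context vector used for the next prediction.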

An example of an attention mechanism applied to the task of neural translation in Microsoft Translator

Additionally, in Machine Reading Comprehension and Summarization systems, RNNs often tend to generate results that, while at first glance look structurally correct, are in reality hallucinated or incorrect. One mechanism that helps mitigate some of these issues is the Copy Mechanism.

Copy Mechanism from Get To The Point: Summarization with Pointer-Generator Networks, Abigail See et al.

The copy mechanism is an additional layer applied during decoding that decides whether it is better to copy the next word from the source sentence or to generate it from the general embedding vocabulary.
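In the pointer-generator formulation, this decision is a soft switch p_gen that mixes the vocabulary distribution with the attention distribution over source words. The sketch below illustrates that mixture with made-up probabilities. The function name and all numbers are hypothetical, and in a real model p_gen, the vocabulary distribution and the attention weights are all predicted by the network.

```python
def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """Mix generating and copying: p_gen * P_vocab(w) + (1 - p_gen) * attention on w."""
    final = {word: p_gen * p for word, p in vocab_dist.items()}
    for token, weight in zip(source_tokens, attention):
        # Attention mass on a source token becomes probability of copying it.
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final


# Toy values: "argentina" is not in the generation vocabulary.
vocab_dist = {"the": 0.5, "won": 0.3, "team": 0.2}
source_tokens = ["argentina", "won", "the", "match"]
attention = [0.7, 0.1, 0.1, 0.1]
dist = final_distribution(0.4, vocab_dist, attention, source_tokens)
best = max(dist, key=dist.get)
```

Even though "argentina" cannot be generated from the vocabulary, the copy path lets it become the most probable next word, which is how the mechanism handles rare names and out-of-vocabulary terms.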

Putting it all together with ELMo and BERT

ELMo is a model that generates embeddings for a word based on the context in which it appears, thus generating slightly different embeddings for each of its occurrences.


For example, the word “play” in the sentence above, when using standard word embeddings, encodes multiple meanings, such as the verb to play or, in the case of the sentence, a theatre production. In standard word embeddings such as GloVe, FastText or Word2Vec, each instance of the word play would have the same representation.

ELMo enables NLP models to better disambiguate between the senses of a given word. On its release, it enabled near-instant state-of-the-art results in many downstream tasks, including tasks such as co-reference that were previously not as viable for practical usage.


ELMo also has promising implications for performing transfer learning on out-of-domain datasets. Some, such as Sebastian Ruder, have even hailed the coming of ELMo as the ImageNet moment of NLP. While ELMo is a very promising development with practical real-world applications, and has spawned recent related techniques such as BERT, which use attention-based transformers instead of bi-directional RNNs to encode context, we will see in our upcoming post that there are still many obstacles in the world of neural NLP.

Comparison of BERT and ELMo architectures from Devlin et al.

Call To Action: Getting Started

Below are some resources to get started with the different word embeddings above.



Open Dataset

Next Post

Now that we have a solid understanding of some of the milestones in neural NLP, as well as the models and representations behind them, the next post will review some of the pitfalls of current state-of-the-art NLP systems.

If you have any questions, comments, or topics you would like me to discuss feel free to follow me on Twitter.

Article written by: Aaron (Ari) Bornstein (source)

About the Author
Aaron (Ari) Bornstein is an avid AI enthusiast with a passion for history, engaging with new technologies and computational medicine. As an Open Source Engineer at Microsoft’s Cloud Developer Advocacy team, he collaborates with the Israeli Hi-Tech Community to solve real-world problems with game-changing technologies that are then documented, open sourced, and shared with the rest of the world.