Day 113 (DL) — NLP Data Preprocessing — word embeddings (part 3)

Today’s post is all about turning words into vectors, i.e. converting words into numbers that computers can work with efficiently. Let’s start with an example:

Text1: The cat chases the rat.

Text2: The dog chases the cat.

We can start by assigning sequence numbers to the words:

The = 1, cat=2, chases=3, rat=4, dog=5

Can we directly replace the words with their respective numbers? Let’s see what happens when we do that:

Text1: 1 2 3 1 4

Text2: 1 5 3 1 2
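
To make this concrete, here is a minimal sketch in plain Python (no libraries assumed) that builds the word-to-number mapping and encodes both example texts exactly as above:

```python
texts = ["The cat chases the rat", "The dog chases the cat"]

# Assign each new word the next available sequence number (starting at 1).
word_to_id = {}
for text in texts:
    for word in text.lower().split():
        if word not in word_to_id:
            word_to_id[word] = len(word_to_id) + 1

# Replace every word with its number.
encoded = [[word_to_id[word] for word in text.lower().split()] for text in texts]

print(word_to_id)  # {'the': 1, 'cat': 2, 'chases': 3, 'rat': 4, 'dog': 5}
print(encoded)     # [[1, 2, 3, 1, 4], [1, 5, 3, 1, 2]]
```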

The problem with naively using these numbers is that they impose an ordering that does not exist: some words get large values and others small ones, as if ‘dog’ (5) were somehow greater than ‘cat’ (2), which is meaningless. Words should be treated as qualitative (categorical) variables, and in such scenarios each word/number is one-hot encoded. Applying one-hot encoding to our example, we have 5 unique words in total.

Fig 1 — one-hot encoded data of the input texts
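
As a rough sketch of what Fig 1 contains, the 5 unique words can be one-hot encoded with a few lines of NumPy (the word-to-id mapping from above is reused, purely for illustration):

```python
import numpy as np

word_to_id = {"the": 1, "cat": 2, "chases": 3, "rat": 4, "dog": 5}
vocab_size = len(word_to_id)

# An identity matrix gives one row per word: a single 1 at the word's
# position and zeros everywhere else.
one_hot = np.eye(vocab_size, dtype=int)

for word, idx in word_to_id.items():
    print(word, one_hot[idx - 1])
# the [1 0 0 0 0]
# cat [0 1 0 0 0]
# chases [0 0 1 0 0]
# rat [0 0 0 1 0]
# dog [0 0 0 0 1]
```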

This technique seems simple and relevant. But imagine applying the same idea to hundreds or thousands of words, or even more: the one-hot matrix explodes in size. A key inefficiency of this method is the sparsity of the data, since each word’s vector has only a single ‘1’ and the rest zeros. What if we could compress this sparse matrix into a dense one that still captures the same meaning? That is the core idea behind word2vec.
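
The compression itself can be pictured with a tiny, purely hypothetical embedding matrix: multiplying a one-hot vector by such a matrix just picks out one row, turning a sparse length-5 vector into a short dense one. This is only an illustration of the lookup, not a trained word2vec model:

```python
import numpy as np

np.random.seed(0)

# Hypothetical dense representations: 5 words, 3 dimensions each.
embedding_matrix = np.random.rand(5, 3)   # shape: (vocab_size, embedding_dim)

cat_one_hot = np.array([0, 1, 0, 0, 0])   # one-hot vector for 'cat' (id 2)
cat_dense = cat_one_hot @ embedding_matrix

print(cat_dense)                                     # a 3-dimensional dense vector
print(np.allclose(cat_dense, embedding_matrix[1]))   # True: the product is just row 2
```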

word2vec (Continuous Bag-of-Words): In CBOW, the model tries to predict a missing word given its context. For instance, take the context ‘The dog **** the cat’. The objective of the model is to predict the missing word. The model is trained on a wide range of contexts until it can predict the target word given its context.

  • Once the model is fully trained, only the layers up to the hidden layer are kept. The idea is that the encoded output of the hidden layer retains the meaning of the word along with its context. This is one way to create the dense matrix that holds the word representations; a short training sketch follows Fig 2 below.
Fig 2 — a CBOW model with only one word in the context — original paper
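
Here is a minimal CBOW training sketch, assuming gensim >= 4 is installed; the parameter names below are gensim’s, and the toy corpus is far too small to learn anything meaningful — it only shows the mechanics:

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "chases", "the", "rat"],
    ["the", "dog", "chases", "the", "cat"],
]

# sg=0 -> CBOW: the surrounding context words predict the centre word.
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=0, epochs=50)

# The dense representation kept after training (the hidden-layer weights):
print(model.wv["cat"])         # a 10-dimensional dense vector
print(model.wv["cat"].shape)   # (10,)
```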

word2vec (Skip-gram Model): The procedure is the reverse of CBOW. Here the target word is fed as the input, and the context words become the objective the model learns to predict. Once the model is trained to predict the surrounding (context) words from the input word, the output layer is cut off and only the intermediate layers are retained for further encoding and processing.

Fig 3 — the Skip-gram model — original source
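
The same gensim call with sg=1 switches to the skip-gram objective, where the centre word is used to predict its surrounding context words; again this is only an API sketch on the toy corpus:

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "chases", "the", "rat"],
    ["the", "dog", "chases", "the", "cat"],
]

# sg=1 -> skip-gram: the centre word predicts its context words.
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=1, epochs=50)

# After training only the word vectors (model.wv) are used; the output
# (prediction) layer is discarded, as described above.
print(model.wv.most_similar("cat", topn=3))
```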

Another popular word embedding is GloVe, which captures context globally through word co-occurrence statistics over the whole corpus. I found this article to contain a simple and clear explanation of GloVe.
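
If you want to try pre-trained GloVe vectors directly, one convenient route is gensim’s downloader; the model name below assumes the standard gensim-data catalogue, and the file is a sizeable download:

```python
import gensim.downloader as api

# Pre-trained 50-dimensional GloVe vectors (Wikipedia + Gigaword).
glove = api.load("glove-wiki-gigaword-50")

print(glove["cat"].shape)                 # (50,)
print(glove.most_similar("cat", topn=3))  # nearest neighbours by cosine similarity
```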

Recommended Reading:

https://arxiv.org/pdf/1301.3781.pdf

https://arxiv.org/pdf/1411.2738.pdf

