So far we’ve discussed image processing and the related field of computer vision at length. It’s time to switch gears to another interesting topic in deep learning: NLP, which stands for Natural Language Processing. The core aim of NLP is to enable computers to understand human language.
NLP applications include chatbots, automatic voice response systems, language translation, speech-to-text conversion, and many more. Popular NLP-powered products that we see in our daily lives include Alexa and Google Translate, to name a few.
Why do computers need to understand language? Customer reviews of online products, Twitter comments, and other communication on social platforms all need interpretation. With the volume of data growing day by day, we need to process the available information quickly for it to improve people’s lives. For instance, instead of searching for jobs manually, a job seeker could use a simple keyword-based NLP model to fetch the available opportunities within a few seconds.
Basic terminology used in NLP:
- Corpus: The entire collection of text data under study. For instance, if a student is working on a project, the complete thesis is the corpus. In the case of product reviews, the entire collection of customer reviews forms the corpus.
- Document: Each individual item in the corpus is referred to as a document. In the product review example, each review is a document.
- Sentence: A document is made up of one or more sentences.
- Tokens (similar to words, with a small difference): The building blocks of sentences. Tokens are like words, but they include numbers as well. A token is the smallest unit that the text is broken into for processing.
- Stopwords: Commonly occurring words (is, are, am, etc.). Removing them usually does not alter the contextual meaning of the sentence.
- POS (Parts of Speech): The grammatical category of a token, such as noun, pronoun, verb, adjective, or adverb.
- Stemming: The process of reducing a token to its root form. For example, the token running becomes run after stemming. Stemming helps to reduce the token count by retaining only the root word. A document may contain runs, running, and run, which all share the root word run. So instead of having 3 token IDs, we could have just 1.
Stemming is implemented with fixed rules, such as removing the suffix ‘ing’ when a word ends with it. Because the rules are rigid, stemming does not always produce a meaningful word, and it generally cannot handle irregular forms such as ran.
- Lemmatization: Lemmatization has the same objective as stemming, but the process is different and more principled. It looks the word up in a vocabulary, often taking its part of speech into account, and replaces it with its dictionary form (the lemma). The procedure is more time-consuming, but the outcome is more accurate than that of a stemmer.
- Tokenization: The process of splitting text into tokens. Since computers cannot work with raw tokens directly, the tokens are then converted into numbers: each unique token is assigned a running sequence number. Some libraries perform both steps together under the name tokenization. For example:
“Wow! This approach is really quick and super smart”
Wow = 1, This = 2, approach = 3, is = 4, really = 5, quick = 6, and = 7, super = 8, smart = 9.
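The terms above can be made concrete with a small sketch in plain Python. It is only an illustration under stated assumptions: the function names (`tokenize`, `build_vocab`, `remove_stopwords`, `naive_stem`) and the tiny stopword set are hypothetical and do not come from any NLP library, and the stemmer is a deliberately crude single rule rather than a real stemming algorithm.

```python
import re

# Hypothetical, deliberately tiny stopword list (real lists are much longer).
STOPWORDS = {"is", "are", "am", "and", "this"}

def tokenize(text):
    """Split raw text into tokens (runs of letters or digits), dropping punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)

def build_vocab(tokens):
    """Give each unique token a running ID, starting at 1, in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab) + 1
    return vocab

def remove_stopwords(tokens):
    """Drop tokens whose lowercase form is in the stopword list."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

def naive_stem(token):
    """Toy suffix rule: strip a trailing 'ing' from words longer than 5 letters."""
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    return token

sentence = "Wow! This approach is really quick and super smart"
tokens = tokenize(sentence)
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]
content_tokens = remove_stopwords(tokens)
stems = [naive_stem(w) for w in ["playing", "running", "ran"]]

print(vocab)           # token -> ID mapping
print(ids)             # the sentence as a sequence of IDs
print(content_tokens)  # tokens left after stopword removal
print(stems)           # results of the toy stemming rule
```

The toy rule turns playing into play but running into runn, and it leaves ran untouched, which shows why rigid suffix rules do not always produce a meaningful root. Production stemmers use many more rules, and lemmatizers rely on a dictionary instead.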
These are a few interesting terms in NLP. We’ll explore more in the upcoming articles.