Day 111(DL) — NLP Data Preprocessing — Part 1

In the previous article, we got a grip on regular expressions and lower-case conversion. The next step is to further refine the input text before the actual model training. The next few posts will cover a series of preprocessing steps for the input data.

Table of Contents:

  • Pycontractions
  • Stop words removal
  • Word display using Word cloud

Pycontractions: We usually follow an informal tone while communicating on social media such as Twitter or leaving a review comment on Amazon or other online stores. We tend to use a lot of contractions to keep the text simple & effortless. Some examples of contractions include I’d, I’ll, I’m, etc. But machines will not understand such words directly. We can use the Pycontractions package to expand these contractions into their full forms.

Some examples can be found in the original Pycontractions python package.
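To get an intuition for what the expansion step produces, here is a minimal, self-contained sketch based on a hand-written lookup table. Note this is only an illustration: the toy dictionary below is an assumption for demo purposes, while Pycontractions itself uses word embeddings to disambiguate cases such as “I’d” → “I would” vs “I had”.

```python
# Hypothetical dictionary-based sketch of contraction expansion.
# (Pycontractions uses semantic models to pick among ambiguous
# expansions; this toy mapping simply hard-codes one choice each.)
CONTRACTION_MAP = {
    "i'd": "i would",
    "i'll": "i will",
    "i'm": "i am",
    "can't": "cannot",
    "won't": "will not",
}

def expand_contractions(text):
    # lower-case and split on whitespace, then replace any token
    # found in the mapping with its expanded form
    words = text.lower().split()
    return ' '.join(CONTRACTION_MAP.get(w, w) for w in words)

print(expand_contractions("I'm sure I'll enjoy it"))
# i am sure i will enjoy it
```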

Stop words removal: Stop words are commonly occurring words in text; excluding them will not alter the meaning of a sentence. The nltk library provides a set of predefined stop words. In addition to the standard ones, we can always add extra stop words or remove existing ones depending on the requirement.

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_list = stopwords.words('english')
print(stop_list[:5])
['i', 'me', 'my', 'myself', 'we']
  • As we can see above, the stopwords package returns the list of words that are treated as stop words by default. We may want to remove some of them, for instance ‘no’ and ‘not’: these words convey negation, and eliminating them would change the meaning of the text. In such cases, we can use the list’s remove method to delete words from the list.
remove_list = ['no', 'nor', 'not']
for i in remove_list:
    stop_list.remove(i)
  • The next scenario to consider is when we want to add some new words to the default list. We can use the extend method to achieve this.
print('length of the list before additions:', len(stop_list))
# Include the greetings as well
extra_list = ['sorry', 'please', 'kindly', 'good', 'morning', 'afternoon', 'evening', 'thank', 'thanks']
stop_list.extend(extra_list)
print('length of the list after additions:', len(stop_list))
length of the list before additions: 176
length of the list after additions: 185
  • Now we can take a sample text and filter out the stop words from it using a simple tokenize call.
import nltk
nltk.download('punkt') # download the tokenizer

Text = 'They are very common in conversational spoken English.'

def stopword_remove(x):
    # lower-case each token before checking it against the stop list,
    # so that capitalised words such as 'They' are matched too
    list1 = [word.lower() for word in nltk.word_tokenize(x) if word.lower() not in stop_list]
    return ' '.join(list1)

print('Before removing stop words:', Text)
print('\nAfter removing stop words:', stopword_remove(Text))
Before removing stop words: They are very common in conversational spoken English.

After removing stop words: common conversational spoken english .

As we can notice, after removing the stop words, common words such as they, are, very and in are ruled out.
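Keeping ‘no’ and ‘not’ out of the stop list, as we did earlier, matters for meaning. The following self-contained illustration uses a tiny hand-written stop list (an assumption for the demo, not nltk’s full list) to show how removing ‘not’ can flip the sentiment of a review:

```python
# Tiny hand-picked stop lists to illustrate why negation words
# are worth keeping when the downstream task is sentiment-sensitive.
stop_with_not = {'the', 'was', 'not'}
stop_without_not = {'the', 'was'}

def remove_stops(text, stops):
    # lower-case, split on whitespace, and drop any stop word
    return ' '.join(w for w in text.lower().split() if w not in stops)

review = "The movie was not good"
print(remove_stops(review, stop_with_not))     # -> "movie good" (sentiment flipped!)
print(remove_stops(review, stop_without_not))  # -> "movie not good"
```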

Word display using Word cloud: Sometimes we would like to glance at the high-frequency or low-frequency words to decide which words to retain. For instance, if we are working with IT tickets, the high-frequency words give a clue about the most problematic areas. A word cloud can assist in this visualisation.

Fig 1 — shows output from word cloud
from tensorflow.keras.preprocessing.text import Tokenizer
import matplotlib.pyplot as plt
%matplotlib inline
from wordcloud import WordCloud, STOPWORDS

Let’s consider that the input text file consists of 6 sentences and we’d like to visualize the critical words present in those texts.

def display_words(x, y):
    # x => descriptions, y => how many words to display
    tokenizer = Tokenizer(lower=True, split=' ', char_level=False,
                          oov_token=None, document_count=0)
    tokenizer.fit_on_texts(x)
    word_list = tokenizer.word_index
    print('The length of the word list:', len(word_list))
    # extract the keys (words) and store them in a list
    key_list = list(word_list.keys())
    print(key_list)
    if y > 0:
        words = ' '.join(key_list[:y])
    else:
        words = ' '.join(key_list[y:])
    wordcloud = WordCloud(width=1000, height=500, random_state=1,
                          background_color='salmon', colormap='Pastel1',
                          collocations=False, stopwords=STOPWORDS).generate(words)
    plt.figure(figsize=(40, 30))
    plt.imshow(wordcloud)
    plt.axis('off')

The count of words can always be adjusted according to the use case.

Text = text_df['error_description']
y = 0
display_words(Text, y)
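Since `text_df` comes from the full notebook, here is a lightweight, standard-library-only sketch of the same frequency inspection the word cloud is built on. The ticket descriptions below are made-up sample data, and `Counter` stands in for the ranking that `Tokenizer.word_index` is derived from:

```python
# Standard-library alternative for a quick look at word frequencies
# before plotting: Counter ranks words by occurrence count.
from collections import Counter

# hypothetical ticket descriptions standing in for text_df['error_description']
sentences = [
    "server down again",
    "password reset request",
    "server not responding",
]

# count every whitespace-separated, lower-cased token across all sentences
counts = Counter(w for s in sentences for w in s.lower().split())
print(counts.most_common(3))
# [('server', 2), ('down', 1), ('again', 1)]
```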

The entire code can be found in the Github repository.

Recommended Reading:

https://pypi.org/project/pycontractions/
