Usage of the WordNet library: WordNet can be thought of as a dictionary of synonyms. We can leverage this Python package (available through NLTK) to validate whether a word is a proper English word. If a token is junk, it can be ignored, thereby reducing the critical word count.
Let’s consider a sample text,
Text = 'skype problem Requires approval erp details'
We can see that the words ‘skype’ and ‘erp’ are not standard English words but technical terms. Now, we can use WordNet to verify whether a given word belongs to the English language or not.
from nltk.corpus import wordnet

def check_english(x):
    # collect the tokens that WordNet does not recognise as English words
    nonenglish_list = []
    for word in x.split(' '):
        if not wordnet.synsets(word):
            nonenglish_list.append(word)
    return ' '.join(nonenglish_list)
The call wordnet.synsets(word) checks for validity: it returns an empty list when the word is not found in WordNet. When the above lines are executed against the sample text, the words skype and erp get captured in the non-English list and, as expected, those two words are printed in the output.
Lemmatization: As we’ve already discussed, the lemmatization process fetches the root word, downsizing the word count by retaining only the distinct forms. Similar to NLTK, there is another package for NLP requirements, i.e. spaCy. For this experiment, we’ll incorporate the lemmatization functionality from the spaCy package.
For instance, the input text = “sometimes i wonder walking is better than driving”
# use the spaCy lemmatizer
import spacy

nlp = spacy.load('en_core_web_sm')  # the bare 'en' shortcut is deprecated in recent spaCy versions

def spacy_lemma(x):
    doc = nlp(x)
    return ' '.join([token.lemma_ for token in doc])

spacy_lemma(Text)
# sometimes i wonder walk be well than drive
If we observe closely, walking and driving have been replaced with their respective root words walk and drive (and better has been mapped to its lemma well).
The entire code can be found in the GitHub repository.