Day 116(DL) — Implementation of RNN(Sentiment Analysis)

Let’s implement a Sentiment Analysis model that takes financial text and predicts whether it reflects a positive result or not. For our experiment, we can pick the sample financial data available on Kaggle. The sentiment labels include negative, positive and neutral, depending on the financial results. We’ll keep only the negative and positive samples to build a typical binary classification model.

We’ll start with the preprocessing steps:

Step1: Reading the input data with encoding='latin-1'.

import pandas as pd

all_data = pd.read_csv('all-data.csv', encoding='latin-1')

Step2: Displaying the first 5 records of the data.

all_data.head()

Step3: Let’s use a count plot to see how the data is distributed across the sentiments.

import seaborn as sns

sns.countplot(x='Sentiment', data=all_data)

Step4: Let’s drop the texts corresponding to neutral and retain only the positive and negative sentiments.

#copy so that later column assignments don't trigger SettingWithCopyWarning
sentiment_df = all_data[all_data['Sentiment'] != 'neutral'].copy()

Step5: Converting all the text into lowercase.

sentiment_df['Text'] = sentiment_df['Text'].apply(lambda x: x.lower())

Step6: Apply pycontractions to expand any contracted words.

#before applying regular expression, let's expand the contractions
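pycontractions needs a word-embedding model to disambiguate cases, so as a lighter-weight sketch, a small lookup table does the same job for common cases (the mapping below is an assumed subset, not a full list):

```python
# assumed, minimal contraction map for illustration; a real run would use
# a fuller list or a library such as pycontractions
contraction_map = {
    "can't": "cannot",
    "won't": "will not",
    "n't": " not",
    "'re": " are",
    "'ll": " will",
    "'ve": " have",
    "'s": " is",
}

def expand_contractions(text):
    # replace longer patterns first so "can't" wins over "n't"
    for pat in sorted(contraction_map, key=len, reverse=True):
        text = text.replace(pat, contraction_map[pat])
    return text

print(expand_contractions("we can't say it won't work"))
# -> we cannot say it will not work
```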

Step7: We’ll apply punctuation removal, stopword removal and WordNet lemmatization, as per the preprocessing steps we’ve already discussed.
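A self-contained sketch of the punctuation and stopword steps is below; the tiny stopword set is an assumption for illustration, and a real run would use `nltk.corpus.stopwords` plus NLTK's WordNetLemmatizer as in the earlier posts:

```python
import string

# assumed, tiny stopword list for illustration only
stop_words = {'the', 'a', 'an', 'is', 'are', 'of', 'to', 'in'}

def clean_text(text):
    # strip punctuation characters
    text = text.translate(str.maketrans('', '', string.punctuation))
    # drop stopwords
    words = [w for w in text.split() if w not in stop_words]
    return ' '.join(words)

print(clean_text("the profit, in total, is up!"))
# -> profit total up
```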

Step8: We need to decide on the length of the text given to the language model. Since each text varies in length, shorter sentences have to be padded with zeros to reach the full length. The input length cannot vary; it must be kept constant.

#compute the number of words in each text
sentiment_df['length'] = sentiment_df['Text'].apply(lambda x: len(x.split(' ')))

print('maximum length of the text:', sentiment_df['length'].max())
print('minimum length of the text:', sentiment_df['length'].min())

We can remove the records with length less than 3.

final_data = sentiment_df[sentiment_df['length'] >=3][['Text', 'Sentiment']]

Step9: We can replace the positive value with ‘1’ and negative with ‘0’.

final_data['Sentiment'] = final_data['Sentiment'].replace({'positive':1, 'negative':0})

Step10: Before we start the training process, let's split the data into train and test (or validation) sets.

X = final_data['Text']
y = final_data['Sentiment']
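The split itself can be done with scikit-learn's `train_test_split`; the 80/20 ratio is an assumed choice, and the stand-in data below is only there to keep the sketch runnable (in the post, `X` and `y` are `final_data['Text']` and `final_data['Sentiment']`):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# tiny stand-in data for illustration
X = pd.Series(['good results', 'bad quarter', 'strong growth', 'weak sales'] * 5)
y = pd.Series([1, 0, 1, 0] * 5)

# hold out 20% for validation; stratify keeps the class balance
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(len(train_x), len(test_x))  # -> 16 4
```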

Step11: Since the model cannot accept words, the text has to be converted into numbers.

dist_list = []
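`dist_list` is only initialized above; one plausible completion that collects every word and assigns each unique word an index (producing the `word_dict` used in the next snippet) could look like this, with a stand-in corpus where the post would iterate over `final_data['Text']`:

```python
# stand-in corpus for illustration
texts = ['profit up strongly', 'sales down', 'profit down']

dist_list = []
for text in texts:
    dist_list.extend(text.split(' '))

# index 0 is reserved for padding, so numbering starts at 1
word_dict = {word: i + 1 for i, word in enumerate(sorted(set(dist_list)))}
print(word_dict)
# -> {'down': 1, 'profit': 2, 'sales': 3, 'strongly': 4, 'up': 5}
```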

Now that we’ve assigned numbers to each unique word, we can replace the actual text with numerical values.

#let's map the input word with the number
def process_input(x):
    final_value = []
    split_x = x.split(' ')
    for word in split_x:
        value = word_dict[word]
        final_value.append(value)
    return final_value

Step12: Zero padding for the unfilled spaces. Since the maximum length of the text is 22, we’ll perform zero-padding for the shorter length sentences.

max_len = 22
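Keras provides `pad_sequences` for this, but the operation is simple enough to sketch in plain Python (post-padding with zeros, truncating anything longer than `max_len`):

```python
def pad_sequence(seq, max_len=22):
    # truncate long sequences, right-pad short ones with zeros
    seq = seq[:max_len]
    return seq + [0] * (max_len - len(seq))

padded = pad_sequence([5, 3, 9])
print(len(padded))  # -> 22
```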

Step13: Using the word embedding layer to compress the data.

#now that we've replaced the words with numbers, we'll use embeddings to compress the data

The embedding layer will take the input and perform the word embedding.
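Conceptually, an embedding layer is just a trainable lookup table that maps each word index to a dense vector; a numpy sketch makes this concrete (the dimensions here are illustrative assumptions):

```python
import numpy as np

vocab_size = 6   # assumed vocabulary size, including the padding index 0
embed_dim = 4    # assumed embedding width; real models often use 50-300

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

# the embedding "layer" is a row lookup into this matrix
sentence = np.array([2, 5, 1, 0, 0])   # padded word indices
embedded = embedding_matrix[sentence]  # one dense vector per word
print(embedded.shape)  # -> (5, 4)
```

During training, the rows of this matrix are learned along with the rest of the network.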

Step14: Finally, we can fit the model and then run predictions.
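A minimal sketch of such a model, assuming a Keras/TensorFlow stack; the layer sizes and vocabulary size here are illustrative assumptions, not the post's exact values:

```python
import numpy as np
import tensorflow as tf

vocab_size = 5000   # assumption: number of unique words + 1 for padding
max_len = 22        # maximum text length from Step 12

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 32),       # Step 13: word embedding
    tf.keras.layers.SimpleRNN(32),                   # the recurrent layer
    tf.keras.layers.Dense(1, activation='sigmoid'),  # binary sentiment output
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])

# fit on the padded training split, then predict on the held-out data, e.g.:
# model.fit(train_x, train_y, epochs=5)   # train_x/train_y: assumed split names
# model.predict(val_final[0:5])
```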

test_x[:5], test_y[:5]

The above data corresponds to the actual test details. The fourth record carries negative sentiment, whereas the rest are positive. Let’s check whether the designed model makes the correct prediction.

model.predict(val_final[0:5])

The model has predicted correctly, with the 4th record as negative and the rest as positive.

The entire code can be found in the GitHub repository.

Note: we can always play around with the preprocessing steps depending on the use case.
