Evaluation Metrics for Classification & Class Imbalance

Measuring the performance of a classification model (such as the logistic regression from Day 7) involves several yardsticks: accuracy, precision, recall, f1-score, ROC, AUC and the precision-recall curve (PRC). Let’s demystify each of these techniques one by one, but before that, we need to understand class imbalance.

Table of contents:

  1. What is Class Imbalance?
  2. The usefulness of Confusion matrix
  3. Accuracy, Precision, Recall, f1-score
  4. ROC (Receiver Operating Characteristic)
  5. AUC (Area Under the Curve)
  6. PRC (Precision-Recall Curve)

What is Class Imbalance? — Imagine we are collecting data on whether a person has bought a Tesla Roadster or not. The majority of the records would belong to people who have not bought the car, while only a small proportion of people have actually purchased the Roadster. Let’s place the data in a visual count chart.

Fig1 — shows count plot of car purchase

Note: The term ‘class’ in ML indicates the target output of classification models.

From the above picture, we can tell the data is heavily imbalanced. This is denoted as ‘class imbalance’ in machine learning.
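To make this concrete, here is a minimal sketch in Python with made-up counts (say 990 non-buyers and only 10 buyers):

```python
# A made-up target: 990 people did not buy the Roadster, 10 did.
import numpy as np

y = np.array([0] * 990 + [1] * 10)   # 0 = "did not buy", 1 = "bought"

classes, counts = np.unique(y, return_counts=True)
for c, n in zip(classes, counts):
    print(f"class {c}: {n} samples ({n / len(y):.1%})")
# class 0: 990 samples (99.0%)
# class 1: 10 samples (1.0%)
```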

The usefulness of Confusion matrix: In any scientific experiment, researchers come up with a hypothesis that needs to be tested against the evidence. The two hypotheses widely used in statistics (and borrowed by ML) are the null hypothesis and the alternate hypothesis.

Alternate hypothesis: In statistical hypothesis testing, the alternative hypothesis is the position that states something is happening, that a new theory is preferred over an old one (from Wikipedia). In the car example above, the alternate hypothesis is that a person will purchase the car in the future (the outcome of interest for the company).

Null hypothesis: The counterpart of the alternate hypothesis, i.e. that a person will not buy the vehicle.

Fig2 — shows the confusion matrix for classification

The confusion matrix captures all the correctly classified and misclassified records, class by class. The class of interest is always treated as the ‘alternate hypothesis’ (the positive class) and the other as the ‘null hypothesis’ (the negative class).

Helpful statistical terms used in ML (explained with the car example):

True Positive — When the expected value is ‘Yes’ and the predicted value is also ‘Yes’.

True Negative — When the expected value is ‘No’ and the predicted value is also ‘No’.

False Positive(Type I error) — When the expected value is ‘No’ but the predicted value is ‘Yes’.

False Negative(Type II error) — When the expected value is ‘Yes’ but the model has predicted it as ‘No’.
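As a quick illustration, scikit-learn’s confusion_matrix gives these four counts directly; the tiny label arrays below are hypothetical and only serve to show how TP, TN, FP and FN are read off:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 0, 1, 0, 0, 1, 0]   # what actually happened (1 = bought)
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]   # what the model predicted

# For binary 0/1 labels, ravel() unpacks the 2x2 matrix as TN, FP, FN, TP
# (rows are actual classes, columns are predicted classes).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")
# TP=2  TN=4  FP=1  FN=1
```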

Accuracy, Precision, Recall, f1-score: Accuracy is the fraction of observations the model classifies correctly out of the total number of observations.

Accuracy = Correct observations / Total number of observations

TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative

Overall Accuracy = (TP + TN)/(TP + TN + FP + FN)

Classification error rate = (FP + FN) / (TP + TN + FP + FN)

The robustness of the model cannot be judged by accuracy alone, because overall accuracy fails to capture the predictive power on an imbalanced dataset. To illustrate, consider a scenario where 99% of the data belongs to the ‘negative’ class and just 1% to the positive one. If the model predicts ‘negative’ for every sample, its accuracy will be 99%. Such a model is useless (even though the accuracy is very high): it fails on every positive sample.
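A small sketch of this 99%/1% scenario, assuming a dummy model that always predicts ‘negative’, makes the point:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 99 + [1])   # 99 negatives, 1 positive
y_pred = np.zeros_like(y_true)      # a "model" that always predicts negative

print(accuracy_score(y_true, y_pred))   # 0.99 -- looks impressive
print(recall_score(y_true, y_pred))     # 0.0  -- not a single positive is caught
```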

In order to overcome the shortcoming of accuracy, we evaluate Precision and Recall.

Precision — Out of all the values predicted as ‘Positive’, what fraction is actually positive.

Precision = TP / (TP + FP)

Recall or Sensitivity or True Positive Rate — Out of all the actual ‘Positive’ values, what fraction is predicted as positive.

Recall = TP / (TP + FN)

By combining precision and recall we try to balance the Type I and Type II errors; this combination (their harmonic mean) is the f1-score.

f1-score = (2 * Precision * Recall) / (Precision + Recall)

Fig3 — shows a sample confusion matrix

Accuracy = 93/100 = 93%

Precision = 1/3 ≈ 0.33

Recall = 1/6 ≈ 0.17

f1-score = (2 × 0.33 × 0.17) / (0.33 + 0.17) = 0.11 / 0.5 ≈ 0.22

f1-score ≈ 22%
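The same numbers can be reproduced with scikit-learn. The cell counts used below (TP = 1, FP = 2, FN = 5, TN = 92) are inferred from the accuracy, precision and recall above rather than read straight off Fig3:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 6 actual buyers, 94 non-buyers; predictions arranged so that
# TP = 1, FN = 5, FP = 2, TN = 92.
y_true = np.array([1] * 6 + [0] * 94)
y_pred = np.array([1] + [0] * 5 + [1] * 2 + [0] * 92)

print(accuracy_score(y_true, y_pred))   # 0.93
print(precision_score(y_true, y_pred))  # 0.333...
print(recall_score(y_true, y_pred))     # 0.166...
print(f1_score(y_true, y_pred))         # 0.222...
```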

If we go by accuracy alone, the model seems to be performing very well, but the f1-score reveals that its discriminating power is far below an acceptable level. So, for classification problems, all these metrics should be examined before judging the model’s performance.

A few more rates derived from the confusion matrix:

Specificity or True Negative Rate = TN / (TN + FP)

False Positive Rate = FP / (TN + FP)

False Negative Rate = FN / (TP + FN)
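Plugging in the same assumed counts from the worked example (TP = 1, TN = 92, FP = 2, FN = 5):

```python
# Counts taken from the worked example above.
tp, tn, fp, fn = 1, 92, 2, 5

specificity = tn / (tn + fp)          # True Negative Rate
false_positive_rate = fp / (tn + fp)  # = 1 - specificity
false_negative_rate = fn / (tp + fn)  # = 1 - recall

print(specificity)           # ~0.98
print(false_positive_rate)   # ~0.02
print(false_negative_rate)   # ~0.83
```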

ROC (Receiver Operating Characteristic) — A graph of the true positive rate against the false positive rate. The curve is formed by varying the classification threshold of the model and computing the corresponding true positive rate and false positive rate at each setting. The ROC is relatively insensitive to class counts because all four outcomes (TP, TN, FP, FN) feed into the rates that build the curve.

Fig4:- shows the ROC for different thresholds for the same model

Conservative Models:- (0, 0) Both the True and False positive rates are equal to zero implying that the model classifies all the records as strictly negative.

Liberal Models:- (1,1) Both the True and False positive rates are equal to 1 meaning that the model classified all the records as strictly positive.

Ideal Models:- (0,1) True positive rate is 1 whereas the false positive rate is 0. This is the desired model; in practice such models are not achievable, and there is always some cost paid in terms of Type I and Type II errors.

Fig 5:- shows the curve drawn with different thresholds

The threshold point is chosen such that the gain in the true positive rate is high compared to the increase in the false positive rate.
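Here is a short sketch of how these (TPR, FPR) pairs are generated; roc_curve sweeps the threshold over the predicted scores, and the labels and scores below are made up purely for illustration:

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true  = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.2, 0.3, 0.1, 0.55, 0.8, 0.7, 0.5, 0.6, 0.9, 0.05])

# roc_curve varies the threshold over the scores and returns the
# (FPR, TPR) pair produced at each threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  TPR={t:.2f}  FPR={f:.2f}")
```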

AUC (Area Under the Curve) — The area under the plotted ROC curve summarises the curve in a single number. Usually the AUC is computed for every candidate model, and the model covering the maximum area is preferred.

Fig6:- show AUC for different models

ROC and AUC are used for classification problems whose models output probabilities (or scores). In reality, zero error is not attainable because the class distributions usually overlap, and points in the overlapping region end up misclassified.
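As an example, two hypothetical models can be compared through roc_auc_score; the probability scores below are invented for illustration only:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true  = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
model_a = np.array([0.2, 0.3, 0.1, 0.55, 0.8, 0.7, 0.5, 0.6, 0.9, 0.05])
model_b = np.array([0.52, 0.45, 0.6, 0.35, 0.55, 0.5, 0.4, 0.7, 0.65, 0.3])

print(roc_auc_score(y_true, model_a))   # ~0.92 -- ranks positives above negatives well
print(roc_auc_score(y_true, model_b))   # ~0.58 -- barely better than random guessing
# Model A covers more area under its ROC curve, so it would be preferred.
```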

Precision-Recall Curve (PRC) — Also used to select the best-performing model among the available ones. Recall is plotted on the X-axis and precision on the Y-axis, again by sweeping the decision threshold.

Fig 7:- shows the precision-recall curve

In the case of the PRC, the curve that comes closest to the point (1, 1) is preferred, because the objective is to maximise both precision and recall, i.e. to minimise both Type I and Type II errors.
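A minimal sketch with precision_recall_curve, reusing the same made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true  = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.2, 0.3, 0.1, 0.55, 0.8, 0.7, 0.5, 0.6, 0.9, 0.05])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
for p, r in zip(precision, recall):
    print(f"recall={r:.2f}  precision={p:.2f}")
# The closer the curve sits to (recall=1, precision=1), the better the model.
```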

