Day 70(ML) — Gradient Boosting(Classification)


Let’s untangle the math behind gradient boosting for classification. The regression problem is comparatively easier to solve, since the loss function for continuous outputs is mean squared error, which keeps the gradient/derivative simple: the optimal constant prediction is just the average of the target values.

As with any concept, a worked example helps build a better understanding, so let’s apply the logic to the example below.

Odds & Log-odds: Before jumping into the classification use case, let’s cover the prerequisites, i.e. odds and log-odds. The odds are the ratio of favourable outcomes to unfavourable outcomes. To cite an example, out of 10 complex exams, the odds of passing are 4/6, i.e. 4 exams can be cleared and the remaining 6 are undesired ones (failing). If we rewrite the odds in terms of probabilities, we end up with p/(1-p), i.e. (4/10) / (6/10) = 4/6. Probabilities, by contrast, use the total count as the denominator.

The log-odds break the asymmetric nature of the raw odds values. For instance, 1/5 = 0.2 whereas 5/1 = 5, but the same values evaluated as log-odds (natural log, which is what the rest of this post uses) give log(1/5) = -1.61 while log(5/1) = 1.61: equal magnitudes on either side of zero.
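To make the symmetry concrete, here is a minimal Python sketch; the exam numbers follow the example above, and the natural log is assumed since the derivation later relies on e^log(odds):

```python
import math

def odds(p):
    """Odds = favourable / unfavourable = p / (1 - p)."""
    return p / (1 - p)

# Passing 4 out of 10 exams: p = 0.4, so odds = 4/6
print(odds(4 / 10))        # 0.666...

# Raw odds are asymmetric: 1/5 = 0.2 vs 5/1 = 5.
# Log-odds (natural log) are symmetric around zero:
print(math.log(1 / 5))     # -1.609...
print(math.log(5 / 1))     #  1.609...
```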

Gradient Boosting Classification: With the prerequisites covered, let’s walk through the algorithm step by step on the same example.

Step1 — Converting output probabilities to log(odds): The main challenge in classification is the loss function we must differentiate, i.e. log-loss = -(y * log(p) + (1-y) * log(1-p)), where ‘y’ is the actual label and ‘p’ is the predicted probability. The negative sign in front makes the loss positive, since the log of a probability is never positive.

log-loss = -log(p) when y = 1, and -log(1-p) when y = 0
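A quick sketch of this piecewise behaviour, with hypothetical probabilities chosen only for illustration:

```python
import math

def log_loss(y, p):
    """Per-sample cross-entropy: -(y*log(p) + (1-y)*log(1-p))."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Only one branch survives for a binary label:
print(log_loss(1, 0.9))    # -log(0.9) ~ 0.105 (confident and correct)
print(log_loss(0, 0.9))    # -log(0.1) ~ 2.303 (confident and wrong)
```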

We’ll start with the log-loss equation,

loss= -(y * log(p) + (1-y) * log(1-p))

loss = -y * log(p) - (1-y) * log(1-p) => in the next step we’ll multiply & expand (1-y) with log(1-p)

loss = -y * log(p) - log(1-p) + y * log(1-p) => we can combine the ‘y’ terms

loss = -y * [log(p) - log(1-p)] - log(1-p) => applying the log division rule

loss = -y * log(p/(1-p)) - log(1-p) => log(odds) = log(p/(1-p))

loss = -y * log(odds) - log(1-p)

Since we want the entire equation to be in terms of log(odds), let’s rewrite log(1-p) in terms of log(odds) as below,

log(odds) = log(p/(1-p)) => taking exponential on both sides

e^log(odds) = e^log(p/(1-p)) => e^log(x) = x

e^log(odds) = p/(1-p)

p = (1-p) * e^log(odds)

p = e^log(odds) - p * e^log(odds)

p + p * e^log(odds) = e^log(odds)

p * (1 + e^log(odds)) = e^log(odds)

p = e^log(odds) / (1 + e^log(odds))
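The derived mapping is the sigmoid function; a small sketch to verify it round-trips with log(odds), using an arbitrary log(odds) value:

```python
import math

def prob_from_log_odds(z):
    """p = e^z / (1 + e^z), the sigmoid derived above."""
    return math.exp(z) / (1 + math.exp(z))

def log_odds_from_prob(p):
    """Inverse mapping: log(odds) = log(p / (1 - p))."""
    return math.log(p / (1 - p))

# Round trip: log(odds) -> probability -> log(odds)
p = prob_from_log_odds(0.223)
print(round(p, 4))                       # 0.5555
print(round(log_odds_from_prob(p), 3))   # 0.223
```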

Now, substituting ‘p’ in the actual intended equation,

log(1-p) = log[1- e^log(odds) / (1+e^log(odds))]

log(1-p) = log[ (1 + e^log(odds) - e^log(odds)) / (1 + e^log(odds)) ]

log(1-p) = log[1/(1+e^log(odds))] => applying division rule

log(1-p) = log(1) — log(1 + e^log(odds)) => log(1) = 0

log(1-p) = -log(1 + e^log(odds))

Going back to the original log-loss equation,

loss = -y * log(odds) - log(1-p) => replace log(1-p) with the new value

loss = -y * log(odds) + log(1 + e^log(odds))
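The two forms of the loss should agree; here is a quick numerical check, where the value z = 0.7 is arbitrary:

```python
import math

def loss_probability_form(y, p):
    """Standard log-loss in terms of the probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def loss_log_odds_form(y, z):
    """Same loss rewritten as -y*log(odds) + log(1 + e^log(odds))."""
    return -y * z + math.log(1 + math.exp(z))

z = 0.7
p = math.exp(z) / (1 + math.exp(z))   # convert log(odds) to probability
for y in (0, 1):
    print(loss_probability_form(y, p), loss_log_odds_form(y, z))  # pairs match
```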

Step2 — Taking gradients/derivatives: Since the algorithm is driven by gradients of the loss function (as in regression), let’s take the derivative w.r.t log(odds),

dloss/dlog(odds) = -y * d[log(odds)]/dlog(odds) + d[log(1 + e^log(odds))]/dlog(odds) => applying the chain rule

dloss/dlog(odds) = -y + [ (1/(1 + e^log(odds))) * e^log(odds) ]

Representation 1 (as a function of log(odds)): dloss/dlog(odds) = -y + [ e^log(odds) / (1 + e^log(odds)) ], where the bracketed term is the predicted probability

Representation 2 (as a function of the predicted probability): dloss/dlog(odds) = -y + p

The two representations are interchangeable.
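A finite-difference check of this gradient, with arbitrary sample values:

```python
import math

def loss(y, z):
    """Log-loss as a function of log(odds): -y*z + log(1 + e^z)."""
    return -y * z + math.log(1 + math.exp(z))

def grad(y, z):
    """Analytic gradient: dloss/dlog(odds) = -y + p."""
    return -y + math.exp(z) / (1 + math.exp(z))

# Central-difference approximation of the derivative
y, z, h = 1, 0.3, 1e-6
numeric = (loss(y, z + h) - loss(y, z - h)) / (2 * h)
print(round(grad(y, z), 6), round(numeric, 6))   # both ~ -0.425557
```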

Step3 — Finding the initial optimal value: Before forming any stumps, we need to initialise the model with the best constant value, just as in regression. The initial value is chosen so that it minimises the loss/cost.

loss = -y * log(odds) + log(1 + e^log(odds))

Plugging the actual ‘y’ value of each training instance into the derivative in terms of probability, -y + p, each instance contributes a term of the form -0 + p or -1 + p (the target ‘No’ is encoded as ‘0’ and ‘Yes’ as ‘1’).

Now, taking the summation over the individual training instances and setting it to zero, we have,

(-0+p) + (-1+p) + (-1+p) + (-1+p) + (-0+p) + (-0+p) + (-1+p) + (-1+p) + (-0+p) = 0

9p - 5 = 0

p = 5/9

Converting the predicted probability to log(odds) = log(p/(1-p)),

log(odds) = log((5/9) / (4/9)) = log(5/4) ≈ 0.223

This will be the initial value.
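The initialisation can be sketched as follows; the label vector is an assumption reconstructed from the summation above (five 1s, four 0s):

```python
import math

# Hypothetical labels matching the worked example: five 'Yes' (1), four 'No' (0)
y = [0, 1, 1, 1, 0, 0, 1, 1, 0]

p0 = sum(y) / len(y)                  # 5/9, the loss-minimising constant
log_odds0 = math.log(p0 / (1 - p0))   # log(5/4)
print(round(p0, 3), round(log_odds0, 3))   # 0.556 0.223
```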

Step4 — Formation of the first tree and its predictions: Similar to regression, we now need to compute the residuals, which the first tree will be fitted to predict. We’ve already calculated the derivative in terms of log(odds), which is,

dL/dlog(odds) = -y + [ e^log(odds) / (1 + e^log(odds)) ] => the residual is the negative gradient, so multiply by -1

-dL/dlog(odds) = y - [ e^log(odds) / (1 + e^log(odds)) ]

Rewriting in terms of probability,

-dL/dlog(odds) = y - p

residual = y - p

residual = y - [ e^log(odds) / (1 + e^log(odds)) ] => now substituting log(5/4)

residual = y - [ (5/4) / (1 + 5/4) ] = y - (5/9) ≈ y - 0.55
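Computing the residuals for such a label vector (assumed from the summation in Step3; note that 5/9 rounds to 0.56, while the text rounds it to 0.55):

```python
import math

y = [0, 1, 1, 1, 0, 0, 1, 1, 0]       # assumed labels from the example
log_odds0 = math.log(5 / 4)           # initial prediction, ~0.223

# Convert the initial log(odds) back to a probability, then subtract
p0 = math.exp(log_odds0) / (1 + math.exp(log_odds0))   # 5/9 ~ 0.56
residuals = [yi - p0 for yi in y]
print([round(r, 2) for r in residuals])
# [-0.56, 0.44, 0.44, 0.44, -0.56, -0.56, 0.44, 0.44, -0.56]
```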

Note: We’ll build the first tree to learn these residuals. In our case, let’s assume ‘Expert_Level’ has the best information gain.

The next step is to assess the leaf value that minimizes the loss function. Each of the leaves above was allocated three samples, and these values are used to evaluate the optimal output that brings down the cost.

The loss function is denoted as L(yi, Fm-1(xi) + gamma), i.e. when we add the new gamma value (the leaf’s prediction) to the previous model’s output Fm-1(xi), the overall cost should be minimized.

Since this is a loss function, we can replace it with the loss we derived earlier, i.e. of the form

-y * log(odds) + log(1 + e^log(odds)) => writing the log(odds) as Fm-1(x1) + gamma,

-y1 * [Fm-1(x1) + gamma] + log(1 + e^(Fm-1(x1) + gamma))

Using a second-order Taylor expansion, the above can be approximated as,

L(y1, Fm-1(x1) + gamma) ≈ L(y1, Fm-1(x1)) + dL/dF(y1, Fm-1(x1)) * gamma + ½ * d²L/dF²(y1, Fm-1(x1)) * gamma²

Now, we can take the derivative of this function w.r.t gamma,

d/dgamma L(y1, Fm-1(x1) + gamma) = 0 + dL/dF(y1, Fm-1(x1)) + d²L/dF²(y1, Fm-1(x1)) * gamma

now set the above term to zero and solve for gamma,

dL/dF(y1, Fm-1(x1)) + d²L/dF²(y1, Fm-1(x1)) * gamma = 0

gamma = -( dL/dF(y1, Fm-1(x1)) / d²L/dF²(y1, Fm-1(x1)) )

We already know the first derivative of the loss function, so we can plug it in (the leading minus sign flips -y + p into y - p),

numerator = Observed - [ e^log(odds) / (1 + e^log(odds)) ]

numerator = Observed - P

numerator = Residual

We still have to solve for the denominator, the second derivative of the loss function, which is the derivative of the first derivative.

d/dlog(odds) [ d/dlog(odds) ( -Observed * log(odds) + log(1 + e^log(odds)) ) ]

d/dlog(odds) [ -Observed + e^log(odds) / (1 + e^log(odds)) ]

rewriting the fraction as a multiplication,

d/dlog(odds) [ -Observed + (1 + e^log(odds))^-1 * e^log(odds) ]

now taking the derivative, applying the product rule to the second term,

0 + [ -(1 + e^log(odds))^-2 * e^log(odds) * e^log(odds) ] + [ (1 + e^log(odds))^-1 * e^log(odds) ]

This long expression is the second derivative of the loss function; with a little bit of algebra, it becomes much easier to work with.

Rewriting the first term as a fraction,

-e^(2*log(odds)) / (1 + e^log(odds))^2 + e^log(odds) / (1 + e^log(odds))

Making the denominator common,

-e^(2*log(odds)) / (1 + e^log(odds))^2 + e^log(odds) * (1 + e^log(odds)) / (1 + e^log(odds))^2

-e^(2*log(odds)) / (1 + e^log(odds))^2 + [ e^log(odds) + e^(2*log(odds)) ] / (1 + e^log(odds))^2

e^log(odds) / (1 + e^log(odds))^2

split the denominator into two factors,

[ e^log(odds) / (1 + e^log(odds)) ] * [ 1 / (1 + e^log(odds)) ]

From the earlier derivation, p = e^log(odds) / (1 + e^log(odds))

and log(1-p) = log(1 / (1 + e^log(odds)))

which means

1-p = 1 / (1 + e^log(odds))

So, we can rewrite the second derivative as

P * (1-P)
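A numerical sanity check that the second derivative really is P * (1-P), using an arbitrary log(odds) value:

```python
import math

def sigmoid(z):
    return math.exp(z) / (1 + math.exp(z))

def grad(y, z):
    """First derivative of the loss: -y + p."""
    return -y + sigmoid(z)

# The second derivative should equal p * (1 - p), independent of y
z, h = 0.4, 1e-6
numeric = (grad(0, z + h) - grad(0, z - h)) / (2 * h)
p = sigmoid(z)
print(round(numeric, 6), round(p * (1 - p), 6))   # both ~ 0.240261
```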

Now, back to gamma,

Gamma = Residual / second derivative

The second derivative can be replaced with P * (1-P),

Gamma = Residual / [P * (1-P)]

For a leaf holding three samples, as in our case, the statement becomes,

Gamma = sum(residuals) / sum(previous probability * (1 - previous probability))

The previous probability is the same for every sample at this starting point.

Gamma(Beginner) = (-0.55 - 0.55 + 0.45) / [(0.55 * 0.45) + (0.55 * 0.45) + (0.55 * 0.45)]

Gamma(Beginner) = -0.65 / [3 * (0.55 * 0.45)] = -0.65 / 0.7425 ≈ -0.88

New prediction = previous log(odds) + learning rate * Gamma

New prediction = 0.223 + (0.1) * (-0.88)

New prediction ≈ 0.135

The new prediction (a log(odds) value) has to be converted back into a probability to compute the residuals for the next tree. The same process is iterated until the residuals become insignificant.
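The whole update for the ‘Beginner’ leaf can be sketched end to end; the residuals and probabilities below are the rounded values from the worked example, and the arithmetic is kept at full precision, so the last digits may differ slightly from the hand-rounded figures:

```python
import math

def sigmoid(z):
    return math.exp(z) / (1 + math.exp(z))

# Assumed values from the worked 'Beginner' leaf (rounded as in the text)
residuals  = [-0.55, -0.55, 0.45]
prev_probs = [0.55, 0.55, 0.55]
gamma = sum(residuals) / sum(p * (1 - p) for p in prev_probs)

learning_rate = 0.1
new_log_odds = 0.223 + learning_rate * gamma   # previous log(odds) + lr * gamma
new_prob = sigmoid(new_log_odds)               # feeds the next tree's residuals
print(round(gamma, 2), round(new_log_odds, 3), round(new_prob, 3))
```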

My learnings are from StatQuest videos; I strongly recommend checking out the link below for better clarity.

Recommended Video:

https://www.youtube.com/watch?v=StWY5QWMXCw

