Day 174(RL) — Dynamic Programming(Policy iteration)

Photo by Appolinary Kalashnikova on Unsplash

In this post, we’ll discuss two main topics: (1) how to extract the optimal policy from the optimal value function, and (2) policy iteration, another variant of dynamic programming (DP). Let’s start with extracting the optimal policy from the value function. In the earlier post, we looked at value iteration, where for every state we loop over all the actions, compute the Q value of each, and keep the maximum.

Once we have the optimal value function, that is, once we know for each state which action yields the maximum expected return, the next step is to read off the optimal policy that corresponds to it. Recall that a policy is the agent’s behaviour: it specifies which action to take in each state in order to maximize the reward. Given the optimal values, the optimal policy simply picks, in every state, the action with the highest Q value.
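To make the extraction step concrete, here is a minimal sketch in Python. The 3-state chain, the arrays P and R, the discount gamma and the function name extract_policy are all assumptions made for illustration; they do not come from the earlier post.

```python
import numpy as np

def extract_policy(P, R, V, gamma=0.9):
    """Return the greedy action for every state, given state values V.

    P: transition probabilities, shape (n_states, n_actions, n_states)
    R: expected immediate reward, shape (n_states, n_actions)
    V: state values, shape (n_states,)
    """
    # Q[s, a] = R[s, a] + gamma * sum over s' of P[s, a, s'] * V[s']
    Q = R + gamma * P @ V
    # The policy is the action with the highest Q value in each state.
    return Q.argmax(axis=1)

# Hypothetical 3-state chain: action 0 = left, action 1 = right.
# State 2 is absorbing; moving right from state 1 earns a reward of 1.
P = np.zeros((3, 2, 3))
P[0, 0, 0] = 1.0  # left from state 0 bumps into the wall
P[0, 1, 1] = 1.0  # right from state 0 goes to state 1
P[1, 0, 0] = 1.0  # left from state 1 goes to state 0
P[1, 1, 2] = 1.0  # right from state 1 goes to state 2
P[2, :, 2] = 1.0  # state 2 never changes
R = np.zeros((3, 2))
R[1, 1] = 1.0

# Optimal values for this chain with gamma = 0.9 (worked out by hand).
V_optimal = np.array([0.9, 1.0, 0.0])

greedy_policy = extract_policy(P, R, V_optimal)
print(greedy_policy)
```

In states 0 and 1 the greedy action is "right", which is the shortest way to the reward. In the absorbing state every action has the same value, so argmax simply returns the first one.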

Policy Iteration: The main difference between policy iteration and value iteration is how the value function is computed. We already know the value iteration part (looping over the different actions to find the maximum Q value). In policy iteration, we start from a random policy and compute the value function for that policy alone: each state is evaluated with the single action the current policy prescribes, with no maximum taken over the actions. The policy is then improved using those values, and the two steps are repeated.
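Written as update rules, the difference looks roughly like this. The notation is an assumption chosen for this sketch: R(s, a) is the expected immediate reward, P(s' | s, a) the transition probability, gamma the discount factor and pi(s) the action the current policy picks in state s.

```latex
% Value iteration: maximum over all actions in every state
V_{k+1}(s) = \max_{a} \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V_k(s') \Big]

% Policy evaluation: only the action chosen by the current policy \pi
V^{\pi}_{k+1}(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, V^{\pi}_k(s')

% Policy improvement: act greedily with respect to the evaluated values
\pi'(s) = \arg\max_{a} \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^{\pi}(s') \Big]
```

The first rule has a max inside every update, while the second has none; the max only reappears in the separate improvement step.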

Let’s understand the concept better with a simple example. Say we have 4 states: S1, S2, S3 and S4. The possible actions are left, right, top and bottom. With value iteration, we would try every action in every state to find which one gives the maximum value. With policy iteration, on the other hand, we initially assign one action per state, for example S1-left, S2-top, S3-right and S4-bottom. Notice that there is no looping over the actions here, only a single randomly chosen action for each state.
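The contrast can be sketched in a few lines of Python. The post does not say how the four states are connected or rewarded, so the layout below is purely an assumption: a 2x2 grid (S1 S2 on top, S3 S4 below), deterministic moves, S4 as a terminal goal, a reward of 1 for entering S4 and a discount of 0.9.

```python
GAMMA = 0.9
ACTIONS = ["left", "right", "top", "bottom"]
MOVES = {"left": (0, -1), "right": (0, 1), "top": (-1, 0), "bottom": (1, 0)}

def step(s, a):
    """Assumed dynamics: states 0..3 stand for S1..S4 on a 2x2 grid."""
    if s == 3:                       # S4 is terminal: nothing happens
        return 3, 0.0
    row, col = divmod(s, 2)
    d_row, d_col = MOVES[a]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < 2 and 0 <= new_col < 2:
        s_next = new_row * 2 + new_col
    else:
        s_next = s                   # bumping into a wall keeps the state
    reward = 1.0 if s_next == 3 else 0.0
    return s_next, reward

def backup(s, a, V):
    """One-step lookahead value of taking action a in state s."""
    s_next, reward = step(s, a)
    return reward + GAMMA * V[s_next]

V = [0.0, 0.0, 0.0, 0.0]                      # initial values
policy = ["left", "top", "right", "bottom"]   # S1-left, S2-top, S3-right, S4-bottom

# Value iteration: loop over every action and keep the maximum.
value_iteration_sweep = [max(backup(s, a, V) for a in ACTIONS) for s in range(4)]

# Policy evaluation: use only the action the policy assigns to the state.
policy_evaluation_sweep = [backup(s, policy[s], V) for s in range(4)]

print(value_iteration_sweep)
print(policy_evaluation_sweep)
```

After one sweep, value iteration already credits S2 and S3 because it tried every action, while the policy evaluation sweep only credits S3, the one state whose assigned action happens to lead to the goal.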

The next natural question is: if we only have a random policy to compute the value, how do we reach the optimal policy at all? The values computed for the current policy are used to improve it: in each state we switch to the action that looks best under those values, and then we evaluate the new policy again. Repeating this, the policy converges to the optimal one. But when do we stop iterating, or how do we know the iteration has converged? We compare the current policy with the previous policy (or the current values with the previous values); if there is no significant difference between the two, we can conclude that the policy has reached the optimal level.
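Putting the pieces together, a minimal sketch of the whole loop could look like this. It is one common way to write policy iteration, not necessarily the exact variant this series has in mind. The 2x2 grid with S4 as a terminal goal, the arrays P and R, and the parameters gamma, tol and seed are all assumptions made for illustration.

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.9, tol=1e-8):
    """Compute the values of a fixed policy by repeated sweeps."""
    n_states = P.shape[0]
    idx = np.arange(n_states)
    V = np.zeros(n_states)
    while True:
        # Each state is backed up with the single action the policy picks.
        V_new = R[idx, policy] + gamma * P[idx, policy] @ V
        if np.max(np.abs(V_new - V)) < tol:   # values stopped changing
            return V_new
        V = V_new

def policy_iteration(P, R, gamma=0.9, seed=0):
    n_states, n_actions = R.shape
    rng = np.random.default_rng(seed)
    policy = rng.integers(n_actions, size=n_states)  # random starting policy
    while True:
        V = policy_evaluation(P, R, policy, gamma)
        # Improvement: pick the best action under the current values.
        new_policy = (R + gamma * P @ V).argmax(axis=1)
        if np.array_equal(new_policy, policy):       # policy stopped changing
            return policy, V
        policy = new_policy

# Assumed 2x2 grid: states 0..3 are S1..S4, actions 0..3 are
# left, right, top, bottom. S4 is terminal; entering it pays 1.
moves = [(0, -1), (0, 1), (-1, 0), (1, 0)]
P = np.zeros((4, 4, 4))
R = np.zeros((4, 4))
for s in range(4):
    row, col = divmod(s, 2)
    for a, (d_row, d_col) in enumerate(moves):
        new_row, new_col = row + d_row, col + d_col
        inside = 0 <= new_row < 2 and 0 <= new_col < 2
        s_next = new_row * 2 + new_col if (inside and s != 3) else s
        P[s, a, s_next] = 1.0
        R[s, a] = 1.0 if (s_next == 3 and s != 3) else 0.0

final_policy, final_values = policy_iteration(P, R)
print(final_policy, final_values)
```

The stopping test here compares the new policy with the previous one, as described above. Comparing successive value functions against a small tolerance is the alternative, and the evaluation step uses exactly that test internally.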

