Dynamic Programming is a model-based learning approach that finds the optimal policy using the known dynamics of the environment, i.e., the transition probabilities. DP has two variants: (1) Value iteration and (2) Policy iteration. In this post, our focus will be on Value iteration, where the optimal policy is derived from the maximum value function. The previous articles, Day 169 & Day 170, can be referred to for the optimal value function and the Q function, i.e., the Bellman equation.
As we already know, the optimal value function can be found by iterating the Q function over all the actions and picking the one that maximizes the output. To understand this clearly, let's take a simple example: we have a state 's', where taking action '0' results in the value 5 and action '1' gives 2. In this case, we can immediately say that action '0' is the optimal one for state 's', as it yields the higher value of the two. In the same way, we can iterate over the other state-action pairs to figure out which action is optimal for each state in the environment.
Now expanding the same example to multiple states, say s1, s2 & s3, with two actions 0 and 1: in state s1, action = 0 gives the maximum value; in s2, action = 1 results in the maximum outcome; and in s3, action = 0 is the optimal one. So the underlying question is which action, for which state, gives the best value compared to the rest. That is the core idea behind finding the optimal value.
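As a rough sketch of this idea, the snippet below picks the best action per state from a table of Q values. The numbers themselves are made up for illustration; they just reproduce the pattern described above:

```python
# Hypothetical Q-values for three states and two actions
# (the numbers are made up for illustration).
Q = {
    ("s1", 0): 5.0, ("s1", 1): 2.0,
    ("s2", 0): 1.0, ("s2", 1): 4.0,
    ("s3", 0): 3.0, ("s3", 1): 0.5,
}

actions = [0, 1]
# For each state, pick the action with the highest Q value.
policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in ("s1", "s2", "s3")}
print(policy)  # {'s1': 0, 's2': 1, 's3': 0}
```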
We also need to account for the transition probabilities corresponding to the state, action and next state (i.e., the stochastic nature of the environment). The same state-action pair will not always lead to a particular next state s'; instead it might lead, for example, to state s2 with 10% probability, s3 with 70% and s4 with 20%. So while computing the total value for a particular state-action pair, we also need to include the transition probabilities.
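This weighting by transition probabilities is just an expected value. A small sketch, using the probabilities from the text and placeholder next-state values (the V numbers are assumptions for illustration):

```python
# Transition probabilities for one state-action pair (from the text):
# 10% to s2, 70% to s3, 20% to s4.
P = {"s2": 0.1, "s3": 0.7, "s4": 0.2}
# Hypothetical next-state values (placeholders for illustration).
V = {"s2": 1.0, "s3": 0.5, "s4": -1.0}

# The value contribution of this state-action pair is the
# probability-weighted average of the next-state values.
expected_value = sum(p * V[s_next] for s_next, p in P.items())
print(round(expected_value, 2))  # 0.25
```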
Here are the steps that will be followed: first, we compute the Q value for each state-action pair. Then we take the maximum of the Q function over the actions. One point to note here: the discount factor can be taken as '1', and the next-state value, i.e., V(s'), will be either zero or some random value, as we do not know this information on the first run.
Let's again take a simple example to understand the stochastic property clearly: we have one state 'S' with two possible actions, 0 and 1. The next state can be one of three states S1, S2, S3; the transition probabilities for each action are given below, and the corresponding rewards are S1 = 1, S2 = 0 and S3 = -1.
state 'S' and action '0' = (0.2) * R(S1) + (0.7) * R(S2) + (0.1) * R(S3)
Substituting the rewards in the above equation, we get 0.2 * 1 + 0.7 * 0 + 0.1 * (-1) = 0.1
state 'S' and action '1' = (0.7) * R(S1) + (0.2) * R(S2) + (0.1) * R(S3)
Substituting the rewards in the above equation, we get 0.7 * 1 + 0.2 * 0 + 0.1 * (-1) = 0.6
When we compare the two values, the latter results in the higher value, 0.6. So action '1' is the optimal choice for state 'S'.
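The worked example above can be checked in a few lines of Python, with the discount taken as 1 and the next-state values as zero, per the steps described earlier:

```python
# Rewards of the next states, as given in the example.
rewards = {"S1": 1, "S2": 0, "S3": -1}

# Transition probabilities for each action from state 'S'.
P = {
    0: {"S1": 0.2, "S2": 0.7, "S3": 0.1},
    1: {"S1": 0.7, "S2": 0.2, "S3": 0.1},
}

# Q value of each action = probability-weighted sum of rewards
# (discount = 1 and V(s') = 0 on the first run).
Q = {a: sum(p * rewards[s] for s, p in trans.items()) for a, trans in P.items()}
print({a: round(q, 2) for a, q in Q.items()})  # {0: 0.1, 1: 0.6}

best_action = max(Q, key=Q.get)
print(best_action)  # 1
```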
The above process covers only the initial iteration, with the next-state values initialized arbitrarily. In subsequent iterations, the state values computed in the previous iteration are used, so each state's value gets updated and slowly converges towards the optimal value.
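Putting it all together, here is a minimal value-iteration loop on a tiny hypothetical MDP. All the states, rewards, transition probabilities and the discount factor here are assumptions chosen just to exercise the loop; they are not from the example above:

```python
# P[s][a] is a list of (probability, next_state, reward) triples.
# This MDP is entirely hypothetical, for illustration only.
P = {
    "A": {0: [(1.0, "B", 0.0)], 1: [(0.5, "A", 1.0), (0.5, "B", 0.0)]},
    "B": {0: [(1.0, "A", 0.0)], 1: [(1.0, "B", 2.0)]},
}
gamma = 0.9                    # discount factor
V = {s: 0.0 for s in P}        # next-state values start at zero

for _ in range(1000):
    # One sweep: back up every state with the Bellman optimality update.
    new_V = {
        s: max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for outcomes in acts.values()
        )
        for s, acts in P.items()
    }
    if max(abs(new_V[s] - V[s]) for s in P) < 1e-10:
        break                  # values have converged
    V = new_V

# Greedy policy with respect to the converged values.
policy = {
    s: max(acts, key=lambda a: sum(p * (r + gamma * V[s2])
                                   for p, s2, r in acts[a]))
    for s, acts in P.items()
}
print({s: round(v, 2) for s, v in V.items()})  # {'A': 18.0, 'B': 20.0}
print(policy)                                  # {'A': 0, 'B': 1}
```

In this toy MDP, staying in state B via action 1 earns a reward of 2 forever, so V(B) converges to 2 / (1 - 0.9) = 20, and the best move from A is to jump straight to B.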