In the previous post, we got a brief idea of what a policy is; now we'll focus on the types of policy. Policies fall into two types. (i) Deterministic policy: the policy tells the agent to perform exactly one action in a given state. For example, take a robot playing chess: the policy tells the robot which specific action (move) to take from the current state. Whenever the robot reaches that same state, the next action will always be the same. (ii) Stochastic policy: instead of a single action, the policy gives a probability distribution over the action space, and the agent picks its action by sampling from that distribution. Let's explore the stochastic policy further.
Stochastic policies are further subdivided into two types. (i) Categorical policy: the action space is a well-defined, discrete set of actions, each with an associated probability. In the chess example, whenever a piece is on a particular square, it might have a 90% likelihood of moving forward, 5% of moving left, and 5% of moving right. (ii) Gaussian policy: the action space is continuous, and the policy is a Gaussian (normal) probability distribution over it. For instance, in a self-driving car the action could be the speed, and the policy's output is a continuous value.
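To make the difference concrete, here is a minimal sketch of sampling an action from each kind of stochastic policy. It reuses the 90/5/5 probabilities from the chess example; the action names, the speed mean of 60 and the standard deviation of 5 are made-up values for illustration only.

```python
import random

# Hypothetical categorical policy for one state: action -> probability.
# The 90/5/5 split comes from the chess example; the action names are assumed.
categorical_policy = {"forward": 0.90, "left": 0.05, "right": 0.05}


def sample_categorical(policy, rng):
    """Pick one discrete action according to its probability."""
    actions = list(policy.keys())
    probs = list(policy.values())
    return rng.choices(actions, weights=probs, k=1)[0]


def sample_gaussian(mean, std, rng):
    """Pick one continuous action (e.g. a speed) from a normal distribution."""
    return rng.gauss(mean, std)


rng = random.Random(0)  # seeded so repeated runs give the same samples

# Sample the categorical policy many times and count the chosen actions.
counts = {action: 0 for action in categorical_policy}
for _ in range(10_000):
    counts[sample_categorical(categorical_policy, rng)] += 1

# Sample one continuous action; mean and std are illustrative assumptions.
speed = sample_gaussian(mean=60.0, std=5.0, rng=rng)

print(counts)
print(speed)
```

The counts should land close to the 90/5/5 split, while the Gaussian policy returns a real number near the mean rather than one of a fixed set of choices.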
Episode: the complete sequence of states and actions from the initial state to the final state. For example, one chess game is one episode. When the robot first starts playing with a random policy (random actions), the total reward will be low because of the randomness. During the second game (second episode), the policy gets a little better and the robot plays a little better. So the learning improves with every episode, and after ’n’ episodes the agent ideally learns the optimal policy.
Tasks in Reinforcement Learning are divided into two categories. (i) Episodic tasks, which have a start state and an end state (a chess game). (ii) Continuous tasks (also called continuing tasks), which have no end state. An example is a household robot that is always in action.
Return & Discount Factor: For an episodic task, the return is the sum of the rewards obtained over the whole episode. Since an episode terminates after a finite number of steps, we can compute the return with a simple summation. For continuous tasks, the return cannot simply be the sum of the individual rewards: the task goes on forever (there is no final state), so that sum could grow without bound. For such scenarios, a discount factor is introduced.
The discount factor is a value between 0 and 1 that scales down rewards the further in the future they arrive. A smaller discount factor gives more importance to immediate rewards and very little weight to future rewards. With a larger discount factor (closer to 1), future rewards get more attention.
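As a rough sketch, the discounted return is usually written as G = r0 + gamma * r1 + gamma^2 * r2 + ..., where gamma is the discount factor. The snippet below computes it for a short, made-up reward sequence (the reward values are assumptions for illustration) so the effect of a small versus a large discount factor is easy to see.

```python
def discounted_return(rewards, gamma):
    """Return r0 + gamma*r1 + gamma^2*r2 + ... for a list of rewards."""
    total = 0.0
    # Work backwards so each step only needs one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Illustrative rewards: small rewards early, one large reward at the end.
rewards = [1.0, 1.0, 1.0, 10.0]

low_gamma_return = discounted_return(rewards, gamma=0.1)   # about 1.12
high_gamma_return = discounted_return(rewards, gamma=0.9)  # about 10.0
plain_sum = discounted_return(rewards, gamma=1.0)          # 13.0, no discounting

print(low_gamma_return, high_gamma_return, plain_sum)
```

With gamma = 0.1 the large final reward barely counts, while with gamma = 0.9 it makes up most of the return.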
Value function: the return the agent can expect to get from a particular state onwards when following the policy. For example, suppose we have 5 states in a single episode. The value for state1 = total return (from state1 onwards), the value for state2 = total return (from state2 onwards), and so on.
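The "return from that state onwards" idea can be sketched for the 5-state example as follows. The reward values are made up for illustration, and a single episode only gives one sample of the return; the value function proper would be the average over many such episodes.

```python
def returns_from_each_state(rewards, gamma=1.0):
    """For each step t, compute the (discounted) return from step t onwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk the episode backwards, accumulating the return as we go.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Illustrative rewards received in state1 .. state5 of one episode.
episode_rewards = [1.0, 0.0, 2.0, 0.0, 5.0]

state_returns = returns_from_each_state(episode_rewards)
print(state_returns)  # [8.0, 7.0, 7.0, 5.0, 5.0]
```

Here the entry for state1 is the total of all five rewards, the entry for state2 drops the first reward, and so on down to state5, which only sees its own reward.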
Since the return is a random variable, we need to take its expected value, which means weighting outcomes by the probability of the actions that lead to them. Consider a stochastic policy with a discrete action space (where every action has a probability of occurrence). While computing the value function, each action's probability is multiplied by the reward received for that step, and all these weighted values are summed up. The optimal policy is then decided based on the value function: it is the policy that yields the highest value.
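A minimal sketch of this probability-weighted sum for a single state is shown below. The 90/5/5 policy comes from the chess example; the per-action returns and the second, uniform policy are hypothetical numbers chosen only to show how two policies can be compared by their value.

```python
def expected_value(policy, action_returns):
    """Sum of probability(action) * return(action) over all actions."""
    return sum(prob * action_returns[action] for action, prob in policy.items())


# Hypothetical return obtained after taking each action from this state.
action_returns = {"forward": 10.0, "left": 2.0, "right": -4.0}

# Policy A is the 90/5/5 policy from the chess example.
policy_a = {"forward": 0.90, "left": 0.05, "right": 0.05}
# Policy B is an assumed uniform policy, used only for comparison.
policy_b = {"forward": 1 / 3, "left": 1 / 3, "right": 1 / 3}

value_a = expected_value(policy_a, action_returns)  # about 8.9
value_b = expected_value(policy_b, action_returns)  # about 2.67

# The policy with the higher value is the better one for this state.
better = "A" if value_a > value_b else "B"
print(value_a, value_b, better)
```

Policy A puts most of its probability on the action with the highest return, so it ends up with the higher value.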
Based on the values returned for the states, we can decide which state is the best one to be in (the one with the highest expected return).