Understanding Deep Reinforcement Learning (DRL): Part X of XI

In the first part of this article series, we reviewed the field of Deep Reinforcement Learning (DRL), and in the second and third parts we started exploring some fundamental terminology. In the fourth part, we explored the temporal credit assignment problem, and the fifth part focused on the exploration vs. exploitation trade-off. In the sixth part, we explored the Markov decision process (MDP) and its role in reinforcement learning. In the seventh part, we started exploring a detailed example that we continued in the eighth part. In the ninth part, we explored the role of reward functions. In this part, we will explore the role of time horizons in MDPs and the concept of discounting.

Time representations can also be captured in MDPs. As stated in one of the previous parts, the fundamental unit for capturing time in an MDP environment is the time step. Many different terms are used for a time step, such as an epoch, a cycle, an iteration, and in some cases, an interaction. But the gist is that a time step in an MDP is essentially a clock that is global to the environment, syncs all entities in the environment, and makes time discrete. A minimal sketch of this interaction loop is shown below.
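To make this concrete, here is a minimal sketch, in Python, of how an agent and environment interact one discrete time step at a time. Everything here (the `ToyEnv` class, the `random_policy` function, the rewards) is made up purely for illustration and is not taken from any specific library.

```python
import random

class ToyEnv:
    """A tiny, made-up environment with two states; state 1 is terminal."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Transition: action 1 ends the episode, action 0 keeps it going.
        self.state = 1 if action == 1 else 0
        reward = 1.0 if self.state == 1 else 0.0
        done = self.state == 1
        return self.state, reward, done

def random_policy(state):
    # A placeholder policy that picks an action at random.
    return random.choice([0, 1])

env = ToyEnv()
state = env.reset()
for t in range(100):  # t is the time step: the global, discrete clock of the MDP
    action = random_policy(state)
    state, reward, done = env.step(action)
    print(f"time step {t}: action={action}, reward={reward}")
    if done:
        break
```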

We could not have episodic tasks without the concept of time steps. The same goes for continuing tasks, which are tasks that do not terminate because they have no terminal state. Both episodic and continuing tasks can be defined from the perspective of the agent, leveraging an approach commonly referred to as using a planning horizon. As you can imagine, a finite horizon is a planning horizon in which we know that the task will terminate within a finite number of time steps.

A greedy horizon is one type of planning horizon, specifically a finite planning horizon in which the task terminates after a single time step. If you are familiar with reinforcement learning environment types, you know that all bandit environments fall within the realm of greedy horizons, as the sketch below illustrates.
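Below is a minimal sketch of a greedy-horizon task: a simple multi-armed bandit in which every episode ends after a single action. The `BanditEnv` class and its reward distributions are hypothetical, chosen only to illustrate the idea.

```python
import random

class BanditEnv:
    """A made-up k-armed bandit: every interaction lasts exactly one time step."""

    def __init__(self, arm_means):
        self.arm_means = arm_means  # expected reward of each arm

    def step(self, arm):
        # One time step: pull an arm, receive a noisy reward, and the task ends.
        reward = random.gauss(self.arm_means[arm], 1.0)
        done = True  # greedy horizon: the episode is always over after one step
        return reward, done

env = BanditEnv(arm_means=[0.1, 0.5, 0.9])
reward, done = env.step(arm=2)
print(reward, done)  # the "episode" ended after a single time step
```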

Infinite horizons are time horizons in which there is no predetermined time step limit for the agent; the agent has to plan for an infinite number of time steps. Note that a task within the realm of an infinite horizon can still be episodic and may have a terminal state, but from the agent’s perspective the planning horizon may still appear infinite. This specific type of infinite planning horizon is also referred to as an indefinite horizon task: the agent may plan for infinite time steps, but the interaction can be stopped at any time by the environment.

There may be an infinite-horizon task in which there is a high probability that the agent gets stuck in a loop and the environment never terminates. In these circumstances, a best practice is to add an artificial terminal state based on a hard time step limit, enforced through the transition function. A minimal sketch of this idea is shown below.
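Here is a minimal sketch of how such an artificial terminal state can be added by wrapping an environment with a hard time step limit. The wrapper below is illustrative only; it is similar in spirit to the time limit wrappers found in common RL libraries, but it is not the API of any particular one.

```python
class TimeLimitWrapper:
    """Illustrative wrapper that forces termination after a hard time step limit."""

    def __init__(self, env, max_steps):
        self.env = env
        self.max_steps = max_steps
        self.elapsed = 0

    def reset(self):
        self.elapsed = 0
        return self.env.reset()

    def step(self, action):
        state, reward, done = self.env.step(action)
        self.elapsed += 1
        if self.elapsed >= self.max_steps:
            done = True  # artificial terminal state: the clock, not the dynamics, ends the episode
        return state, reward, done

# Usage (with the hypothetical ToyEnv from the earlier sketch):
# wrapped = TimeLimitWrapper(ToyEnv(), max_steps=1000)
```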

When thinking about infinite time horizon tasks, we need to provide the agent with some form of incentive to finish the task sooner rather than later. The approach that comes to the rescue is discounting. The motto of discounting can be described in one short sentence: “The future is uncertain, value it less.” We would not like to embrace this motto when it comes to business strategy formulation 😄 but it is a very efficient approach when it comes to infinite horizon tasks in reinforcement learning.

The approach of discounting, at a high level, is to reduce the value of rewards over time in order to signal to the agent that getting a +1 reward is better sooner than later. The number used for discounting is called the discount factor, or gamma. It adjusts the significance of a reward over the time horizon: a reward in the future is less attractive than a reward in the present.
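Concretely, the discounted return sums the rewards weighted by increasing powers of gamma: G = r_0 + gamma * r_1 + gamma^2 * r_2 + ..., with gamma between 0 and 1. The minimal sketch below computes this sum for a made-up reward sequence and shows that a smaller gamma makes later rewards contribute less.

```python
def discounted_return(rewards, gamma):
    """Compute G = r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

rewards = [1.0, 1.0, 1.0, 1.0]          # made-up reward sequence
print(discounted_return(rewards, 1.0))  # 4.0: no discounting, all rewards weigh the same
print(discounted_return(rewards, 0.9))  # 3.439: later rewards count less
```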

There is another important reason for using the discount factor: it also helps us reduce the variance of return estimates. Since the future is uncertain, the more we try to factor the future into our calculations, the more stochastic those calculations become, and this introduces more and more variance into our value estimates. The discount factor lets us limit the magnitude of the impact that future rewards have on our value function estimate, thereby helping us stabilize the agent’s learning. The small sketch below illustrates this effect.
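The following small, synthetic experiment sketches this variance-reduction effect: when every step’s reward is equally noisy, a smaller gamma damps the contribution of far-future noise and shrinks the variance of the sampled return. The numbers are purely illustrative.

```python
import random
import statistics

def sample_return(gamma, horizon=50):
    # Every step's reward is equally noisy, but discounting damps the
    # contribution of far-future noise to the sampled return.
    return sum((gamma ** k) * random.gauss(1.0, 1.0) for k in range(horizon))

for gamma in (0.99, 0.9, 0.5):
    returns = [sample_return(gamma) for _ in range(1000)]
    print(f"gamma={gamma}: sample variance of the return = {statistics.variance(returns):.2f}")
```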

In the last part of this article series, we will overview some extensions to MDPs and then conclude with some closing remarks. The final part will be published on 08/06.

