To build a solid fundamental understanding of deep reinforcement learning, we need to become familiar with a handful of key concepts. We will start exploring some of them in this part of the article.
In the first part of the article, we postulated that, from a technical perspective, the field of deep reinforcement learning (DRL) is about building algorithms that can solve complex problems involving decision-making under uncertainty. In the DRL world, these algorithms are referred to as agents. An agent is the decision-maker and nothing else. This means that if you are training a warehouse bot to navigate a warehouse, the bot itself is not the agent in any form; only the algorithm that is making the decisions can be referred to as the agent.
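To make this concrete, here is a minimal sketch of what "the agent is just the decision-maker" can look like in code. The class and method names are purely illustrative and not taken from any particular library:

```python
# A minimal, illustrative agent: its only job is to map what it currently
# perceives to a decision. Everything else -- the bot's motors, the shelves,
# the warehouse floor -- lives outside this class.
class Agent:
    def select_action(self, observation):
        # A real DRL agent would use a learned policy (for example, a neural
        # network) here; this placeholder always picks the same action.
        return "move_forward"
```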
The next important concept is the environment. In DRL, everything that lies outside the agent is the environment, and the agent does not have total control over it. If we revisit our warehouse bot example, where the bot has to navigate to move a shelf of products, then the shelf that needs to be picked up, where that shelf currently is, and where it needs to go are all part of the environment. The agent has no direct control over these.
Remember that the warehouse bot itself is also part of the environment because, as we mentioned, it is not part of the agent. It may seem confusing that the bot belongs to the environment even though the algorithm, that is, the agent, decides how it moves. The gist is that the agent has just one task: to make decisions. Everything that happens after a decision, and every entity that comes into play because of it, falls into the environment bucket.
As you can imagine, an environment is essentially represented by a collection of variables that define the problem. For example, in our warehouse robotics scenario, the location and travel speed of the bot would be among the variables that constitute the environment. The comprehensive set of these variables, together with the entire range of values they can take, is collectively referred to as the state space. A particular state is, hence, an instantiation of the state space: a specific set of values that the variables take on at a given moment.
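To make the idea of a state concrete, here is a minimal sketch for the warehouse example. The variable names are made up for illustration and not taken from any specific framework:

```python
from dataclasses import dataclass

# One state is a single assignment of values to the variables that describe
# the warehouse; the state space is every combination of values these
# fields could take.
@dataclass
class WarehouseState:
    bot_position: tuple    # (x, y) location of the bot
    bot_speed: float       # current travel speed of the bot
    shelf_position: tuple  # where the target shelf is right now
    shelf_goal: tuple      # where the shelf needs to be moved

# A specific state: one concrete set of values drawn from the state space.
state = WarehouseState(
    bot_position=(2.0, 5.0),
    bot_speed=0.8,
    shelf_position=(7.5, 3.0),
    shelf_goal=(1.0, 9.0),
)
```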
An interesting aspect of DRL is that the agent does not always have access to the entire environment. It may observe only a specific part of the state, and that part is called an observation. Observations depend on states, but they are only what the agent can actually see. In the warehouse robotics scenario, for example, the agent may only receive grid coordinates for the racks or shelves that need to be moved. So while the exact location of each rack exists in the environment, the agent does not have access to that full state. The observations the agent perceives are derived from the underlying states, but they are not the states themselves.
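Continuing the WarehouseState sketch from above, one way to picture this difference is a small function that derives an observation from the full state. The grid discretization here is purely illustrative:

```python
# The environment holds the exact state, but the agent only receives a
# coarser observation derived from it -- grid-cell coordinates instead of
# exact positions.
def observe(state: WarehouseState, cell_size: float = 1.0) -> dict:
    return {
        "bot_cell": (int(state.bot_position[0] // cell_size),
                     int(state.bot_position[1] // cell_size)),
        "shelf_cell": (int(state.shelf_position[0] // cell_size),
                       int(state.shelf_position[1] // cell_size)),
    }

observation = observe(state)  # e.g., {'bot_cell': (2, 5), 'shelf_cell': (7, 3)}
```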
At each state, the dynamics of the environment present a set of actions for the agent to choose from. The agent can influence the environment through the actions it chooses; because of the agent's decisions and their subsequent impact, the environment may change states. We call this phenomenon "the environment responding to the agent's action". The mapping that describes how the environment's state will change in response to the agent's actions is called the transition function.
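In the same illustrative spirit, and continuing the sketch above, a transition function maps the current state and the chosen action to the next state. Real environments are often stochastic, so the function would return a distribution over next states rather than a single one; this toy version is deterministic:

```python
import dataclasses

# A toy, deterministic transition function: given the current state and the
# agent's action, produce the next state. In most real problems this mapping
# is stochastic and is not known to the agent in advance.
def transition(state: WarehouseState, action: str) -> WarehouseState:
    x, y = state.bot_position
    moves = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
    dx, dy = moves.get(action, (0, 0))
    return dataclasses.replace(state, bot_position=(x + dx, y + dy))

next_state = transition(state, "right")  # the bot moves one unit to the right
```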
Sometimes, the environment emits a reward signal in response to the agent's actions. Just as the transition function maps how the environment's state changes because of an agent's action, the reward function maps how the environment provides a reward signal in response to that action. Together, the transition and reward functions are referred to as the model of the environment.
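Following the same sketch, a reward function scores each transition, and together with the transition function it forms the environment's model. The distance-based reward below is just one illustrative choice, not the reward any particular warehouse system would use:

```python
import math

# A toy reward function: the agent is rewarded for moving the bot closer to
# the shelf it has to pick up. The transition function plus this reward
# function together make up the model of the environment.
def reward(state: WarehouseState, action: str, next_state: WarehouseState) -> float:
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    before = dist(state.bot_position, state.shelf_position)
    after = dist(next_state.bot_position, next_state.shelf_position)
    return before - after  # positive if the action brought the bot closer

r = reward(state, "right", next_state)
```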
The third part of this article will continue the discussion of the environment and the reinforcement learning cycle. It will be published on July 26th.

