In the first part of the article, we reviewed the field of Deep Reinforcement Learning (DRL) and started exploring some fundamental terminology in the second and third parts. In the fourth part, we explored the temporal credit assignment problem, and the fifth part focused on the exploration vs. exploitation trade-off. In the sixth part, we explored the Markov decision process (MDP) and its role in reinforcement learning.
We started exploring a detailed example in the seventh part that we continued in the eighth part. In the ninth part, we explored the role of reward functions. In the tenth part, we explored the role of time horizons in MDPs and the concept of discounting. In this final part, we will conclude our discussion of Deep Reinforcement Learning.
MDP Extensions
As you might expect, the real-world problems we want to solve with DRL do not always fit neatly within the mould of a conventional MDP. To accommodate this, many extensions to the MDP have been developed. We will review some of them here, but keep in mind that this is not an exhaustive list.
Scenario: The agent is unable to fully observe the environment state.
Extension: Partially Observable Markov Decision Process (POMDP); see the sketch after this list for how it differs from a conventional MDP.
Scenario: Very large MDPs
Extension: Factored Markov Decision Process (FMDP), which can represent the transition and reward functions more compactly
Scenario: One of the key elements, such as action, time, or state, or a combination of these, is continuous
Extension: Continuous Markov Decision Process (CMDP)
Scenario: Both probabilistic and relational knowledge are involved
Extension: Relational Markov Decision Process (RMDP)
Scenario: Abstract actions that can take multiple time steps to complete are involved
Extension: Semi-Markov Decision Process (SMDP)
Scenario: Multiple agents are involved in the same environment
Extension: Multi-Agent Markov Decision Process (MMDP)
Scenario: Multiple agents need to collaborate and maximize a common reward
Extension: Decentralized Markov Decision Process (Dec-MDP)
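To make the first of these extensions more concrete, here is a minimal sketch in Python contrasting a conventional MDP with a POMDP. It is illustrative only: the class and field names (MDP, POMDP, transition, reward, observation_fn, and so on) are assumptions made for this example rather than part of any particular library, and plain dictionaries stand in for the transition, reward, and observation functions.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Illustrative aliases (assumptions, not from the article): states, actions,
# and observations are plain strings; probabilities are floats.
State = str
Action = str
Observation = str


@dataclass
class MDP:
    """A conventional MDP: the agent observes the true state directly."""
    states: Tuple[State, ...]
    actions: Tuple[Action, ...]
    # transition[(s, a)] maps each possible next state to its probability
    transition: Dict[Tuple[State, Action], Dict[State, float]]
    # reward[(s, a)] is the expected immediate reward
    reward: Dict[Tuple[State, Action], float]
    discount: float


@dataclass
class POMDP(MDP):
    """A POMDP extends the MDP with observations: the agent never sees the
    state itself, only an observation emitted after each transition."""
    observations: Tuple[Observation, ...] = ()
    # observation_fn[(a, s_next)] maps each observation to its probability
    observation_fn: Dict[Tuple[Action, State], Dict[Observation, float]] = field(
        default_factory=dict
    )


# A tiny hypothetical example: the agent cannot tell "good" from "bad"
# directly; it only receives a noisy sensor reading.
pomdp = POMDP(
    states=("good", "bad"),
    actions=("wait",),
    transition={("good", "wait"): {"good": 0.9, "bad": 0.1},
                ("bad", "wait"): {"bad": 1.0}},
    reward={("good", "wait"): 1.0, ("bad", "wait"): 0.0},
    discount=0.95,
    observations=("looks_ok", "looks_broken"),
    observation_fn={("wait", "good"): {"looks_ok": 0.8, "looks_broken": 0.2},
                    ("wait", "bad"): {"looks_ok": 0.3, "looks_broken": 0.7}},
)
```

The other extensions listed above change different parts of this tuple in a similar spirit: an FMDP factors the transition and reward dictionaries, an SMDP lets actions span multiple time steps, and the multi-agent variants replace the single action with a joint action across agents.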
Conclusion
In the previous ten parts of this article series, we explored the components of a reinforcement learning problem and how they relate to and interact with one another. We also introduced the Markov Decision Process (MDP) and saw what the process entails and how it works. Then, through a simple example, we explored how sequential decision-making problems can be represented as MDPs.

In tandem with deep learning, reinforcement learning is, in my opinion, the answer to the decision-making automation opportunities that exist in planning areas, across functions. Not only can these algorithms bring productivity and accuracy beyond what humans can achieve alone, but they can also find optimal solutions.

