What is the Markov decision process?
The Markov decision process (often abbreviated to MDP) takes its name from the Russian mathematician Andrey Markov, whose early-20th-century work on chains of random events underpins it; the decision process itself was formalised in the 1950s, notably through Richard Bellman’s work on dynamic programming, and is still used today in areas such as AI, robotics, automation, and manufacturing. The MDP is what’s known as a stochastic process: some of the variables in the system are random, so you can’t predict exactly what pattern the system will produce, but you can still analyse it statistically.
To make an effective decision about the best next action, the MDP takes into account the variables that can affect the outcome of the process: the agent’s actions, the environment the agent is acting in, the rewards those actions earn, and the state the agent is in. In the next section, we’ll look at what each of these terms means.
MDP analogy for understanding terminology
Your agent is the decision-maker: whatever you’re using as your agent is the thing that chooses which action to take. Let’s look at a simple example to get a feel for the process:
Imagine you’re looking for different kinds of berries in a forest. You are the agent here, because you’re the one taking actions. The environment you’re in (the forest) offers you a range of possible actions: the different kinds of berries you could pick. Choosing one type of berry over another is taking an action. By performing this action, you receive what the Markov decision process calls a ‘reward’ corresponding to the action you’ve just taken.
By taking an action in the environment, you have entered a new state (the situation you’re now in, having chosen a berry); the action has changed the state of the environment. If you choose a berry that isn’t harmful, you’ve received a reward confirming that you took a good action. Because the reward is beneficial rather than harmful, you’ll keep repeating the process over time. Each time you choose an action, however, the size of the reward can differ, so there’s an element of probability involved. The aim is to increase the overall reward accumulated across future states.
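To make these pieces concrete, here is a minimal sketch of the berry example written as an MDP in Python. The state names, transition probabilities, and reward values are invented purely for illustration; a real problem would define its own.

```python
import random

# Illustrative berry-picking MDP: states, actions, probabilities, and rewards
# are all made up for this example.
states = ["hungry", "satisfied", "unwell"]
actions = ["pick_sweet_berry", "pick_unknown_berry"]

# Transition probabilities: transitions[state][action] -> list of (next_state, probability)
transitions = {
    "hungry": {
        "pick_sweet_berry":   [("satisfied", 0.9), ("hungry", 0.1)],
        "pick_unknown_berry": [("satisfied", 0.5), ("unwell", 0.5)],
    },
    "satisfied": {
        "pick_sweet_berry":   [("satisfied", 1.0)],
        "pick_unknown_berry": [("satisfied", 0.6), ("unwell", 0.4)],
    },
    "unwell": {
        "pick_sweet_berry":   [("unwell", 1.0)],
        "pick_unknown_berry": [("unwell", 1.0)],
    },
}

# Rewards for landing in each state: beneficial outcomes score higher.
rewards = {"hungry": 0.0, "satisfied": 1.0, "unwell": -2.0}

def step(state, action):
    """Sample the next state and its reward for one action."""
    next_states = [s for s, _ in transitions[state][action]]
    probs = [p for _, p in transitions[state][action]]
    next_state = random.choices(next_states, weights=probs)[0]
    return next_state, rewards[next_state]

state = "hungry"
state, reward = step(state, "pick_unknown_berry")
print(state, reward)  # e.g. "satisfied 1.0" or "unwell -2.0"
```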
This way of ‘learning’ which actions work best in an unpredictable environment is exactly what reinforcement learning does. Because the environment is unpredictable, you can’t always know in advance how much reward a given action will bring, but you can use the present state (which reflects the accumulated effect of all your past decisions) to inform your next action. This comes back to the idea that, in these processes, you can’t predict exactly what pattern will be produced.
This leads us to defining what a Markov property is, as it may help you to understand the importance of the present state in the MDP.
What is the Markov property?
The Markov property means that the future of the process depends only on the present state, not on how that state was reached: whatever needs to be ‘learned’ from past states is already captured in the current one, and the rest of the history is effectively ‘forgotten’. In our example above, this means the outcome of your next choice depends only on the situation you’re in now, not on the full sequence of choices that got you there.
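The sketch below shows the same idea in code, reusing the illustrative step function and states from the berry example above: the next state is drawn from the current state and action alone, so a history can be recorded but is never consulted.

```python
# The Markov property in miniature: the next state depends only on the
# current state and the chosen action. The history list is kept purely
# for logging; nothing in the process ever reads it.
history = []
state = "hungry"
for _ in range(5):
    action = "pick_unknown_berry"
    next_state, reward = step(state, action)  # depends only on (state, action)
    history.append((state, action, reward))
    state = next_state                        # past states are 'forgotten' here
```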
Why is the MDP useful?
The MDP is useful to organisations using AI tools with probabilistic dynamics. The machine or agent is working in an environment that may produce unpredictable results, yet you still need a way to make future decisions that return greater rewards the next time the machine or agent completes an action.
Using MDP in AI
In the case of reinforcement learning, the goal is to maximise the cumulative reward collected over a sequence of actions. Framing the problem as an MDP lets you combine exploring new options with exploiting the actions that already provide good rewards, so you can work out which decisions lead to the greatest rewards while still trying other possible actions.
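One common way of striking that balance is an epsilon-greedy learning rule, sketched below. This again reuses the illustrative berry MDP from earlier, and the learning rate, discount factor, and exploration rate are arbitrary example values rather than recommended settings.

```python
import random
from collections import defaultdict

# Epsilon-greedy Q-learning on the illustrative berry MDP sketched earlier.
# All hyperparameters here are arbitrary example values.
q = defaultdict(float)               # q[(state, action)] -> estimated long-run reward
alpha, gamma, epsilon = 0.1, 0.9, 0.2

def choose_action(state):
    """Mostly exploit the best-known action, occasionally explore at random."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])

for episode in range(500):
    state = "hungry"
    for _ in range(10):
        action = choose_action(state)
        next_state, reward = step(state, action)
        best_next = max(q[(next_state, a)] for a in actions)
        # Nudge the estimate towards the reward plus the discounted future value.
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state

print(max(actions, key=lambda a: q[("hungry", a)]))  # best-known action when hungry
```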
Real-world examples of using MDP
While the MDP is often used in reinforcement learning, there are many other ways that businesses can utilise this process. Below are some real-world examples of where you might use MDP:
- In robotics, planning paths and making sure that robots learn how to avoid obstacles;
- Helping autonomous vehicles to make real-time decisions within their environment;
- Financial speculation;
- Customer decision-making and loyalty to your company brand;
- Optimising resource allocation in cloud computing.
What are the challenges of using the Markov decision process?
If you want to reason about the likely outcome of a decision, as in financial speculation, the Markov decision process can be useful: in these circumstances you don’t have to define an exact outcome, and can instead account for random variables. It cannot tell you why future states arise, however; it only provides a conditional probability of what might happen next, given your current state.
Assessing future states based only on a present state also has its limitations. For example, if your robot stops working one day, the MDP won’t be able to tell you why. It will only give you a probability of whether it may or may not break down. This means you’ll need to gather additional information about the tools and systems you use if you want to find out why something isn’t working as well as expected.
The Markov decision process can be a useful way to predict which choices lead to the greatest rewards. This makes it a useful tool in reinforcement learning, financial speculation, and learning more about customer choice probabilities. However, one of its major drawbacks is that it can’t provide explanations for the predictions you make with it. Consider using the Markov decision process alongside other tools in your arsenal, such as troubleshooting or using the design process to look for possible weaknesses or faults (if you’re using the MDP in engineering).