10.2 Markov Decision Processes
Finite Markov Decision Processes
Created Date: 2025-07-10
In this section we introduce the formal problem of finite Markov decision processes, or finite MDPs. This problem involves evaluative feedback, as in bandits, but also an associative aspect: choosing different actions in different situations. MDPs are a classical formalization of sequential decision making, where actions influence not just immediate rewards but also subsequent situations, or states, and through those, future rewards.
Thus MDPs involve delayed reward and the need to trade off immediate and delayed reward. Whereas in bandit problems we estimated the value \(q_*(a)\) of each action \(a\), in MDPs we estimate the value \(q_*(s, a)\) of each action \(a\) in each state \(s\), or we estimate the value \(v_*(s)\) of each state given optimal action selections. These state-dependent quantities are essential to accurately assigning credit for long-term consequences to individual action selections.
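As a rough sketch of what is being estimated (the names and data structures below are illustrative, not from the text), the bandit case keeps one value per action, while the MDP case keeps one value per state–action pair or per state:

```python
from collections import defaultdict

# Illustrative tabular value estimates (names are ours, not from the text).

# Bandit setting: one value estimate per action, approximating q_*(a).
q_bandit = defaultdict(float)        # q_bandit[a]

# MDP setting: one value estimate per (state, action) pair,
# approximating q_*(s, a).
q_mdp = defaultdict(float)           # q_mdp[(s, a)]

# State values under optimal action selection, approximating v_*(s).
v_mdp = defaultdict(float)           # v_mdp[s]
```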
MDPs are a mathematically idealized form of the reinforcement learning problem for which precise theoretical statements can be made. We introduce key elements of the problem’s mathematical structure, such as returns, value functions, and Bellman equations.
We try to convey the wide range of applications that can be formulated as finite MDPs. As in all of artificial intelligence, there is a tension between breadth of applicability and mathematical tractability.
10.2.1 The Agent–Environment Interface
MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal. The learner and decision maker is called the agent. The thing it interacts with, comprising everything outside the agent, is called the environment. These interact continually, the agent selecting actions and the environment responding to these actions and presenting new situations to the agent.
The environment also gives rise to rewards, special numerical values that the agent seeks to maximize over time through its choice of actions.
More specifically, the agent and environment interact at each of a sequence of discrete time steps, \(t = 1, 2, 3, \cdots\). At each time step \(t\) the agent receives some representation of the environment's state, \(S_t \in \mathcal{S}\), and on that basis selects an action, \(A_t \in \mathcal{A}(S_t)\), the set of actions available in state \(S_t\). One time step later, in part as a consequence of its action, the agent receives a numerical reward, \(R_{t+1} \in \mathbb{R}\), and finds itself in a new state, \(S_{t+1}\).
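To make this interface concrete, here is a minimal Python sketch of the interaction loop. The `env` and `agent` objects and their methods (`reset`, `step`, `select_action`, `observe`) are hypothetical placeholders, not from the text or any particular library:

```python
def run_episode(env, agent, max_steps=1000):
    """One episode of agent-environment interaction (illustrative only)."""
    state = env.reset()                    # initial state presented by the environment
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.select_action(state)              # A_t, chosen in state S_t
        next_state, reward, done = env.step(action)      # environment returns S_{t+1}, R_{t+1}
        agent.observe(state, action, reward, next_state) # agent may learn from the transition
        total_reward += reward
        state = next_state
        if done:                           # episode ends when the environment signals termination
            break
    return total_reward
```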
10.2.2 Goals and Rewards
In reinforcement learning, the purpose or goal of the agent is formalized in terms of a special signal, called the reward, passing from the environment to the agent. At each time step, the reward is a simple number, \(R_t \in \mathbb{R}\). Informally, the agent's goal is to maximize the total amount of reward it receives. This means maximizing not immediate reward, but cumulative reward in the long run. We can clearly state this informal idea as the reward hypothesis:
That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
The use of a reward signal to formalize the idea of a goal is one of the most distinctive features of reinforcement learning.
10.2.3 Returns and Episodes
So far we have discussed the objective of learning informally. We have said that the agent's goal is to maximize the cumulative reward it receives in the long run. How might this be defined formally? If the sequence of rewards received after time step \(t\) is denoted \(R_{t+1}, R_{t+2}, R_{t+3}, \cdots\) , then what precise aspect of this sequence do we wish to maximize?
In general, we seek to maximize the expected return, where the return, denoted \(G_t\), is defined as some specific function of the reward sequence. In the simplest case the return is the sum of the rewards:
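Written out, following the standard convention that \(T\) denotes the final time step of the episode, this simplest-case return is the undiscounted sum

\[
G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T .
\]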