10.3 Dynamic Programming
Created Date: 2025-07-10
The term dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP). Classical DP algorithms are of limited utility in reinforcement learning both because of their assumption of a perfect model and because of their great computational expense, but they are still important theoretically.
We usually assume that the environment is a finite MDP. That is, we assume that its state, action, and reward sets, \(\mathcal{S}\), \(\mathcal{A}\), and \(\mathcal{R}\), are finite, and that its dynamics are given by a set of probabilities \(p(s', r | s, a)\), for all \(s \in \mathcal{S}\), \(a \in \mathcal{A}\), \(r \in \mathcal{R}\), and \(s' \in \mathcal{S}^+\) (\(\mathcal{S}^+\) is \(\mathcal{S}\) plus a terminal state if the problem is episodic).
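As a concrete, purely illustrative way to hold such dynamics in memory, one might use a dictionary keyed by (state, action) pairs; the states, actions, and rewards below are invented placeholders, not anything defined in the text.

```python
# One possible representation of finite-MDP dynamics p(s', r | s, a):
# a dict mapping (state, action) pairs to a list of
# (next_state, reward, probability) triples.
P = {
    ("s0", "left"):  [("s0", 0.0, 0.9), ("s1", 1.0, 0.1)],
    ("s0", "right"): [("s1", 1.0, 1.0)],
    ("s1", "left"):  [("s0", 0.0, 1.0)],
    ("s1", "right"): [("terminal", 5.0, 1.0)],  # "terminal" stands in for S+ minus S
}

# Sanity check: for every (s, a) pair the outcome probabilities must sum to 1.
for (s, a), outcomes in P.items():
    assert abs(sum(prob for _, _, prob in outcomes) - 1.0) < 1e-9
```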
Although DP ideas can be applied to problems with continuous state and action spaces, exact solutions are possible only in special cases. A common way of obtaining approximate solutions for tasks with continuous states and actions is to quantize the state and action spaces and then apply finite-state DP methods.
The key idea of DP, and of reinforcement learning generally, is the use of value functions to organize and structure the search for good policies. In this chapter we show how DP can be used to compute the value functions defined in Chapter 3. As discussed there, we can easily obtain optimal policies once we have found the optimal value functions, \(v_*\) or \(q_*\), which satisfy the Bellman optimality equations:
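\[
v_*(s) = \max_a \sum_{s', r} p(s', r | s, a)\bigl[r + \gamma v_*(s')\bigr],
\]
\[
q_*(s, a) = \sum_{s', r} p(s', r | s, a)\bigl[r + \gamma \max_{a'} q_*(s', a')\bigr],
\]
for all \(s \in \mathcal{S}\), \(a \in \mathcal{A}\), and \(s' \in \mathcal{S}^+\), where \(\gamma\) is the discount rate.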
10.3.1 Policy Evaluation (Prediction)
First we consider how to compute the state-value function \(v_{\pi}\) for an arbitrary policy \(\pi\). This is called policy evaluation in the DP literature. We also refer to it as the prediction problem.
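The standard iterative approach turns the Bellman expectation equation into an update rule, \(v_{k+1}(s) = \sum_a \pi(a|s) \sum_{s', r} p(s', r | s, a)\bigl[r + \gamma v_k(s')\bigr]\), swept over all states until the values stop changing. Below is a minimal in-place sketch, assuming the dictionary representation of the dynamics shown earlier; the function and argument names are illustrative, not from the text.

```python
def iterative_policy_evaluation(states, p, pi, gamma=0.9, theta=1e-8):
    """Sweep the expected update
        v(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]
    until the largest change in any state's value is below theta.

    states: the non-terminal states S
    p:      dict mapping (s, a) to a list of (next_state, reward, prob) triples
    pi:     dict mapping s to a dict {a: pi(a|s)}
    """
    V = {s: 0.0 for s in states}  # terminal states are handled via V.get(..., 0.0)
    while True:
        delta = 0.0
        for s in states:
            v_new = 0.0
            for a, pi_sa in pi[s].items():
                for s_next, r, prob in p[(s, a)]:
                    v_new += pi_sa * prob * (r + gamma * V.get(s_next, 0.0))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```

For the two-state dynamics sketched above, an equiprobable random policy would be `pi = {"s0": {"left": 0.5, "right": 0.5}, "s1": {"left": 0.5, "right": 0.5}}`. Updating each state in place, rather than from a frozen copy of the previous sweep, is a common variant that also converges.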
10.3.2 Policy Improvement
Our reason for computing the value function for a policy is to help find better policies. Suppose we have determined the value function \(v_{\pi}\) for an arbitrary deterministic policy \(\pi\). For some state \(s\) we would like to know whether or not we should change the policy to deterministically choose an action \(a \ne \pi(s)\).
We know how good it is to follow the current policy from \(s\) - that is, \(v_{\pi}(s)\) - but would it be better or worse to change to the new policy? One way to answer this question is to consider selecting \(a\) in \(s\) and thereafter following the existing policy, \(\pi\). The value of this way of behaving is:
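\[
q_\pi(s, a) = \mathbb{E}\bigl[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a\bigr]
= \sum_{s', r} p(s', r | s, a)\bigl[r + \gamma v_\pi(s')\bigr].
\]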
The key criterion is whether this is greater than or less than \(v_{\pi}(s)\). If it is greater - that is, if it is better to select \(a\) once in \(s\) and thereafter follow \(\pi\) than it would be to follow \(\pi\) all the time - then one would expect it to be better still to select \(a\) every time \(s\) is encountered, and that the new policy would in fact be a better one overall.
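A minimal sketch of this greedy improvement step, again assuming the dictionary representation of the dynamics used above (names are illustrative):

```python
def greedy_policy_improvement(states, actions, p, V, gamma=0.9):
    """Return a deterministic policy that is greedy with respect to V:
        pi'(s) = argmax_a sum_{s',r} p(s',r|s,a) [r + gamma * V(s')]

    p: dict mapping (s, a) to a list of (next_state, reward, prob) triples
    V: dict mapping each non-terminal state to its estimated value v_pi(s)
    """
    pi_new = {}
    for s in states:
        # q[a] is the value of taking a in s and thereafter following pi,
        # i.e. the one-step look-ahead quantity discussed above.
        q = {a: sum(prob * (r + gamma * V.get(s_next, 0.0))
                    for s_next, r, prob in p[(s, a)])
             for a in actions}
        pi_new[s] = max(q, key=q.get)  # ties broken arbitrarily
    return pi_new
```

Alternating this step with policy evaluation gives policy iteration: evaluate the current policy, make the policy greedy with respect to the resulting values, and repeat until the policy no longer changes.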