10.3 DQN
This tutorial shows how to use PyTorch to train a Deep Q Learning (DQN) agent on the CartPole-v1 task from Gymnasium. You might find it helpful to read the original Deep Q Learning (DQN) paper.
10.3.1 Task
The agent has to decide between two actions - moving the cart left or right - so that the pole attached to it stays upright. You can find more information about the environment and other more challenging environments at Gymnasium’s website.

Figure 1 - Cart Pole
As the agent observes the current state of the environment and chooses an action, the environment transitions to a new state and also returns a reward that indicates the consequences of the action. In this task, rewards are +1 for every incremental timestep, and the environment terminates if the pole falls over too far or the cart moves more than 2.4 units away from the center. This means that better performing scenarios will run for a longer duration, accumulating a larger return.
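As a quick illustration of this reward structure, the short sketch below (a random policy used only for inspection, not part of training) steps a CartPole-v1 environment until termination and sums the +1 rewards:

import gymnasium as gym

# Illustration only: step the environment with random actions and sum the rewards.
env = gym.make("CartPole-v1")
state, info = env.reset(seed=0)

total_reward = 0.0
while True:
    action = env.action_space.sample()              # 0 = push cart left, 1 = push cart right
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                          # +1 per timestep
    if terminated or truncated:                     # pole fell over, cart left the track, or time limit
        break

print(f"Return of a random policy: {total_reward}")
env.close()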
The CartPole task is designed so that the inputs to the agent are 4 real values representing the environment state (position, velocity, etc.). We take these 4 inputs without any scaling and pass them through a small fully-connected network with 2 outputs, one for each action. The network is trained to predict the expected value for each action, given the input state. The action with the highest expected value is then chosen.
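Concretely, such a network might look like the following sketch (shown here for orientation; the PyTorch imports it relies on are introduced just below, and the two hidden layers of 128 units are an assumption, as any small fully connected architecture works):

import torch.nn as nn
import torch.nn.functional as F

class DQN(nn.Module):
    """Small fully connected network: 4 state values in, one expected value per action out."""

    def __init__(self, n_observations: int, n_actions: int):
        super().__init__()
        self.layer1 = nn.Linear(n_observations, 128)
        self.layer2 = nn.Linear(128, 128)
        self.layer3 = nn.Linear(128, n_actions)

    def forward(self, x):
        # x is a batch of states of shape (batch_size, n_observations).
        x = F.relu(self.layer1(x))
        x = F.relu(self.layer2(x))
        return self.layer3(x)                       # shape (batch_size, n_actions)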
Before going further, let’s import the needed packages. We need gymnasium for the environment, which can be installed using pip:
pip3 install gymnasium
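A typical set of imports for the rest of this section might look like the following sketch (the exact list depends on which parts you implement; the device selection is a common convenience, not a requirement):

import math
import random
from collections import namedtuple, deque
from itertools import count

import gymnasium as gym
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

env = gym.make("CartPole-v1")

# Train on GPU if one is available, otherwise on CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")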
10.3.2 Replay Memory
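DQN stores the transitions the agent observes in a replay memory and later samples random batches from it for training; sampling at random decorrelates the transitions within a batch and stabilizes learning. A minimal sketch of such a buffer (the (state, action, next_state, reward) layout is an assumption carried through the rest of this section) could be:

import random
from collections import namedtuple, deque

# A transition records what the agent observed at a single step.
Transition = namedtuple("Transition", ("state", "action", "next_state", "reward"))

class ReplayMemory:
    """Bounded buffer of recent transitions, sampled uniformly at random."""

    def __init__(self, capacity: int):
        self.memory = deque([], maxlen=capacity)

    def push(self, *args):
        """Save a transition."""
        self.memory.append(Transition(*args))

    def sample(self, batch_size: int):
        """Return a random batch of transitions for a training step."""
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)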
10.3.3 DQN Algorithm
Our environment is deterministic, so all equations presented here are also formulated deterministically for the sake of simplicity. In the reinforcement learning literature, they would also contain expectations over stochastic transitions in the environment.
Our aim will be to train a policy that tries to maximize the discounted, cumulative reward \(R_{t_0} = \sum_{t = t_0}^{\infty} {\gamma}^{t - t_0} \cdot r_t\), where \(R_{t_0}\) is also known as the return. The discount, \(\gamma\), should be a constant between 0 and 1 that ensures the sum converges. A lower \(\gamma\) makes rewards from the uncertain far future less important for our agent than the ones in the near future that it can be fairly confident about. It also encourages agents to collect reward closer in time than equivalent rewards that are temporally far away in the future.
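As a small numerical illustration (the ten-step episode and \(\gamma = 0.99\) are arbitrary choices), the discounted return of ten +1 rewards is slightly below the undiscounted sum of 10:

gamma = 0.99
rewards = [1.0] * 10                                # ten timesteps of +1 reward, as in CartPole
R = sum(gamma ** t * r for t, r in enumerate(rewards))
print(R)                                            # ~9.56; later rewards count less than earlier ones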
The main idea behind Q-learning is that if we had a function \(Q^*: State \times Action \rightarrow \mathbb{R}\) that could tell us what our return would be if we were to take an action in a given state, then we could easily construct a policy that maximizes our rewards:
\[\pi^*(s) = \arg\max_a Q^*(s, a)\]
However, we don’t know everything about the world, so we don’t have access to \(Q^*\). But, since neural networks are universal function approximators, we can simply create one and train it to resemble \(Q^*\).
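Acting greedily with respect to the learned network then amounts to taking an arg max over its outputs. A sketch (assuming policy_net is an instance of the DQN module above and state is a tensor of shape (1, 4)) might look like:

import torch

def select_greedy_action(policy_net, state):
    """Pick the action with the highest predicted value for a single state."""
    with torch.no_grad():                           # no gradients needed for action selection
        q_values = policy_net(state)                # shape (1, n_actions)
    return int(q_values.argmax(dim=1).item())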
For our training update rule, we’ll use the fact that every \(Q\) function for some policy \(\pi\) obeys the Bellman equation:
\[Q^{\pi}(s, a) = r + \gamma \, Q^{\pi}(s', \pi(s'))\]
The difference between the two sides of the equality is known as the temporal difference error, \(\delta\):
\[\delta = Q(s, a) - \left(r + \gamma \max_{a'} Q(s', a')\right)\]
To minimize this error, we will use the Huber loss. The Huber loss acts like the mean squared error when the error is small, but like the mean absolute error when the error is large - this makes it more robust to outliers when the estimates of \(Q\) are very noisy. We calculate this over a batch of transitions, \(B\), sampled from the replay memory:
\[\mathcal{L} = \frac{1}{|B|} \sum_{(s, a, s', r) \in B} \mathcal{L}(\delta), \qquad \text{where} \quad \mathcal{L}(\delta) = \begin{cases} \frac{1}{2}\delta^2 & \text{for } |\delta| \le 1, \\ |\delta| - \frac{1}{2} & \text{otherwise.} \end{cases}\]
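In PyTorch the Huber loss is available as torch.nn.SmoothL1Loss. The sketch below shows one possible optimization step over a sampled batch; it assumes the DQN module, Transition tuple, and ReplayMemory from the sketches above, and it uses a separate target network for the next-state values, a common DQN stabilization trick not derived here:

import torch
import torch.nn as nn

def optimize_model(policy_net, target_net, memory, optimizer, batch_size=128, gamma=0.99):
    if len(memory) < batch_size:
        return
    transitions = memory.sample(batch_size)
    batch = Transition(*zip(*transitions))          # batch of Transitions -> Transition of batches

    state_batch = torch.cat(batch.state)            # shape (B, 4)
    action_batch = torch.cat(batch.action)          # shape (B, 1), dtype long
    reward_batch = torch.cat(batch.reward)          # shape (B,)

    # Q(s, a) for the actions that were actually taken.
    state_action_values = policy_net(state_batch).gather(1, action_batch)

    # max_a' Q(s', a') from the target network; terminal next states contribute 0.
    non_final_mask = torch.tensor([s is not None for s in batch.next_state], dtype=torch.bool)
    non_final_next_states = torch.cat([s for s in batch.next_state if s is not None])
    next_state_values = torch.zeros(batch_size)
    with torch.no_grad():
        next_state_values[non_final_mask] = target_net(non_final_next_states).max(1).values

    # Huber loss between Q(s, a) and the TD target r + gamma * max_a' Q(s', a').
    expected_values = reward_batch + gamma * next_state_values
    loss = nn.SmoothL1Loss()(state_action_values, expected_values.unsqueeze(1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()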