
I'm doing a project on Reinforcement Learning. I programmed an agent that uses DDQN. There are a lot of tutorials on that, so the code implementation was not that hard.

However, I have trouble understanding how one is supposed to arrive at this kind of algorithm starting from the Bellman equation, and I can't find a good, understandable explanation that addresses this derivation/path of reasoning.

So, my questions are:

  1. How is the loss to train the DQN derived from (or theoretically motivated by) the Bellman equation?
  2. How is it related to the usual Q-learning update?

According to my current notes, the Bellman equation looks like this

$$Q_{\pi} (s,a) = \sum_{s'} P_{ss'}^a (r_{s,a} + \gamma \sum_{a'} \pi(a'|s') Q_{\pi} (s',a')) \label{1}\tag{1} $$

which, to my understanding, is a recursive expression that says: the value of the state-action pair $(s,a)$ is the sum over all possible next states $s'$ of the probability of ending up in that state after taking action $a$ (denoted $P_{ss'}^a$; this is where the environment acts on the agent) times the reward the agent got from taking action $a$ in state $s$, plus the discounted sum over the possible next actions $a'$ of the probability $\pi(a'|s')$ of taking that action times the value of the next state-action pair $(s',a')$.
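To make the recursion concrete, here is a minimal sketch (plain NumPy, on a made-up toy MDP; the names `P`, `pi` and `gamma` are purely illustrative) of one sweep of evaluating the right-hand side of equation (1) for every $(s,a)$ pair:

```python
import numpy as np

n_states, n_actions = 3, 2
gamma = 0.9

# P[s][a] is a list of (probability, next_state, reward) triples,
# i.e. P_{ss'}^a and r_{s,a} from equation (1). Toy dynamics, for illustration.
P = {
    s: {a: [(1.0, (s + a) % n_states, 1.0)] for a in range(n_actions)}
    for s in range(n_states)
}

# A uniformly random policy pi(a|s).
pi = np.full((n_states, n_actions), 1.0 / n_actions)

Q = np.zeros((n_states, n_actions))
for s in range(n_states):
    for a in range(n_actions):
        # Right-hand side of equation (1):
        # sum_{s'} P_{ss'}^a * (r_{s,a} + gamma * sum_{a'} pi(a'|s') Q(s',a'))
        Q[s, a] = sum(
            p * (r + gamma * np.dot(pi[s_next], Q[s_next]))
            for p, s_next, r in P[s][a]
        )
```

Repeating such sweeps until the values stop changing is exactly iterative policy evaluation for $Q_\pi$.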

The Q-Learning iteration (intermediate step) is often denoted as:

$$Q^{new}(s,a) \leftarrow Q(s,a) + \alpha (r + \gamma \max_{a'} Q(s',a') - Q(s,a)) \label{2}\tag{2}$$

which means that the new Q-value is the old Q-value plus the learning rate, $\alpha$, times the temporal difference error, $(r + \gamma \max_{a'} Q(s',a') - Q(s,a))$, which consists of the actual reward the agent received plus a discount factor times the Q-value of the next state-action pair, minus the old Q-value.
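In code, one application of this update on a tabular Q-function looks like the following sketch (plain NumPy; `s`, `a`, `r`, `s_next` are assumed to come from a single interaction with the environment, and all names are illustrative):

```python
import numpy as np

alpha, gamma = 0.1, 0.99
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))

def q_learning_update(Q, s, a, r, s_next):
    # TD target: r + gamma * max_a' Q(s', a')
    td_target = r + gamma * np.max(Q[s_next])
    # TD error: target minus current estimate
    td_error = td_target - Q[s, a]
    # Move the old estimate a step of size alpha towards the target (equation (2))
    Q[s, a] += alpha * td_error
    return Q
```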

The Bellman equation can be converted into an update rule of this form because an algorithm that repeatedly applies that update rule can be shown to converge, as this answer states.

In the case of (D)DQN, $Q(s,a)$ is estimated by our NN; acting on these estimates (e.g. greedily or $\epsilon$-greedily) yields an action $a$, and from the environment we receive $r$ and $s'$.

Then we feed $s$ into our online NN and $s'$ into our target NN (with Double DQN, $s'$ also goes through the online NN, which selects the maximizing action $a'$, while the target NN provides the value of that action). The resulting $\max_{a'} Q(s',a')$ is multiplied by $\gamma$, and $r$ is added to the product, giving the TD target $r + \gamma \max_{a'} Q(s',a')$. Since the online NN outputted $Q(s,a)$ but should have outputted $r + \gamma \max_{a'} Q(s',a')$, we train the online NN to change its weights so that its output moves closer to this TD target.
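A hedged PyTorch-style sketch of this target computation, assuming PyTorch is used and that `online_net` and `target_net` are `torch.nn.Module`s mapping a batch of states to a `(batch, n_actions)` tensor of Q-values (the function and argument names are illustrative, not from any particular codebase):

```python
import torch

def ddqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        # Double DQN: select a' with the online network ...
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # ... but evaluate Q(s', a') with the target network
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        # TD target: r + gamma * Q_target(s', a'), zeroed at terminal states
        return rewards + gamma * next_q * (1.0 - dones)
```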


1 Answer


The Bellman equation in RL is usually defined as $$v_\pi(s) = \sum_a \pi(a|s) \sum_{s', r} p(s', r|s, a)\left[r + \gamma v_\pi(s')\right] = \mathbb{E}_{s' \sim p, a \sim \pi}\left[r(s, a) + \gamma v_\pi(s')\right] \; .$$ The way you have written it is correct, but I just thought I would point this out. Regardless, your intuition is correct in that it expresses a recursive relationship such that the value of your current state $s$ is equal to the expected reward from this state plus the discounted expected value of the state you transition into.

You do, in fact, implement the Q-learning update in Deep Q-Learning. The loss function that you minimise in DQN is $$ L(\theta) = \mathbb{E}_{(s,a,r,s')\sim U(D)}\left[\left( r + \gamma \max_{a'}Q(s', a'; \theta^-) - Q(s, a; \theta)\right)^2 \right]\;,$$ where $U(D)$ denotes sampling uniformly at random from the replay buffer $D$, $\theta$ are your network parameters (the network parameterises the Q-function), and $\theta^-$ are the parameters of a previous iteration of the network (the target network), which are updated every $c$ steps to help with the convergence of the network.
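Here is a minimal sketch of this loss in code (assuming PyTorch; `online_net` holds $\theta$, `target_net` holds $\theta^-$, and the batch is assumed to have been sampled uniformly from the replay buffer; all names are illustrative):

```python
import torch
import torch.nn.functional as F

def dqn_loss(online_net, target_net, states, actions, rewards, next_states,
             dones, gamma=0.99):
    # Q(s, a; theta) for the actions that were actually taken
    q_sa = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # r + gamma * max_a' Q(s', a'; theta^-), zeroed at terminal states
        target = rewards + gamma * target_net(next_states).max(dim=1).values * (1.0 - dones)
    # Mean squared Bellman error over the batch
    return F.mse_loss(q_sa, target)
```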

As you can see, the loss function is minimising the squared 'Bellman error' from your equation (2). Let's think about why this is.

The TD update you provide gradually shifts the Q-value for $(s, a)$ towards $r + \gamma \max_{a'} Q(s', a')$ - this is what we want, after all, since it eventually converges to the optimal Q-function.

Now let's think about the Deep Q-learning case. We want our network to approximate $Q(s, a)$, so if we train the network, using the MSE loss, with $r + \gamma \max_{a'} Q(s', a')$ as our target, then our network will gradually be shifted towards predicting $r + \gamma \max_{a'} Q(s', a')$ (which, again, would give us optimal Q-values for state-action pairs), just like with the TD update.
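To see the connection directly, consider the special case where the 'network' is just a table with one parameter per $(s, a)$ pair: a gradient-descent step on the squared TD error is then exactly the tabular update from equation (2). A small NumPy sketch with made-up numbers:

```python
import numpy as np

alpha, gamma = 0.1, 0.99
Q = np.zeros((5, 2))
s, a, r, s_next = 0, 1, 1.0, 3           # made-up transition, for illustration

target = r + gamma * np.max(Q[s_next])   # treated as a constant (no gradient)

# Gradient of 0.5 * (target - Q[s, a])**2 with respect to Q[s, a]
grad = -(target - Q[s, a])

# One gradient-descent step with step size alpha ...
Q[s, a] -= alpha * grad
# ... is exactly Q[s, a] += alpha * (target - Q[s, a]), i.e. equation (2).
```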

This assumes that you know how the training of neural networks works; if you don't, then I would recommend asking/searching for a relevant question that explains this.
