
Reading the Retrace paper (*Safe and efficient off-policy reinforcement learning*), I saw that they often use a matrix form of the Bellman operators, for example as in the picture below. How do we derive those forms? Could you point me to a reference in which this is explained?

I am familiar with the tabular RL framework, but I'm having trouble understanding the step from the operators to this matrix form. For example, why does $Q^{\pi} = (I -\gamma P^{\pi})^{-1}r$? I know that for the value $V$ we can write \begin{align} V &= R + \gamma P^{\pi} V \\ V - \gamma P^{\pi} V &= R \\ (I -\gamma P^{\pi}) V &= R \\ V &= (I - \gamma P^{\pi})^{-1} R \end{align} (keeping $V$ on the right of the matrix, since it is a column vector), but this seems slightly different.

[Figure: excerpt from the Retrace paper, *Safe and efficient off-policy reinforcement learning*, showing the matrix form of the Bellman operators]
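To make the matrix form concrete, here is a minimal numerical sketch (my own toy example, not from the paper; the two-state MDP and all its numbers are made up) showing that solving $(I - \gamma P^{\pi})V = R$ directly and iterating the Bellman backup $V \leftarrow R + \gamma P^{\pi} V$ recover the same $V^{\pi}$:

```python
import numpy as np

# Hypothetical 2-state MDP under a fixed policy pi (all numbers made up).
# P_pi[s, s'] = probability of landing in s' from s when following pi.
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
R = np.array([1.0, 0.0])  # expected immediate reward in each state
gamma = 0.9

# Matrix form: solve (I - gamma P_pi) V = R in one shot.
V_direct = np.linalg.solve(np.eye(2) - gamma * P_pi, R)

# Operator form: repeatedly apply the Bellman backup until convergence.
V_iter = np.zeros(2)
for _ in range(1000):
    V_iter = R + gamma * P_pi @ V_iter

print(np.allclose(V_direct, V_iter))  # True: both recover V^pi
```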

nbro

1 Answer


There's not much to derive here: it's simply the definition of the Bellman operator, which comes from the Bellman equation. If you're wondering why \begin{equation} Q^{\pi} = (I - \gamma P^{\pi})^{-1}r \tag{1} \end{equation} first note that the inverse exists: $P^{\pi}$ is a stochastic matrix, so its spectral radius is at most $1$, and for $\gamma < 1$ the matrix $I - \gamma P^{\pi}$ is therefore invertible.

They state that $Q^{\pi}$ is a fixed point, which means that if you apply the Bellman operator to it you get the same value: \begin{equation} T^{\pi}(Q^{\pi}) = Q^{\pi} \end{equation}

You can easily check this, since from $(1)$ \begin{equation} r = (I-\gamma P^{\pi})Q^{\pi}, \end{equation} and if you plug this into the definition of the Bellman operator you get \begin{align} T^{\pi}(Q^{\pi}) &= r + \gamma P^{\pi} Q^{\pi}\\ &= (I - \gamma P^{\pi})Q^{\pi}+ \gamma P^{\pi} Q^{\pi}\\ &= Q^{\pi} \end{align}
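As a quick numerical sanity check of this fixed-point argument (a sketch under my own assumptions: $P^{\pi}$ is just a random row-stochastic matrix over state-action pairs, and the rewards are random), computing $Q$ from $(1)$ and then applying $T^{\pi}$ indeed returns $Q$ unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 6, 0.9  # 6 state-action pairs, discount factor

# Random row-stochastic matrix standing in for P^pi.
P_pi = rng.random((n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)
r = rng.random(n)  # expected immediate rewards

# Q^pi from the matrix form (1): Q = (I - gamma P^pi)^{-1} r.
Q = np.linalg.solve(np.eye(n) - gamma * P_pi, r)

# Applying the Bellman operator returns Q itself: T^pi(Q) = r + gamma P^pi Q.
TQ = r + gamma * P_pi @ Q
print(np.allclose(TQ, Q))  # True: Q^pi is the fixed point of T^pi
```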

Brale