
An MDP is a Markov Reward Process with decisions; it is an environment in which all states are Markov. This is what we want to solve. An MDP is a tuple $(S, A, P, R, \gamma)$, where $S$ is the state space, $A$ is a finite set of actions, and $P$ is the state transition probability function,

$$P_{ss'}^a = \mathbb{P}[S_{t+1} = s' | S_t = s, \hspace{0.1cm}A_t = a] \label{1}\tag{1}$$

$R$ is the reward function,

$$R_s^a = \mathbb{E}[R_{t+1}| S_t =s, A_t = a]$$

and $\gamma$ is a discount factor.

The Bellman expectation equation for the state-value function can then be seen as a system of linear equations in $|S|$ unknowns, given in matrix form by

$$V = R + \gamma PV \label{2}\tag{2}$$

Here $V$ is the vector of state values, $R$ is the vector of expected immediate rewards, and $P$ is the transition probability matrix, whose element at $(i, j)$ is $P[i][j] = P(j \mid i)$, i.e., the probability of going from state $i$ to state $j$.

If $P$ is given, we can treat equation $\ref{2}$ as a linear system in $V$. But $P[i][j] = \sum_a \pi(a \mid i) \, p(j \mid i, a)$, and $\pi(a \mid s)$ (i.e., the probability of taking action $a$ in state $s$) is NOT given.

So, how can we frame this problem as the solution to a system of linear equations as in $\ref{2}$, if we only know $P^a_{ss'}$ and do not know $\pi(a \mid s)$, which is needed to calculate $P[i][j]$?

Abc1729

1 Answer


Your equations all look correct to me.

It is not possible to solve the linear equation for state values in the vector $V$ without knowing the policy.

There are ways of working with MDPs, through sampling of actions, state transitions and rewards, where it is possible to estimate value functions without knowing either $\pi(a|s)$ or $P^{a}_{ss'}$. For instance, Monte Carlo policy evaluation or single-step TD learning can both do this. It is also common to work with $\pi(a|s)$ known but $P^{a}_{ss'}$ and $R^{a}_{s}$ unknown in model-free control algorithms such as Q learning.
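
As a rough illustration of the sampling route, here is a minimal tabular TD(0) sketch in Python; the `env.reset()`/`env.step(a)` interface, the `behaviour(s)` action sampler and all hyperparameters are assumptions made up for the example, not anything from the question:

```python
import numpy as np

def td0_evaluate(env, behaviour, n_states, episodes=1000, alpha=0.1, gamma=0.9):
    """Estimate state values purely from sampled transitions.

    Neither the probabilities pi(a|s) nor the model P^a_{ss'} is used:
    `behaviour(s)` just returns a sampled action, and `env.step(a)` is
    assumed to return (next_state, reward, done) from the true dynamics.
    """
    V = np.zeros(n_states)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = behaviour(s)
            s_next, reward, done = env.step(a)
            # One-step TD update towards the bootstrapped target.
            target = reward + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])
            s = s_next
    return V
```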

However, in your case you are correct: in order to solve the simultaneous equations you have presented, you do need to know $\pi(a|s)$.
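
For concreteness, here is a minimal sketch of that direct solve once $\pi(a|s)$ is fixed; the two-state, two-action model, rewards, policy and $\gamma$ below are made-up numbers purely for illustration:

```python
import numpy as np

gamma = 0.9

# Toy model (made up): p[a, s, s2] = p(s2 | s, a) and r[a, s] = R_s^a.
p = np.array([
    [[0.8, 0.2],   # action 0, from state 0
     [0.1, 0.9]],  # action 0, from state 1
    [[0.5, 0.5],   # action 1, from state 0
     [0.3, 0.7]],  # action 1, from state 1
])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])

# The missing ingredient from the question: pi[s, a] = pi(a | s).
pi = np.array([[0.6, 0.4],
               [0.5, 0.5]])

# Collapse the per-action model onto the policy:
# P[s, s2] = sum_a pi(a|s) p(s2|s, a),  R[s] = sum_a pi(a|s) R_s^a.
P = np.einsum('sa,asn->sn', pi, p)
R = np.einsum('sa,as->s', pi, r)

# Equation (2) rearranged to (I - gamma*P) V = R and solved exactly.
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)
```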

This is not as limiting as you might think. You can construct a control method around these simultaneous equations by starting with an arbitrary policy. Either a randomly-chosen deterministic policy or the equiprobable policy is a reasonable first guess. Then, after each solution of the linear equations, you improve the policy so that each action choice maximises the expected return. This is essentially the policy iteration algorithm, but with the policy evaluation step replaced by the linear-equations method for calculating the values.
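
A minimal sketch of that control loop, assuming the same kind of per-action model as above (`p[a, s, s2]`, `r[a, s]` and `gamma` given); the policy evaluation step is the exact linear solve of equation $\ref{2}$:

```python
import numpy as np

def policy_iteration_exact(p, r, gamma=0.9, max_iters=100):
    """Policy iteration with exact policy evaluation via the linear equations.

    p[a, s, s2] = p(s2 | s, a) and r[a, s] = R_s^a are assumed given,
    as in the question; only the policy is unknown and starts arbitrary.
    """
    n_actions, n_states, _ = p.shape
    # Arbitrary starting point: the equiprobable random policy.
    pi = np.full((n_states, n_actions), 1.0 / n_actions)

    for _ in range(max_iters):
        # Evaluation: build P and R under pi, then solve (I - gamma*P) V = R.
        P = np.einsum('sa,asn->sn', pi, p)
        R = np.einsum('sa,as->s', pi, r)
        V = np.linalg.solve(np.eye(n_states) - gamma * P, R)

        # Improvement: act greedily with respect to Q(s, a).
        Q = r.T + gamma * np.einsum('asn,n->sa', p, V)
        new_pi = np.zeros_like(pi)
        new_pi[np.arange(n_states), Q.argmax(axis=1)] = 1.0

        if np.allclose(new_pi, pi):  # policy stable, hence optimal
            break
        pi = new_pi
    return pi, V
```

Calling `policy_iteration_exact(p, r, gamma)` with the toy arrays from the previous sketch returns a greedy deterministic policy together with its exact state values.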

Neil Slater