
Please see slide 32 in the following lecture slides on DP:

https://groups.uni-paderborn.de/lea/share/lehre/reinforcementlearning/lecture_slides/built/Lecture03.pdf

Let $m$ be the size of the action space in an MDP. Why are there up to $m^2$ action values when we consider the complexity of DP based on $q(x,u)$?

DSPinfinity

1 Answer


To characterize the transition dynamics from state $x$ to state $x'$ purely in terms of action values, without involving state values, we must evaluate all possible combinations of actions in both states via their respective action values $q(x,u)$ and $q(x',u')$. Since each state has at most $m$ possible actions, there are at most $m^2$ action combinations for every pair of states $x$ and $x'$. Furthermore, there are at most $n^2$ state-transition pairs, so the overall complexity is upper-bounded by $O(m^2 \cdot n^2)$.
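To make the counting argument concrete, here is a minimal sketch of one DP backup on $q(x,u)$ for a small randomly generated MDP (the MDP itself and all sizes are assumptions for illustration, not taken from the lecture slides). For every $(x,u)$ pair we sum over successor states $x'$ and maximize over successor actions $u'$, which is where the four nested loops, and hence the $O(n^2 \cdot m^2)$ bound, come from:

```python
import numpy as np

# Hypothetical small MDP: n states, m actions (sizes chosen for illustration).
n, m = 4, 3
rng = np.random.default_rng(0)

# P[x, u, x'] = transition probability, R[x, u] = expected reward.
P = rng.random((n, m, n))
P /= P.sum(axis=2, keepdims=True)  # normalize to valid distributions
R = rng.random((n, m))
gamma = 0.9

q = np.zeros((n, m))
q_new = np.zeros_like(q)
ops = 0  # count the innermost action-value lookups

# One DP backup on q(x, u): for each (x, u) pair, sum over successor
# states x' and maximize over their actions u'.
for x in range(n):
    for u in range(m):
        total = R[x, u]
        for x_next in range(n):
            best = max(q[x_next, u_next] for u_next in range(m))
            ops += m  # m action-value lookups per successor state
            total += gamma * P[x, u, x_next] * best
        q_new[x, u] = total

# Per sweep: n * m outer pairs, each touching n * m successor action values.
assert ops == n * n * m * m
print(ops)
```

The inner `max` over $u'$ for every successor $x'$ of every $(x,u)$ pair is exactly the $m^2$-per-state-pair factor the answer describes; with $n^2$ state pairs this gives $n^2 \cdot m^2$ operations per sweep.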

cinch