How does one prove the uniqueness of the value function obtained from value iteration in the case of bounded and undiscounted rewards? I know that this can be proven for the discounted case pretty easily using the Banach fixed point theorem.
In the undiscounted case there is no guarantee of convergence or uniqueness, because the value-iteration backup is no longer a strict contraction: it is merely nonexpansive, so the Banach fixed-point theorem does not apply.
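The standard contraction bound makes this explicit. For the Bellman backup operator $T$ and any two bounded value functions $V, W$,

$$\|TV - TW\|_\infty \le \gamma \, \|V - W\|_\infty,$$

so for $\gamma < 1$ Banach's theorem gives a unique fixed point and geometric convergence, but at $\gamma = 1$ the inequality only says $T$ is nonexpansive, which guarantees neither a unique fixed point nor convergence of the iterates.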
A simple counterexample: take a running reward R of 0 to keep things easy, and an MDP with two states a and b whose transition matrix has 0's on the diagonal and 1's off the diagonal (a deterministic swap a → b → a). If you initialize with V(a) ≠ V(b), each sweep of value iteration flips the values of V(a) and V(b), so the iterates oscillate forever and never converge. Uniqueness fails too: any constant value function V(a) = V(b) = c is a fixed point of the backup.
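The two-state example above can be checked numerically. A minimal sketch (the variable names and the choice of initial values are mine, not part of the original answer):

```python
import numpy as np

# Two-state MDP with deterministic swap transitions a -> b -> a
# and zero running reward, evaluated without discounting.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])   # 0's on the diagonal, 1's off the diagonal
R = np.zeros(2)              # running reward is 0
gamma = 1.0                  # undiscounted

V = np.array([1.0, 0.0])     # initialize with V(a) != V(b)
for k in range(5):
    print(k, V)              # the two entries flip on every sweep
    V = R + gamma * (P @ V)  # value-iteration backup

# Uniqueness also fails: any constant vector is a fixed point.
c = np.full(2, 5.0)
print(np.allclose(R + gamma * (P @ c), c))  # True
```

Running this prints `[1. 0.]`, `[0. 1.]`, `[1. 0.]`, ... indefinitely, confirming the oscillation, while the final check shows a whole family of fixed points.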
me47