I'm working with the FrozenLake environment (8x8) from Gymnasium.
In the deterministic case (is_slippery=False), I understand that value iteration converges to the true Q-values, since the environment is fully observable and the transitions are predictable.
However, in the stochastic case (is_slippery=True), actions may not always go in the intended direction due to the agent "slipping." I'm still applying value iteration and getting results, but this raises the question:
Are the results from value iteration still valid as true Q-values when the environment is stochastic?
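Just to make "true Q-values" precise: I mean the fixed point of the Bellman optimality equation, where the expectation is taken over the slip transition probabilities:

$$Q^*(s,a) \;=\; \sum_{s'} P(s' \mid s, a)\,\bigl[\, R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \,\bigr]$$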
My main concerns are:
- Can we still interpret the resulting value function and derived Q-values as the "true" optimal values under the given transition probabilities?
- Does stochasticity invalidate the correctness of value iteration in this setting?
- Are there any specific convergence concerns or limitations when using value iteration in a stochastic MDP like FrozenLake with is_slippery=True? (See the sketch below for the kind of setup I mean.)
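
For concreteness, here is a minimal value-iteration sketch of the kind of setup I mean. It assumes Gymnasium's FrozenLake8x8-v1 and the tabular transition model exposed at env.unwrapped.P; my actual code may differ in details, so treat this as a simplified illustration rather than the exact implementation:

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake8x8-v1", is_slippery=True)
P = env.unwrapped.P               # P[s][a] -> list of (prob, next_state, reward, terminated)
n_states = env.observation_space.n    # 64
n_actions = env.action_space.n        # 4
gamma, theta = 0.99, 1e-10

V = np.zeros(n_states)
while True:
    delta = 0.0
    for s in range(n_states):
        # Bellman optimality backup: expectation over the slip distribution
        q = [sum(p * (r + gamma * V[s2] * (not done))
                 for p, s2, r, done in P[s][a])
             for a in range(n_actions)]
        best = max(q)
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < theta:
        break

# Derive Q-values from the converged value function
Q = np.array([[sum(p * (r + gamma * V[s2] * (not done))
                   for p, s2, r, done in P[s][a])
               for a in range(n_actions)]
              for s in range(n_states)])
```

Note that the backup already averages over the slip probabilities in P, which is exactly the part I'm unsure how to interpret when calling the result the "true" Q-values.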
I'd appreciate any theoretical or empirical insights into this, especially in the context of small environments like the 8x8 grid.
Thanks in advance!