
I'm working with the FrozenLake environment (8x8) from Gymnasium.

In the deterministic case (is_slippery=False), I understand that value iteration can converge to the true Q-values, since the environment is fully observable and transitions are predictable.

However, in the stochastic case (is_slippery=True), actions may not always go in the intended direction due to the agent "slipping." I'm still applying value iteration and getting results, but this raises the question:

Are the results from value iteration still valid as true Q-values when the environment is stochastic?

My main concerns are:

  • Can we still interpret the resulting value function and derived Q-values as the "true" optimal values under the given transition probabilities?
  • Does stochasticity invalidate the correctness of value iteration in this setting?
  • Are there any specific convergence concerns or limitations when using value iteration in a stochastic MDP like FrozenLake with is_slippery=True?

Appreciate any theoretical or empirical insights into this — especially in the context of small environments like 8x8.

Thanks in advance!

Jien Weng

2 Answers


Q-learning, which you tagged, is a model-free TD learning algorithm and is different from dynamic-programming value iteration, a model-based planning algorithm built on iterated Q-values. Gymnasium's FrozenLake environment provides a complete description of the MDP in the form of a transition probability table accessible via env.P, as shown in the FrozenLake source code below.

def __init__(
        self,
        render_mode: Optional[str] = None,
        desc=None,
        map_name="4x4",
        is_slippery=True,
    ):
        ...
        nA = 4
        nS = nrow * ncol
        ...
        # P[s][a] is a list of (probability, next_state, reward, terminated) tuples
        self.P = {s: {a: [] for a in range(nA)} for s in range(nS)}
        ...
        for row in range(nrow):
            for col in range(ncol):
                s = to_s(row, col)
                for a in range(4):
                    li = self.P[s][a]
                    letter = desc[row, col]
                    if letter in b"GH":
                        # goal and hole states are absorbing: self-loop with probability 1, reward 0
                        li.append((1.0, s, 0, True))
                    else:
                        if is_slippery:
                            # intended direction plus the two perpendicular ones, each with probability 1/3
                            for b in [(a - 1) % 4, a, (a + 1) % 4]:
                                li.append(
                                    (1.0 / 3.0, *update_probability_matrix(row, col, b))
                                )
                        else:
                            li.append((1.0, *update_probability_matrix(row, col, a)))

The same theoretical guarantees apply to value iteration whether the environment is deterministic or stochastic, as long as you have a correct model of the transition probabilities. Value iteration computes the unique fixed point of the Bellman optimality operator, which already incorporates the transition probabilities of your MDP. In the stochastic case, those probabilities simply encode the chance that the agent slips rather than moving in the intended direction.
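
For concreteness, here is a minimal sketch of that model-based planning step, assuming Gymnasium's FrozenLake-v1 with map_name="8x8" and is_slippery=True; the discount factor, the tolerance, and the helper name value_iteration are my own choices rather than anything prescribed by the environment:

import numpy as np
import gymnasium as gym

# Build the stochastic 8x8 lake and read its transition model.
env = gym.make("FrozenLake-v1", map_name="8x8", is_slippery=True)
P = env.unwrapped.P            # P[s][a] = list of (prob, next_state, reward, terminated)
nS = env.observation_space.n
nA = env.action_space.n

def value_iteration(P, nS, nA, gamma=0.99, tol=1e-10):
    """Apply the Bellman optimality backup until the largest per-sweep change is below tol."""
    V = np.zeros(nS)
    while True:
        Q = np.zeros((nS, nA))
        for s in range(nS):
            for a in range(nA):
                for prob, s_next, reward, terminated in P[s][a]:
                    # Expectation over possible outcomes; no bootstrapping from terminal next states.
                    Q[s, a] += prob * (reward + gamma * (0.0 if terminated else V[s_next]))
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new

V, Q = value_iteration(P, nS, nA)
policy = Q.argmax(axis=1)      # greedy policy implied by the converged Q-values

Because the backup is a contraction for gamma < 1, this loop converges whether is_slippery is True or False; the slipping only changes which probabilities appear in P.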

The iterated Q-values are expectations of cumulative reward over the possible outcomes of each state-action pair. When the agent executes the policy in a stochastic environment, individual episode returns will vary due to randomness, but the optimal state values and the policy implied by value iteration do not rely on the environment being deterministic; they only require that the transition probabilities and the reward function are well defined. In a small stochastic environment such as the 8×8 grid, because each Q-value update averages over several possible outcomes, somewhat more iterations are needed than in the deterministic setting to propagate high values across the grid to a given numeric precision.
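
As a rough empirical check, reusing env, V, policy, and the gamma of 0.99 from the sketch above, you can roll out the greedy policy many times: individual episode returns will bounce around, but their average should land close to the planned value of the start state.

# Monte Carlo sanity check: average discounted return of the greedy policy vs. V[start].
returns = []
for _ in range(20_000):
    s, _ = env.reset()
    done, ep_return, discount = False, 0.0, 1.0
    while not done:
        s, r, terminated, truncated, _ = env.step(int(policy[s]))
        ep_return += discount * r
        discount *= 0.99
        done = terminated or truncated
    returns.append(ep_return)

print("planned V[start]:", V[0])
print("average return  :", np.mean(returns))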

cinch

Yes, value iteration still works in a stochastic setting. It's easy to think of stochasticity as something magical, but it's just a different MDP.

The quality (Q) of an action is the expected return of taking that action and then following the policy π from there on. If a given action can lead to several resultant states (S1, ..., Sn), as it can in a stochastic setting, the Q-value is the expected immediate reward of the action plus the values of those resultant states weighted by their transition probabilities.

From a logical standpoint this makes sense. Which action should have the higher Q-value: one with a 50/50 chance of leading to resultant states valued 50 and -50, or one with a 90/10 chance of leading to those same states? It is of course the second action.
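
The same probability-weighted backup, written out for those two hypothetical actions (the numbers are just the ones from the example above):

# In general: Q(s, a) = sum over s' of P(s'|s, a) * (R(s, a, s') + gamma * V(s'))
q_fifty_fifty = 0.5 * 50 + 0.5 * (-50)   # = 0
q_ninety_ten  = 0.9 * 50 + 0.1 * (-50)   # = 40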

Value iteration will converge to the Q-values of the optimal policy. Because it backs up over the full transition model rather than over sampled experience, every possible transition of every action is accounted for in each sweep, so it arrives at the correct values.

The only gotchas are in the settings of value iteration. For example, if it is set up to stop once no value changes by more than some delta, a transition with a very small probability may only contribute changes below that threshold, so the values can appear to "converge" before its effect has fully propagated. With a sufficiently tight tolerance and otherwise correct settings, value iteration will converge; for a discount factor γ < 1, stopping when the largest per-sweep change falls below θ leaves the value estimate within γθ/(1−γ) of the optimum.

foreverska