5

In Q-learning, all resources I've found seem to say that the algorithm to update the Q-table should start at some initial state, and pick actions (which are sometimes random) to explore the state space.

However, wouldn't it be better/faster/more thorough to simply iterate through all possible states? This would ensure that the entire table is updated, instead of just the states we happen to visit. Something like this (for each epoch):

# One epoch: sweep every (state, action) pair once.
for state in range(NUM_STATES):
    for action in range(NUM_ACTIONS):
        # Assumes the environment can be stepped from an arbitrary state.
        next_state, reward = env.step(state, action)
        update_q_table(state, action, next_state, reward)

Is this a viable option? The only drawback I can think of is that it wouldn't be efficient for huge state spaces.

nbro
Kricket

3 Answers

5

If your algorithm is executed multiple (or enough) times in an outer loop, it would converge to results similar to those of Q-learning with $\gamma = 0$ (since you never look at the expected future reward).
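
For reference, the standard tabular Q-learning update is

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right],$$

and with $\gamma = 0$ the bootstrap term $\gamma \max_{a'} Q(s', a')$ vanishes, so the table only ever learns the immediate reward.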

In the multiple-execution case, the difference is that you would spend the same amount of time exploring every possible (state, action) pair, while Q-learning would spend more time on the pairs that seem more promising, and, as you said, this wouldn't be efficient for a problem with a huge number of (state, action) pairs.

If the algorithm is executed only once, then, even for a problem with few (state, action) pairs, you need to assume that an action taken in a state will always produce the same result for your method to work.

In most cases, this isn't true either, because there is some randomness in the reward or in the action (your agent can fail to perform an action), or because the agent's state is limited to its knowledge and so doesn't represent the world perfectly (so the consequences of its actions can vary, just as if the reward were random).

Finally, your algorithm doesn't look at the expected future reward, so it would be equivalent to having $\gamma = 0$. This could be fixed by adding a new loop that updates the table after your current loops (if you execute your algorithm only once), or by adding the expected future reward directly to your Q-table update (if there is an outer loop).
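
As a minimal sketch of the second option (assuming the Q-table is a NumPy array Q of shape (NUM_STATES, NUM_ACTIONS) passed in explicitly, with hypothetical alpha and gamma parameters), the update could incorporate the expected future reward like this:

import numpy as np

def update_q_table(Q, state, action, next_state, reward, alpha=0.1, gamma=0.9):
    # Bootstrap on the best estimated value of the next state,
    # so the expected future reward is propagated (gamma > 0).
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])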

So, in conclusion, without the outer loop, your idea would work for a system with few (state, action) pairs, where your agent has perfect and complete knowledge of its world, the reward doesn't vary, and the agent can't fail to accomplish an action.

While these kinds of systems do exist, I don't think they are environments where one should use Q-learning (or any other form of reinforcement learning), except perhaps for educational purposes.

With an outer loop, your idea would work if you are willing to spend more time training in order to have a more precise Q-table for the least promising (state, action) pairs.

nbro
kirua
2

In short, yes, provided that you have a small number of states.

In pretty much any real system, the number of states is much higher than you could ever hope to explore exhaustively in any reasonable time. This is why you need some sort of exploration/exploitation policy that makes sure you mostly visit promising states, while also checking states that might initially look poor but may lead to better states as you explore further. As a few minutes' thought will convince you, determining the exact nature of that exploration/exploitation trade-off is probably the most important aspect of effective Q-learning (and of pretty much any other search algorithm, for that matter).
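
As one common example of such a policy, $\epsilon$-greedy selection explores with a small probability and otherwise exploits the current estimates (a minimal sketch, assuming the Q-table is a NumPy array with one row per state):

import numpy as np

def epsilon_greedy_action(Q, state, epsilon=0.1):
    # Explore: with probability epsilon, pick a uniformly random action.
    if np.random.random() < epsilon:
        return np.random.randint(Q.shape[1])
    # Exploit: otherwise choose the action with the highest current estimate.
    return int(np.argmax(Q[state]))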

DrMcCleod
-2

The issue is the trade-off between reward and learning: Q-learning attempts to learn while also doing things that produce reward (basically, operating suboptimally).

I'm not sure whether Q-learning is actually any different performance-wise; in the short term, Q-learning will (probably) produce more reward, but it will also miss states.

FourierFlux