
I am working on a team developing a Q-learning-based approach for hyperparameter tuning. I have a disagreement with one of my teammates about how they defined this problem. They defined it as follows (a code sketch of my understanding of the setup follows the list):

  1. The states are the values of the metric we are optimizing (e.g. the F1 score binned as 0, 0.1, 0.2, ..., 1.0).
  2. The actions are "change one of the hyperparameters", so we have as many actions as we have hyperparameters.
  3. The reward is the difference between the metric at the current state and at the previous state, where the metric comes from evaluating the model we are optimizing on the current hyperparameter configuration.
  4. Each action has a fixed sequence of values that we cycle through whenever that action is picked. For example, the batch size can take the values [16, 32, 64]: the first time we choose the batch-size action we set it to 16, the next time to 32, then 64, and then we wrap around to 16 again.
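
To make sure we are discussing the same thing, here is a minimal sketch of this formulation as I understand it. The hyperparameter names and the train_and_eval function are placeholders, not our actual code:

```python
# Minimal sketch of my teammate's formulation as I understand it.
# "train_and_eval" is a placeholder for training the model on a configuration
# and returning the metric (e.g. F1); the hyperparameters are illustrative.
import itertools

# 1. States: the optimized metric, binned to 0.0, 0.1, ..., 1.0
STATES = [round(0.1 * i, 1) for i in range(11)]

# 2./4. Actions: one per hyperparameter; each action cycles through fixed values
CYCLES = {
    "batch_size":    itertools.cycle([16, 32, 64]),
    "learning_rate": itertools.cycle([1e-3, 1e-4, 1e-5]),
}
ACTIONS = list(CYCLES)

config = {"batch_size": 16, "learning_rate": 1e-3}
prev_metric = 0.0

def step(action, train_and_eval):
    """Apply one action and return (next_state, reward)."""
    global prev_metric
    config[action] = next(CYCLES[action])        # 4. advance that value cycle
    metric = train_and_eval(config)              # evaluate current configuration
    reward = metric - prev_metric                # 3. reward = metric difference
    next_state = min(STATES, key=lambda s: abs(s - metric))  # 1. binned metric
    prev_metric = metric
    return next_state, reward
```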

I have various issues with this formulation. Mainly:

  1. The rewards and Q-values are non-deterministic from the agent's point of view. Choosing action a in state s can be a good or a bad decision depending on the previous actions we chose and on which value that action will cycle to next, neither of which is visible in the state. As far as I understand, this makes the Q-values close to useless: the same (state, action) pair can lead to very different outcomes (see the toy example after this list).
  2. With this reward scheme, the first action gets a large reward no matter what, because there is no previous metric value to subtract from it.
  3. I still can't help feeling that the state space itself is problematic. Choosing action a in state s also yields an essentially arbitrary next state, depending on which hyperparameter configuration actually gets evaluated.
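
To make issues 1 and 3 concrete with made-up numbers: the agent only observes the binned metric, so the same (state, action) pair resolves to different hyperparameter values, rewards, and next states depending on where the hidden cycle happens to be.

```python
# Made-up F1 numbers purely for illustration: the same (state=0.6, action=
# "change batch size") pair gives different rewards and next states,
# depending only on the hidden cycle position the agent cannot observe.
import itertools

batch_cycle = itertools.cycle([16, 32, 64])
fake_f1 = {16: 0.62, 32: 0.74, 64: 0.63}    # hypothetical evaluation results

prev_f1 = 0.62
for _ in range(4):
    state = round(prev_f1, 1)               # what the agent actually observes
    bs = next(batch_cycle)                  # hidden: which value comes next
    f1 = fake_f1[bs]
    print(f"state={state}, action=batch_size -> bs={bs}, "
          f"reward={f1 - prev_f1:+.2f}, next_state={round(f1, 1)}")
    prev_f1 = f1
# From state 0.6, the same action is worth 0.00, +0.12, and -0.01 on
# different visits, so Q(0.6, batch_size) averages over outcomes the
# agent has no way to tell apart.
```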

Now I have tried getting my point of view across, but no matter how much I argue, this person thinks I am wrong. Their use of ChatGPT unfortunately fuels this arrogance, as it says yes to every one of their questions and they treat it as their bible.

Is my argument reasonable? Is there really nothing wrong with this approach? I don't know where else to ask this question.


1 Answer


There's a lot being asked here, and I'm not sure I'm tracking the specifics well enough to comment on the formulation itself, but I will try to clarify some RL theory and answer the title question.

If the transition to the next state has no dependence at all on the current state and action, then the process is not a meaningful MDP: a Q-table will at best capture some statistic of the underlying random process, but it will not be a reliable predictor for choosing actions.

However, most realistic processes contain some non-determinism (sensor inaccuracy, actuator under/overshoot, unmeasured factors, etc.), and Q-learning can still be used, since it estimates the expected return under that stochasticity. Distributional RL even provides insight into the non-determinism itself, which can improve policy accuracy further.
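
As a quick toy illustration of both points (my own sketch, unrelated to the asker's specific setup): when the reward and transition ignore the action entirely, tabular Q-learning converges to roughly the same value for every action, so the table carries no policy information; with only mild noise around an action-dependent reward, the ordering of the actions is still recovered.

```python
# Toy sketch (not the asker's problem): tabular Q-learning on two cases.
# Case 1: reward and next state ignore the action entirely -> the Q-table
#         settles at roughly the same value for every action, i.e. it only
#         reflects a statistic of the underlying random process.
# Case 2: the reward depends on the action with mild noise -> the ordering
#         of the actions is still recovered despite the stochasticity.
import random

def run_q_learning(reward_fn, n_states=5, n_actions=3, steps=50_000,
                   alpha=0.1, gamma=0.9, eps=0.2):
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(steps):
        if random.random() < eps:                    # epsilon-greedy action choice
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda x: Q[s][x])
        r = reward_fn(s, a)
        s2 = random.randrange(n_states)              # next state is purely random
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2
    return Q

random.seed(0)
Q_pure_noise = run_q_learning(lambda s, a: random.gauss(0.0, 1.0))
Q_mild_noise = run_q_learning(lambda s, a: a + random.gauss(0.0, 0.1))

print("pure noise, state 0:", [round(v, 2) for v in Q_pure_noise[0]])
print("mild noise, state 0:", [round(v, 2) for v in Q_mild_noise[0]])
# In the first case the three values are roughly equal; in the second,
# action 2 clearly dominates, as expected.
```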

Lastly, commenting on your issue 1: the fact that the outcome depends on previous actions does not by itself doom the approach to non-determinism, as long as a given history plus action leads to a consistent next transition. In that case the process is an n-th-order Markov process, and the Markov property can be restored by augmenting the state with the relevant history. Typical treatments translate the problem into belief states (as in a POMDP) and proceed with Q-learning from there.
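
For the specific setup in the question, a concrete version of that remedy (just a sketch, assuming the cycling scheme described above) is to fold the hidden cycle positions, i.e. the current hyperparameter configuration, into the state, so that a given state and action always determine the next configuration:

```python
# Sketch of state augmentation for the setup in the question: include the
# hidden cycle indices (equivalently, the current hyperparameter configuration)
# in the state. Hyperparameter names and values are illustrative only.
HYPERPARAM_VALUES = {
    "batch_size":    [16, 32, 64],
    "learning_rate": [1e-3, 1e-4, 1e-5],
}

def augmented_state(metric_bin, cycle_idx):
    """State = (binned metric, position of every value cycle).

    cycle_idx is a dict such as {"batch_size": 1, "learning_rate": 0}."""
    return (metric_bin,) + tuple(sorted(cycle_idx.items()))

def apply_action(action, cycle_idx):
    """Advance the chosen hyperparameter's cycle; return (new indices, config)."""
    new_idx = dict(cycle_idx)
    new_idx[action] = (new_idx[action] + 1) % len(HYPERPARAM_VALUES[action])
    config = {h: HYPERPARAM_VALUES[h][i] for h, i in new_idx.items()}
    return new_idx, config
```

With the cycle positions observable, the only remaining stochasticity is evaluation noise, which is exactly the mild kind of non-determinism Q-learning tolerates.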

foreverska