I'm working on an RL algorithm that receives a list of orders and needs to find the optimal clusters based on parameters such as due date and location. I don't know how long an episode should be or how the episode length affects the training results. I also don't know how to handle the case where the model takes no action: should that count as a good action or a bad one? If it's a good action, how should I compute the reward?
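To make the question concrete, here is a minimal sketch of the kind of environment I mean. Everything in it is an assumption made for illustration, not a real implementation: the names `Order`, `ClusteringEnv`, `max_steps`, and `no_op_penalty`, the choice of one step per order, and the cost function (Manhattan distance plus due-date gap). The episode ends once every order has been seen or `max_steps` is reached, and "no action" is modeled as an explicit `NO_OP` action with its own configurable reward.

```python
import random
from dataclasses import dataclass


@dataclass
class Order:
    due: float  # due date, in arbitrary time units
    x: float    # location coordinates
    y: float


class ClusteringEnv:
    """Toy episodic environment: the agent sees one order per step."""

    NO_OP = -1  # hypothetical "take no action" choice

    def __init__(self, orders, n_clusters=3, max_steps=None, no_op_penalty=0.1):
        self.orders = list(orders)
        self.n_clusters = n_clusters
        # Assumed default: an episode is one pass over the order list.
        self.max_steps = len(self.orders) if max_steps is None else max_steps
        self.no_op_penalty = no_op_penalty
        self.reset()

    def reset(self):
        self.t = 0
        self.clusters = [[] for _ in range(self.n_clusters)]
        return self.t

    def step(self, action):
        order = self.orders[self.t]
        if action == self.NO_OP:
            # Skipping an order costs a fixed penalty; setting it to 0
            # would make "no action" neutral instead of bad.
            reward = -self.no_op_penalty
        else:
            cluster = self.clusters[action]
            if cluster:
                # Mean dissimilarity to the orders already in the cluster.
                cost = sum(
                    abs(order.x - o.x) + abs(order.y - o.y) + abs(order.due - o.due)
                    for o in cluster
                ) / len(cluster)
            else:
                cost = 0.0
            reward = -cost
            cluster.append(order)
        self.t += 1
        done = self.t >= min(self.max_steps, len(self.orders))
        return self.t, reward, done


def run_random_episode(env, seed=0):
    """Roll out one episode with a uniformly random policy."""
    rng = random.Random(seed)
    actions = [ClusteringEnv.NO_OP] + list(range(env.n_clusters))
    env.reset()
    total, steps, done = 0.0, 0, False
    while not done:
        _, reward, done = env.step(rng.choice(actions))
        total += reward
        steps += 1
    return total, steps
```

In this framing the episode length is tied to the number of orders, and whether "no action" is good or bad depends entirely on how `no_op_penalty` compares with the cost of a poor assignment, which is exactly the part I am unsure how to set.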