
I'm working on an RL algorithm that receives a list of orders and needs to find the optimal clusters, taking into account parameters such as due date, location, etc. I don't know how long an episode should be or how the episode length affects the training results. I also don't know how to handle the case where the model takes no action: should that be treated as a good action or a bad one? If it's a good action, how should I compute the reward?
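To make the setup concrete, here is a minimal sketch of what such an environment might look like. Everything in it is an assumption made for illustration: the class name `OrderClusteringEnv`, the `Order` fields, the cost function, the `max_steps` episode cap, and the `noop_reward` value are all hypothetical and are not taken from an actual implementation. It only shows where the two open choices (episode length and the reward for taking no action) would enter the code.

```python
import math
from dataclasses import dataclass


@dataclass
class Order:
    due: float  # due date, in arbitrary time units
    x: float    # location coordinates
    y: float


class OrderClusteringEnv:
    """Hypothetical environment: each step assigns one order to a cluster or skips it."""

    def __init__(self, orders, n_clusters=3, max_steps=None, noop_reward=0.0):
        self.orders = list(orders)
        self.n_clusters = n_clusters
        self.noop = n_clusters  # extra action index meaning "take no action"
        # Assumed default: allow up to two passes over the order list.
        self.max_steps = max_steps if max_steps is not None else 2 * len(self.orders)
        # Placeholder value only; choosing it is exactly the open question.
        self.noop_reward = noop_reward
        self.reset()

    def reset(self):
        self.t = 0
        self.pending = list(range(len(self.orders)))
        self.clusters = [[] for _ in range(self.n_clusters)]
        return self._obs()

    def _obs(self):
        # Index of the next order to place, or None when nothing is pending.
        return self.pending[0] if self.pending else None

    def _cost(self, idx, c):
        # Distance to the cluster centroid plus due-date gap; 0 for an empty cluster.
        members = [self.orders[i] for i in self.clusters[c]]
        if not members:
            return 0.0
        o = self.orders[idx]
        cx = sum(m.x for m in members) / len(members)
        cy = sum(m.y for m in members) / len(members)
        cd = sum(m.due for m in members) / len(members)
        return math.hypot(o.x - cx, o.y - cy) + abs(o.due - cd)

    def step(self, action):
        if not self.pending or self.t >= self.max_steps:
            raise RuntimeError("episode is over; call reset()")
        self.t += 1
        if action == self.noop:
            # Defer the current order to the back of the queue.
            self.pending.append(self.pending.pop(0))
            reward = self.noop_reward
        else:
            idx = self.pending.pop(0)
            reward = -self._cost(idx, action)
            self.clusters[action].append(idx)
        done = not self.pending or self.t >= self.max_steps
        return self._obs(), reward, done
```

In this framing, the episode ends either when every order has been assigned or when `max_steps` is reached, so `max_steps` is the episode-length knob and `noop_reward` is the knob for how skipping is scored.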

DeepQZero

0 Answers