
I have a problem similar to the vehicle routing problem (VRP) that I want to solve with reinforcement learning. In this problem, the agent starts from the point $(x_0, y_0)$ and then needs to travel through $N$ other points, $(x_1, y_1), \dots, (x_N, y_N)$. The goal is to minimize the total distance traveled.

Right now, I am modeling a state as a point $(x, y)$. There are 8 possible actions: go east, go north-east, go north, go north-west, go west, go south-west, go south, go south-east. Each action moves the agent by a pace of 100 metres.
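Concretely, the 8 actions can be encoded as unit direction vectors scaled by the pace. A minimal sketch (the names PACE, ACTIONS and apply_action are just illustrative):

import math

PACE = 100.0  # metres moved per action

# 8 compass directions as unit vectors, ordered E, NE, N, NW, W, SW, S, SE
ACTIONS = [(math.cos(k * math.pi / 4), math.sin(k * math.pi / 4)) for k in range(8)]

def apply_action(state, action_index):
    # Move the agent one pace in the chosen direction.
    x, y = state
    dx, dy = ACTIONS[action_index]
    return (x + PACE * dx, y + PACE * dy)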

When the agent gets close enough to a destination point, that point is removed from the list of destination points.

The reward is the reciprocal of the total distance traveled once all destination points have been reached (with a short optimisation that reorders the remaining points for a better reward).
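A rough sketch of one environment step with this reward, reusing PACE and apply_action from the sketch above (the TOLERANCE value for "close enough" is an assumption, not from my actual setup):

import math

TOLERANCE = 50.0  # how close counts as reaching a destination; illustrative value

def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def step(state, destinations, total_distance, action_index):
    # Move, drop any destination the agent got near, and pay out the
    # reciprocal-distance reward only once every destination is visited.
    new_state = apply_action(state, action_index)
    total_distance += PACE
    destinations = [d for d in destinations if distance(new_state, d) > TOLERANCE]
    done = not destinations
    reward = 1.0 / total_distance if done else 0.0
    return new_state, destinations, total_distance, reward, done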

I'm using a DNN to represent the policy of the reinforcement learning agent, so the DNN maps a state to a suitable action.
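For instance, in Keras the policy network could look like this (the architecture is an illustrative guess, not my tuned model):

import tensorflow as tf

# Input: the agent's current (x, y); output: a distribution over the 8 actions.
policy = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(2,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(8, activation="softmax"),
])
policy.compile(optimizer="adam", loss="sparse_categorical_crossentropy")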

However, after every action of the agent that yields a good reward, one more sample is added to the training data, so this is a kind of incremental learning.

Should the policy model be retrained from scratch every time a new sample is added? That takes too much time.

Is there a better RL approach to this problem?


1 Answer


I found a concept called "experience replay", which trains for a single step every time a new data sample is added, instead of training up to the maximum number of epochs each time.

That is, instead of this training loop:

for i in range(max_paces):
    find action for max reward;
    add to trajectory to make inp;
    train(batch_size=len(inp), epochs=max_epochs)

Instead, do the training this way (for a single-episode problem, no incremental data):

for i in range(max_epochs):
    reset environment;
    for j in range(max_paces):
        find action for max reward;
        add to trajectory to make inp;
        train(batch_size=len(inp), epochs=1)

Or this way (for a multi-episode problem, with incremental data):

for i in range(max_epochs):
    reset environment;
    inc_trajectory = get_random_past_data();
    for j in range(max_paces):
        find action for max reward;
        add to inc_trajectory to make inp;
        train(batch_size=len(inp), epochs=1)

For multi-episode problems, especially those with an unlimited number of episodes, the training loop needs to forget (i.e. exclude from training) episodes from the distant past, or select a random batch of old episodes to include in every round of experience replay. Without eliminating some old data, the amount of training data grows without bound, since the number of episodes is unlimited.
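A minimal sketch of such a bounded buffer in Python; get_random_past_data matches the helper in the pseudocode above, and the capacity and batch size are arbitrary choices:

import random
from collections import deque

BUFFER_CAPACITY = 10_000  # oldest samples fall off automatically
BATCH_SIZE = 64

replay_buffer = deque(maxlen=BUFFER_CAPACITY)

def remember(sample):
    # Store one sample; once the buffer is full, the oldest one is forgotten.
    replay_buffer.append(sample)

def get_random_past_data():
    # Select a random batch of past experience for one round of replay.
    k = min(BATCH_SIZE, len(replay_buffer))
    return random.sample(replay_buffer, k)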
