
To explain my question, I think it is best to consider the following example: take an environment where a bridge crane needs to lift a barrel from the position "start" and move it to the position "goal", moving along the axes $X, Y, Z$.

[example figure]

But the movement cannot go straight to the goal. The path from start to goal depends on the obstacles present in the environment and on some practical rules (e.g. the barrel must be lifted first, then moved horizontally, and finally dropped at the goal position).

I'm struggling to figure out how to train an agent to accomplish such a task, but here are my thoughts:

  1. At first I thought of using PPO or SAC as model-free algorithms and assigning different rewards for reaching the goal and the positions in between (let's call them waypoints, e.g. $a$ and $b$). In other words, the agent starts from the start position and gets a reward of +100 points if it reaches $a$ within some time. Now, in order to move forward, I would need to give a higher reward for $b$. But that means that during training the agent will try to reach $b$ directly instead of going to $b$ via $a$ (see the sketch after this list).
  2. Even if the issues in the first point are somehow solvable, there is the problem that the agent would learn to reach the goal through the waypoints seen during training, even if the new environment is now different (obstacles are in different positions, the waypoints changed their positions, etc.). So the agent would learn to "memorize" the training environment without generalizing.
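For point 1, here is a minimal sketch of one way I imagine ordered waypoint rewards could be shaped so they cannot be shortcut: only the *next* unvisited waypoint pays a bonus, so reaching $b$ before $a$ gives nothing. The coordinates, tolerances, and names (`waypoints`, `agent_pos`) are purely illustrative assumptions, not part of my actual setup:

```python
import numpy as np

# Hypothetical ordered waypoints: a, b, then the goal.
waypoints = [np.array([1.0, 0.0, 2.0]),   # a
             np.array([3.0, 0.0, 2.0]),   # b
             np.array([3.0, 0.0, 0.0])]   # goal

def waypoint_reward(agent_pos, next_idx, tol=0.1, bonus=100.0, step_cost=-1.0):
    """Reward for one transition; only the next waypoint in order pays a bonus."""
    reward = step_cost  # small per-step penalty keeps paths short
    if next_idx < len(waypoints) and np.linalg.norm(agent_pos - waypoints[next_idx]) < tol:
        reward += bonus   # skipping ahead to b without passing a pays nothing
        next_idx += 1     # the following waypoint becomes the new target
    return reward, next_idx
```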

What could be the best strategy in such environments? Thanks

Dave

1 Answer


I do not have enough reputation to just comment, but here are my thoughts:

Do you actually need to use subgoals/waypoints? If your objective is simply to train an agent to accomplish such a task, then I don't think you need to break it into subgoals/waypoints. Indeed, the environment you describe can be reduced to a 2D environment with a starting position and a single goal position, where the agent must find the shortest path between these two positions. Since the barrel must be lifted first whatever happens, this can be hard-coded into the reset routine (I don't think there is much use in training an RL agent just to lift an object, as there are not many ways this can be done!), and the same goes for the drop at the goal position (simply run the drop routine once the agent has reached the goal).
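As a hedged sketch of what I mean by that reduction, assuming a Gymnasium-style API (the dimensions, thresholds, and rewards are placeholder values, not a tested design):

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class CranePlanarEnv(gym.Env):
    """Reduced 2D task: the lift is assumed already done at reset, and the drop
    is a scripted routine once the goal is reached, so the agent only learns X/Y motion."""

    def __init__(self, goal=(4.0, 4.0)):
        self.goal = np.asarray(goal, dtype=np.float32)
        self.observation_space = spaces.Box(-10.0, 10.0, shape=(4,), dtype=np.float32)
        self.action_space = spaces.Box(-1.0, 1.0, shape=(2,), dtype=np.float32)  # dx, dy

    def _obs(self):
        # Observe own position and the goal position
        return np.concatenate([self.pos, self.goal]).astype(np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.pos = np.zeros(2, dtype=np.float32)  # "barrel already lifted" here
        return self._obs(), {}

    def step(self, action):
        self.pos = self.pos + 0.1 * np.clip(action, -1.0, 1.0)
        dist = np.linalg.norm(self.pos - self.goal)
        terminated = dist < 0.2          # the scripted drop routine would run here
        reward = -dist + (100.0 if terminated else 0.0)
        return self._obs(), reward, terminated, False, {}
```

With this reduction, finding a short collision-free path is the only thing the policy actually has to learn.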

If you actually need waypoints, then, as @Luca Anzalone suggested in the comments, you can still include the goal's coordinates in your states.
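As a hedged illustration of that idea, the observation could be built so that the currently targeted waypoint (or the final goal) is part of the state, which is what lets the policy generalize to waypoint positions not seen during training; `agent_pos` and `current_waypoint` are hypothetical names:

```python
import numpy as np

def make_observation(agent_pos, current_waypoint):
    """Goal-conditioned state: the policy is told which point to reach next
    instead of having to memorize fixed coordinates."""
    relative = current_waypoint - agent_pos  # relative vector often trains better
    return np.concatenate([agent_pos, current_waypoint, relative]).astype(np.float32)
```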

For Domain Randomization: if the agent does not observe the positions (and sizes?) of the obstacles, then the policy will not be able to know where it can move (this is due to the overly cautious or "conservative" nature of DR policies). The only way this would work is if your obstacles are distributed such that there is one specific path that is always open. Otherwise, you should consider adding the obstacle information to your states (the agent does not need to observe the full space, but it should at least be able to observe when it is close to an obstacle, so that it can avoid it or move away).
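One hedged sketch of how obstacle information could be added to the state: distances to the nearest obstacle along a few fixed ray directions (a crude 2D "lidar"), which gives the agent local awareness without observing the full map. Representing obstacles as `(center, radius)` circles is my own assumption:

```python
import numpy as np

def obstacle_rays(agent_pos, obstacles, n_rays=8, max_range=2.0):
    """Distance to the nearest obstacle along n_rays directions, clipped to max_range.
    `obstacles` is assumed to be a list of (center, radius) circles."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_rays, endpoint=False)
    dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    readings = np.full(n_rays, max_range, dtype=np.float32)
    for center, radius in obstacles:
        rel = np.asarray(center) - agent_pos
        proj = dirs @ rel                      # distance along each ray to the circle center
        perp2 = np.sum(rel * rel) - proj ** 2  # squared perpendicular offset of the ray
        hit = (perp2 <= radius ** 2) & (proj > 0)
        dist = proj - np.sqrt(np.maximum(radius ** 2 - perp2, 0.0))
        readings = np.where(hit, np.minimum(readings, dist), readings)
    return np.clip(readings, 0.0, max_range)
```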

If you really need to train the policy for lifting and dropping, then you could take a look at continual reinforcement learning (specifically multi-task learning?), but I think that even in this case a properly designed reward function could be enough: one that penalizes the agent for starting to move while the barrel is still on the ground (of course this means a 3D environment instead of the 2D case I mentioned above); the penalty can be proportional to the height of the barrel and maybe to the distance from the starting or goal position.
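A hedged sketch of such a reward term for the 3D case; the lift height, weights, and the exact shape of the penalty are placeholder assumptions, not a tested design:

```python
import numpy as np

def shaped_reward(pos, prev_pos, goal, lift_height=1.0, w_move=5.0, w_dist=1.0):
    """Penalize horizontal motion while the barrel is still low; the penalty
    grows with how far below the safe lift height the barrel is."""
    horizontal_step = np.linalg.norm(pos[:2] - prev_pos[:2])
    height_deficit = max(lift_height - pos[2], 0.0)       # 0 once fully lifted
    penalty = w_move * horizontal_step * height_deficit
    # reward progress toward the goal in the horizontal plane
    progress = w_dist * (np.linalg.norm(prev_pos[:2] - goal[:2])
                         - np.linalg.norm(pos[:2] - goal[:2]))
    return progress - penalty
```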

Ahnel