I am working on a research project about the different reward functions used in the RL domain. I have read up on Inverse Reinforcement Learning (IRL) and Reward Shaping (RS). I would like to clarify some doubts I have about the two concepts.

In the case of IRL, the goal is to find a reward function based on the policy that experts follow. I have read that recovering the reward function that the experts were trying to optimize, and then finding an optimal policy for that recovered reward function, can result in a better policy than learning directly from the expert demonstrations (e.g. apprenticeship learning). Why does it lead to a better policy?


1 Answer

Inverse Reinforcement Learning (IRL) is a technique that attempts to recover, from expert demonstrations, the reward function that the expert is implicitly maximising. When solving reinforcement learning problems, the agent maximises a reward function specified by the designer and, in the process of reward maximisation, accomplishes the task it was set out to do. However, for some tasks the reward function is difficult to specify by hand. For example, driving involves many different factors, such as the distance to the car in front, the road conditions, and whether the driver needs to reach the destination quickly. A reward function can be hand-specified based on these features, but when there are trade-offs between them, it is difficult to say how the different desiderata should be weighed against one another.
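To make the trade-off issue concrete, here is a minimal sketch of a hand-specified linear reward over driving-like features. The feature names and the weight values are made up purely for illustration; picking those numbers by hand is exactly the difficulty described above:

```python
import numpy as np

def driving_features(state):
    """Hypothetical hand-designed features of a driving state (illustrative only)."""
    return np.array([
        state["distance_to_car_ahead"],   # larger gap is safer
        -state["road_roughness"],         # rough road conditions are penalised
        -state["time_to_destination"],    # arriving quickly is rewarded
    ])

# Hand-picked trade-off weights: how much is safety worth relative to speed?
weights = np.array([1.0, 0.5, 0.2])

def reward(state):
    """Linear reward r(s) = w . phi(s) under the hand-chosen trade-offs."""
    return weights @ driving_features(state)

example_state = {"distance_to_car_ahead": 12.0,
                 "road_roughness": 0.3,
                 "time_to_destination": 4.0}
print(reward(example_state))
```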

Instead of specifying the trade-offs manually, it is often easier to recover a reward function from expert demonstrations using IRL. Such a reward function can also generalise better to unseen states, as long as the features describing the driving task do not change.

In cases where reward shaping fails to learn a task (such as driving), it can be better to have someone demonstrate the task and to learn a reward function from those demonstrations. Solving the MDP with the learnt reward function then yields a policy that should resemble the demonstrated behaviour. The learnt reward function should also generalise to unseen states, so an agent acting in an unseen state should be able to take the actions an expert would take under the same conditions, assuming that the unseen states come from the same distribution as the states seen during training.
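As a rough illustration of that pipeline, the sketch below runs a simplified projection-style apprenticeship-learning loop (in the spirit of Abbeel and Ng's feature matching) on a toy chain MDP. The chain environment, the one-hot features, and the hidden expert reward are all invented for this example; a real driving task would use hand-designed features like those above:

```python
import numpy as np

# Toy MDP: a 1-D chain of states with two actions (move left / move right).
# The "true" reward is hidden from the learner; only expert behaviour is available.
n_states, n_actions, gamma = 5, 2, 0.9

def phi(s):
    """One-hot state features (stand-ins for hand-designed driving features)."""
    f = np.zeros(n_states)
    f[s] = 1.0
    return f

def step(s, a):
    return min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)

def solve_mdp(w, n_iter=100):
    """Value iteration for the linear reward r(s) = w . phi(s); returns a greedy policy."""
    V = np.zeros(n_states)
    for _ in range(n_iter):
        Q = np.array([[w @ phi(s) + gamma * V[step(s, a)]
                       for a in range(n_actions)] for s in range(n_states)])
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def feature_expectations(policy, s0=0, horizon=30):
    """Discounted feature expectations of rolling out a deterministic policy."""
    mu, s = np.zeros(n_states), s0
    for t in range(horizon):
        mu += gamma ** t * phi(s)
        s = step(s, policy[s])
    return mu

# The expert behaves as if maximising a hidden reward that prefers the last state.
expert_policy = solve_mdp(np.array([0, 0, 0, 0, 1.0]))
mu_expert = feature_expectations(expert_policy)

# Projection-style loop: find reward weights whose optimal policy matches
# the expert's feature expectations.
policy = np.zeros(n_states, dtype=int)          # arbitrary starting policy
mu_bar = feature_expectations(policy)
for _ in range(20):
    w = mu_expert - mu_bar                      # current reward-weight estimate
    policy = solve_mdp(w)                       # RL step under the learnt reward
    mu = feature_expectations(policy)
    d = mu - mu_bar
    if d @ d < 1e-12:
        break
    mu_bar = mu_bar + (d @ (mu_expert - mu_bar)) / (d @ d) * d  # projection update

print("learnt weights:", np.round(w, 2))
print("learnt policy :", policy, " expert policy:", expert_policy)
```

On this toy problem the recovered weights concentrate on the state the expert heads towards, and solving the MDP with them reproduces the expert's behaviour; the point of the sketch is only to show the recover-reward-then-plan loop described in the paragraph above.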

While reward shaping might be able to accomplish the same task, IRL may do better according to some performance metric, which will differ from problem to problem.
