
Please forgive the simplicity of the question, as I have only recently started studying Reinforcement Learning. I am supposed to study a system where the transition probabilities are known, and I have to use Reinforcement Learning.

I am trying to understand the relation between Dynamic Programming and Reinforcement Learning.

Although I have followed several lectures, read the corresponding chapters of Sutton and Barto's book, and watched some videos, it is still not clear to me whether Dynamic Programming (specifically value iteration and policy iteration) is distinct from Reinforcement Learning, or whether policy iteration and value iteration are considered model-based Reinforcement Learning algorithms.

Apart from this, I wonder whether it makes sense to use Reinforcement Learning at all when the transition probabilities are known.

Thank you!

Annaassymeon2

2 Answers


Dynamic programming is an algorithm paradigm (i.e. a way to design algorithms) that can be applied to many problem domains, not just Markov decision processes (MDPs), as long as they satisfy certain conditions (optimal substructure and overlapping subproblems). Policy and value iteration are dynamic programming algorithms applied to problems that can be described as MDPs. If you want a general introduction to these algorithms, I'd recommend the book Introduction to Algorithms by Cormen et al., or maybe start with the Wikipedia article.
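To make the "known model" setting concrete, here is a rough sketch of value iteration in Python for a generic finite MDP. The representation of the model (P[s][a] as a list of (probability, next_state) pairs and R[s][a] as an expected immediate reward) is just a placeholder of my own, not something taken from your question or from the book:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Sketch of value iteration for a finite MDP with a *known* model.

    P[s][a] is a list of (probability, next_state) pairs and R[s][a] is the
    expected immediate reward; both are assumed placeholder structures.
    """
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman optimality backup: sums over the known transition
            # probabilities, so no interaction with an environment is needed.
            q_values = [R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                        for a in range(len(P[s]))]
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:   # stop once the value function has (roughly) converged
            break
    # Greedy policy with respect to the converged value function
    policy = [max(range(len(P[s])),
                  key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
              for s in range(n_states)]
    return V, policy
```

Note that nothing here is learned from samples: the algorithm simply sweeps over the state space using the given probabilities.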

Reinforcement learning is an approach to solving MDPs when you usually do not know the true transition probabilities, although there are also RL algorithms that use or estimate these probabilities; see model-based RL. The difference is that RL is a trial-and-error approach guided by a reward signal/function, i.e. you take actions (initially at random) and progressively learn which ones give you more reward (a.k.a. reinforcement). The usual RL reference is Reinforcement Learning: An Introduction by Sutton and Barto.
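For contrast, here is a minimal sketch of tabular Q-learning, a model-free, trial-and-error RL method. The env object, with reset() and step(action) returning (next_state, reward, done, info), is an assumed Gym-style interface, and the hyperparameter values are arbitrary illustrative choices:

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Sketch of tabular Q-learning: learns from sampled transitions only,
    never from the transition probabilities themselves."""
    Q = defaultdict(float)                      # Q[(state, action)] -> estimated return
    actions = list(range(env.action_space.n))   # assumes a discrete action space
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # Trial and error: mostly exploit current estimates, sometimes explore.
            a = (random.choice(actions) if random.random() < epsilon
                 else max(actions, key=lambda a_: Q[(s, a_)]))
            s_next, r, done, _ = env.step(a)
            # Learn from the observed reward instead of a known model.
            target = r + gamma * max(Q[(s_next, a_)] for a_ in actions) * (not done)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```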

nbro

As you already noted in the comment section, policy iteration and value iteration are applications of Dynamic Programming within reinforcement learning (RL). To address the remaining question from your comment: as your own reference states, almost all value-based RL methods, including model-free methods such as Sarsa and Q-learning, are well described by the interplay of policy evaluation and policy improvement that underlies policy and value iteration.

As Sutton and Barto put it: "We use the term generalized policy iteration (GPI) to refer to the general idea of letting policy-evaluation and policy-improvement processes interact, independent of the granularity and other details of the two processes. Almost all reinforcement learning methods are well described as GPI. That is, all have identifiable policies and value functions, with the policy always being improved with respect to the value function and the value function always being driven toward the value function for the policy. If both the evaluation process and the improvement process stabilize, that is, no longer produce changes, then the value function and policy must be optimal."

They then extend these ideas to control, following the general pattern of on-policy GPI and using $\epsilon$-greedy action selection, and show results for n-step linear Sarsa on the Mountain Car problem. In other words, even a model-free control method such as Sarsa is an instance of GPI.
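As a rough illustration of how a model-free method still fits the GPI template, here is a sketch of 1-step tabular Sarsa with $\epsilon$-greedy action selection (not the n-step linear version mentioned above). The env interface and hyperparameters are again assumptions made purely for the example:

```python
import random
from collections import defaultdict

def sarsa(env, n_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Sketch of 1-step tabular Sarsa viewed as generalized policy iteration."""
    Q = defaultdict(float)
    actions = list(range(env.action_space.n))   # assumes a discrete action space

    def epsilon_greedy(state):
        # Policy improvement (implicit): act (mostly) greedily w.r.t. the current Q.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(n_episodes):
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done, _ = env.step(a)
            a_next = epsilon_greedy(s_next)
            # Policy evaluation: move Q toward the value of the policy actually
            # followed (the on-policy target uses the action that the
            # epsilon-greedy policy will really take next).
            target = r + gamma * Q[(s_next, a_next)] * (not done)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```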

cinch