
Coming from a process (optimal) control background, I have begun studying the field of deep reinforcement learning.

Sutton & Barto (2015) state that

particularly important (to the writing of the text) have been the contributions establishing and developing the relationships to the theory of optimal control and dynamic programming

With an emphasis on the elements of reinforcement learning - that is, policy, agent, environment, etc., what are the key differences between (deep) RL and optimal control theory?

In optimal control we have controllers, sensors, actuators, plants, etc., as elements. Are these different names for similar elements in deep RL? For example, would an optimal control plant be called an environment in deep RL?


2 Answers


The same book Reinforcement learning: an introduction (2nd edition, 2018) by Sutton and Barto has a section, 1.7 Early History of Reinforcement Learning, that describes what optimal control is and how it is related to reinforcement learning. I will quote the most relevant part to answer your question, but you should read all that section to have a full understanding of the relationship between optimal control and reinforcement learning.

The term "optimal control" came into use in the late 1950s to describe the problem of designing a controller to minimize or maximize a measure of a dynamical system's behavior over time. One of the approaches to this problem was developed in the mid-1950s by Richard Bellman and others through extending a nineteenth-century theory of Hamilton and Jacobi. This approach uses the concepts of a dynamical system's state and of a value function, or "optimal return function", to define a functional equation, now often called the Bellman equation. The class of methods for solving optimal control problems by solving this equation came to be known as dynamic programming (Bellman, 1957a). Bellman (1957b) also introduced the discrete stochastic version of the optimal control problem known as Markov decision processes (MDPs). Ronald Howard (1960) devised the policy iteration method for MDPs. All of these are essential elements underlying the theory and algorithms of modern reinforcement learning.

To answer your specific questions.

In optimal control we have controllers, sensors, actuators, plants, etc., as elements. Are these different names for similar elements in deep RL? For example, would an optimal control plant be called an environment in deep RL?

Yes. In reinforcement learning (see the first footnote of the cited book on page 48), the term control is often used as a synonym for action. Similarly, the term controller (or decision maker) is used as a synonym for agent (and sometimes also a synonym for policy, given that the policy usually defines and controls the agent, although the concept of the agent is more abstract and we could associate more than one policy with the same agent). The term environment is also used as a synonym for controlled system (or plant).
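Since the terminology can be confusing at first, here is a tiny, self-contained sketch of the standard agent-environment interaction loop, with the corresponding optimal-control names in the comments. The classes and names (`Plant`, `Controller`) are purely illustrative, not taken from any library.

```python
import random


class Plant:
    """The 'plant' / controlled system, i.e. the RL 'environment'."""

    def __init__(self):
        self.state = 0.0

    def step(self, action):
        # Actuator applies the control (= action); the dynamics evolve the state.
        self.state += action + random.gauss(0.0, 0.1)
        # Sensor reading = observation; negative quadratic cost = reward.
        observation = self.state
        reward = -self.state ** 2
        return observation, reward


class Controller:
    """The 'controller' / decision maker, i.e. the RL 'agent' (here a fixed policy)."""

    def act(self, observation):
        # Policy / control law: a simple proportional feedback rule.
        return -0.5 * observation


env = Plant()          # plant      -> environment
agent = Controller()   # controller -> agent
obs = env.state

for t in range(5):
    action = agent.act(obs)          # control -> action
    obs, reward = env.step(action)   # cost    -> (negative) reward
    print(t, round(obs, 3), round(reward, 3))
```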

See also section 38.8 Notes (page 530) of the book Bandit Algorithms by Tor Lattimore and Csaba Szepesvári.

nbro

As a supplement to nbro's nice answer, I think a major difference between RL and optimal control lies in the motivation behind the problem you're solving. As has been pointed out by comments and answers here (as well as the OP), the line between RL and optimal control can be quite blurry.

Consider the Linear-Quadratic-Gaussian (LQG) algorithm, which is generally considered to be an optimal control method. Here a controller is computed given a stochastic model of the environment and a cost function.
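To make that concrete, here is a minimal sketch of the "model + cost in, controller out" step for the LQR part of LQG (the certainty-equivalent feedback gain), using SciPy's discrete-time Riccati solver. The matrices are made-up illustrative values, not taken from any particular system.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Linear model x_{t+1} = A x_t + B u_t (+ Gaussian noise, which doesn't affect the gain)
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])

# Quadratic cost: sum_t x_t^T Q x_t + u_t^T R u_t
Q = np.eye(2)
R = np.array([[0.01]])

# Solve the discrete-time algebraic Riccati equation and form the feedback gain K.
P = solve_discrete_are(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

# The resulting controller is the fixed linear feedback law u = -K x
# (in full LQG, x is replaced by a Kalman-filter estimate of the state).
print("Feedback gain K:", K)
```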

Now, consider AlphaZero, which is obviously thought of as an RL algorithm. AlphaZero learns a value function (and thus a policy/controller) in a perfect information setting with a known deterministic model.

So, it's not the stochasticity that separates RL from optimal control, as some people believe. It's also not the presence of a known model. I argue that the difference between RL and optimal control comes from the generality of the algorithms.

For instance, when applying LQG and other optimal control methods, you generally have a specific environment in mind, and the big challenge is modeling the environment and the cost (or reward) function so as to achieve the desired behavior. In RL, on the other hand, the environment is generally treated as a sort of black box. While in the case of AlphaZero the model of the environment is known, the reward function was not designed specifically for the game of chess (for instance, it's +1 for a win and -1 for a loss, whether the game is chess, Go, etc.). Furthermore, the neat thing about AlphaZero is that we can use it to train agents in virtually any perfect-information game without changing the algorithm at all. Another difference is that in RL the agent typically improves itself iteratively through interaction, whereas optimal control methods compute a controller offline, which then stays fixed.

harwiltz