
I was watching a video in an online course where I'm learning about AI. I am a complete beginner in it.

At one point in the course, the instructor says that reinforcement learning (RL) needs a deep learning model (a neural network) to choose an action. But for that, we need expected results (targets) so that the NN can learn how to predict the Q-values.

However, at the beginning of the course, they told me that RL is an unsupervised learning approach, because the agent performs an action, receives the response from the environment, and finally takes the most promising action, that is, the one with the highest Q-value.

But if I'm using deep learning in RL, then, to me, RL looks like a supervised learning approach. I'm a little confused about these things; could someone clarify them for me?

Example from Online course

nbro

2 Answers


Supervised learning

The supervised learning (SL) problem is formulated as follows.

You are given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$, which is assumed to be drawn i.i.d. from an unknown joint probability distribution $p(x, y)$, where $x_i$ represents the $i$th input and $y_i$ is the corresponding label. You choose a loss function $\mathcal{L}: V \times U \rightarrow \mathbb{R}$. Then your goal is to minimize the so-called empirical risk

$$R_{\mathcal{D}}[f]=\frac{1}{N} \sum_{i=1}^N \mathcal{L}(f(x_i), y_i) \tag{0}\label{0}$$

with respect to $f$. In other words, you want to find the $f$ that minimizes the average above, which can also be formally written as $$ f^* = \operatorname{argmin}_f R_{\mathcal{D}}[f] \tag{1}\label{1} $$ Problem \ref{1} is called empirical risk minimization because it is a proxy for the expected risk minimization problem (but you can ignore this for now).
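
For concreteness, here is a minimal sketch of empirical risk minimization in Python for a linear model $f(x) = wx + b$ under the squared loss. The dataset, learning rate, and number of iterations are made-up placeholders, not anything from the question's course.

```python
import numpy as np

# Hypothetical dataset D = {(x_i, y_i)}: inputs and labels (made-up numbers).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])      # roughly y = 2x + 1, plus noise

def f(x, w, b):
    """A simple parametric model f(x) = w*x + b."""
    return w * x + b

def empirical_risk(w, b):
    """R_D[f] = (1/N) * sum_i L(f(x_i), y_i), with L the squared loss."""
    return np.mean((f(x, w, b) - y) ** 2)

# Minimize the empirical risk with plain gradient descent.
w, b = 0.0, 0.0
lr = 0.02
for _ in range(5000):
    err = f(x, w, b) - y
    w -= lr * np.mean(2 * err * x)            # dR/dw
    b -= lr * np.mean(2 * err)                # dR/db

print(w, b, empirical_risk(w, b))             # w ≈ 2, b ≈ 1, small risk
```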

Reinforcement learning

In reinforcement learning, you typically imagine that there's an agent that interacts, in time steps, with an environment by taking actions. At each time step $t$, the agent takes action $a_t$ in state $s_t$, receives a reward $r_t$ from the environment, and the agent and the environment move to the next state $s_{t+1}$.

The goal of the agent is to maximize the expected return

$$\mathbb{E}\left[ G_t \right] = \mathbb{E}\left[ \sum_{i=t+1}^\infty R_i \right]$$

where $t$ is the current time step (so we don't care about the past), $R_i$ is a random variable that represents the reward at time step $i$, and $G_t = \sum_{i=t+1}^\infty R_i$ is the so-called return (i.e. the sum of future rewards, in this case, starting from time step $t+1$), which is also a random variable.
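
As a small illustration (a sketch with made-up numbers, using a finite episode instead of the infinite sum above), the return $G_t$ can be computed from a recorded sequence of rewards like this:

```python
# Rewards observed after the current time step t in one (hypothetical) episode:
# R_{t+1}, R_{t+2}, ..., R_T.
future_rewards = [0.0, 0.0, 1.0, -0.5, 2.0]

# Finite-horizon version of the return: G_t = sum of all future rewards.
G_t = sum(future_rewards)
print(G_t)  # 2.5

# In practice, a discount factor gamma < 1 is often used so that the
# infinite sum above stays finite: G_t = sum_k gamma^k * R_{t+1+k}.
gamma = 0.9
G_t_discounted = sum(gamma ** k * r for k, r in enumerate(future_rewards))
print(G_t_discounted)
```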

In this context, the most important job of the programmer is to define a function $\mathcal{R}(s, a)$, the reward function, which provides the reinforcement (or reward) signal to the RL agent. $\mathcal{R}(s, a)$ will deterministically or stochastically determine the reward that the agent receives every time it takes action $a$ in the state $s$. (Note that $\mathcal{R}$ is different from $R_i$, which is a random variable that represents the reward at time step $i$).
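
For instance, here is a minimal sketch of a hand-written reward function for a hypothetical grid-world task; the goal cell, the moves, and the numeric rewards are all invented for illustration.

```python
GOAL_STATE = (3, 3)   # hypothetical goal cell in a small grid world
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def reward(state, action):
    """A deterministic reward function R(s, a), chosen by the programmer:
    +10 if the action reaches the goal, -1 otherwise (to encourage short paths)."""
    dx, dy = MOVES[action]
    next_state = (state[0] + dx, state[1] + dy)
    return 10.0 if next_state == GOAL_STATE else -1.0

print(reward((3, 2), "up"))     # 10.0: this move reaches the goal
print(reward((0, 0), "right"))  # -1.0: an ordinary step
```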

What is the difference between SL and RL?

In RL, you (the programmer) need to define the reward function $\mathcal{R}$, and you want to maximize the expected return. On the other hand, in SL you are given (or you collect) a dataset $\mathcal{D}$, you choose $\mathcal{L}$ in \ref{0}, and the goal is to find the function $f^*$ that minimizes the empirical risk. The two problems have different settings and different goals, so they are different!

However, every SL problem can be cast as an RL problem. See this answer. Similarly, in certain cases, you can formulate an RL problem as an SL problem. So, although the approaches are different, they are related.

Is RL an unsupervised learning approach?

In RL, you do not tell the agent what action it needs to take. You only say that the action that was taken was "bad", "good" or "so so". The agent needs to figure out which actions to take based on your feedback. In SL, you explicitly say that, for this input $x_i$, the output should be $y_i$.
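
To make this difference concrete, here is a minimal sketch of the tabular Q-learning update (a standard RL algorithm, not necessarily the one from the question's course). Notice that the update uses only the observed reward and the agent's own estimates; a "correct action" label never appears. The states, actions, and hyperparameters are placeholders.

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.99          # learning rate and discount factor (placeholders)
Q = defaultdict(float)            # Q[(state, action)] -> current value estimate

def q_learning_update(s, a, r, s_next, actions):
    """Update Q(s, a) using only the observed reward r and the agent's own
    estimates for the next state; no 'correct action' label is ever given."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    target = r + gamma * best_next                 # bootstrapped target, not a label
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Example: after taking action "right" in state (0, 0), receiving reward -1.0
# and landing in state (1, 0), the agent updates its estimate.
q_learning_update((0, 0), "right", -1.0, (1, 0), ["up", "down", "left", "right"])
print(Q[((0, 0), "right")])
```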

Some people may consider RL an unsupervised learning approach, but I think this is wrong because, in RL, the programmer still needs to define the reward function, so RL isn't totally unsupervised, and it's also not totally supervised. For this reason, many people consider RL an approach that sits between UL and SL.

What is deep learning?

The term deep learning (DL) refers to the use of deep neural networks (i.e. neural networks with many layers, where "many" can mean anything from more than 1 to more than 1000, depending on the context) in machine learning, whether supervised, unsupervised, or reinforcement learning. So, you can apply deep learning to SL, UL, and RL; DL is not restricted to SL.

nbro

Reinforcement learning can be confusing in relation to supervised and unsupervised learning, especially when deep learning models are involved. Here is a simple explanation to make this clearer:

Supervised Learning

In supervised learning, we learn a function that maps inputs to their desired outputs based on example input-output pairs. The basic setup is like this:

  • Dataset: There is a dataset with inputs $x_i$ and corresponding labels $y_i$.
  • Objective: The objective is to learn a function $f(x)$ that makes the most accurate possible predictions of the labels $y_i$.
  • Loss Function: You minimize a loss function $L(f(x), y)$ over the dataset to find the best $f$.

Reinforcement Learning

In reinforcement learning, rather than being shown an example of what to do and being told this is the correct action for a given input, the agent has to learn from a scalar reward signal.

  • Actions and States: At every time step $t$, the agent is in state $s_t$, takes action $a_t$, and moves to a new state $s_{t+1}$.
  • Rewards: The agent receives a reward $r_t$ after taking an action.
  • Goal: The goal is to maximize the cumulative reward, called the return, $G_t = \sum_{i=t+1}^\infty R_i$.

The Role of Deep Learning in RL

Deep learning is commonly used in RL to approximate functions such as the policy (what action to take for a given state) and the value function (expected return from a state). But the presence of deep learning does not fundamentally change RL.
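
This is likely the source of the question's confusion: in deep Q-learning, the "expected results" the network is trained on are not labels supplied by a human; they are targets the agent constructs itself from the observed rewards and its own current estimates. Here is a rough sketch, where `q_network` is an assumed placeholder for any function approximator that returns a vector of Q-values for a state:

```python
import numpy as np

gamma = 0.99  # discount factor (placeholder value)

def td_target(reward, next_state, done, q_network):
    """Build the training 'label' for the Q-network from one transition.
    q_network(state) is assumed to return the Q-value of every action;
    nothing here is a human-provided label."""
    if done:
        return reward
    return reward + gamma * np.max(q_network(next_state))

# The network is then trained, supervised-learning style, to push
# q_network(state)[action] towards td_target(...). The "labels" are
# generated from rewards by the agent itself, not given in a dataset.
```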

RL vs. Supervised Learning

  • Feedback: In supervised learning, you give an explicit label for each input. In RL, you provide rewards: feedback signals from which the agent learns, over time, which actions are best.
  • Learning Process: RL learns through exploration and exploitation, which is far more dynamic than the static datasets used in supervised learning (see the sketch below).
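
As a tiny illustration of the exploration/exploitation trade-off mentioned above, here is a sketch of epsilon-greedy action selection, one common and simple strategy; the names and values are placeholders.

```python
import random

epsilon = 0.1   # probability of exploring (placeholder value)

def choose_action(state, Q, actions):
    """Epsilon-greedy action selection: usually exploit the best-known
    action, but occasionally explore a random one."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```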

Is RL Unsupervised Learning?

Some will argue that RL is unsupervised since it does not depend on labeled data the way supervised learning does. However, RL is not purely unsupervised, for the following two reasons:

  • Guidance: The reward function is a form of guidance, indicating in some sense what the right thing to do is.
  • Goal-oriented: The agent learns to achieve an explicit goal—maximizing reward—a kind of supervision.

Reinforcement learning sits between these two paradigms:

  • It is not strictly supervised in that it does not use labeled input-output pairs in a straightforward way.
  • It is not purely unsupervised because it relies on rewards, a form of feedback that orients learning.

Therefore, RL is often considered a separate machine learning category, which can leverage deep learning techniques to accomplish its goals.

Additional Resources

To gain deeper insights, consider watching the Reinforcement Learning Course by David Silver, which takes a comprehensive approach to understanding the principles and implementations of RL.

Faizy