
In general, the $Q$ function is defined as

$$Q : S \times A \rightarrow \mathbb{R}$$

and, in Q-learning, it is updated with the rule

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha[r_{t+1} + \gamma \max\limits_{a} Q(s_{t+1},a) - Q(s_t,a_t)] $$

$\alpha$ (the learning rate) and $\gamma$ (the discount factor) are hyper-parameters. $r_{t+1}$ is the reward at the next time step. The $Q$ values are initialized arbitrarily.
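To make the update rule concrete, here is a minimal sketch of a single tabular update in Python. The function name `q_update`, the state and action labels, and the hyper-parameter values are hypothetical choices made only for illustration; the table is a dictionary initialized to zero, which is one arbitrary initialization among many.

```python
from collections import defaultdict

def q_update(Q, s, a, r_next, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update to Q[(s, a)] in place and return the new value."""
    # max over a' of Q(s_{t+1}, a')
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    # The bracketed term of the update rule (the temporal-difference error).
    td_error = r_next + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return Q[(s, a)]

# Hypothetical example: two actions, Q values default to 0.0.
Q = defaultdict(float)
actions = ["left", "right"]
Q[("s1", "right")] = 2.0

# 0 + 0.1 * (1.0 + 0.9 * 2.0 - 0) is approximately 0.28
new_value = q_update(Q, "s0", "left", r_next=1.0, s_next="s1", actions=actions)
```

Note that this single step needs only the observed reward and the observed next state, not the rule that produced them.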

In addition to the reward function, which other functions do I need to implement Q-learning?

satya

1 Answer


In addition to the reward function (RF) [*], you also need to define an exploratory policy (for example, $\epsilon$-greedy), which lets you explore the environment and learn the state-action value function $\hat{q}$.

Moreover, although you don't need to know the details of the transition model, often denoted by $p$ (i.e. the specific probabilities of transitioning from one state to another), you need a function that returns the next state $s'$ for each action $a$ that you take in the current state $s$. You may not need to define this function yourself: the next state could be given by some kind of simulator of the environment. For example, in the case of Atari games, the Atari simulator may provide you with the next frame of the game, which you could use to build an approximation of the next state. You can read the Q-learning pseudocode here.
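As a rough illustration of how these pieces fit together, here is a minimal sketch in Python. Everything about the environment is an assumption made up for the example: a five-state chain, a `step` function that plays the role of the simulator, and a reward of 1 for reaching the last state. The hyper-parameter values are arbitrary too. The agent only ever calls `step`; it never looks at transition probabilities.

```python
import random
from collections import defaultdict

N_STATES = 5         # hypothetical chain: states 0..4, state 4 is terminal
ACTIONS = [-1, +1]   # move left or move right

def step(state, action):
    """Toy simulator (an assumption): returns (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0   # the reward function of this toy problem
    return next_state, reward, done

def epsilon_greedy(Q, state, epsilon, rng):
    """Exploratory policy: random action with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    # Break ties randomly so that the initial all-zero table still explores.
    return rng.choice([a for a in ACTIONS if Q[(state, a)] == best])

def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)          # arbitrary initialization: all zeros
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = epsilon_greedy(Q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            # Terminal states have no future value, so the bootstrap term is dropped.
            if done:
                target = reward
            else:
                target = reward + gamma * max(Q[(next_state, a)] for a in ACTIONS)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q

Q = q_learning()
# Greedy policy read off the learned table; in this chain it moves right everywhere.
greedy_policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)}
```

With a real simulator, `step` would be replaced by whatever call returns the next observation and reward; the rest of the loop would stay essentially the same.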

[*] The reward function is defined for the problem and, specifically, for the Markov Decision Process (MDP) that models the problem/environment. The RF is not defined only so that you can apply Q-learning to solve the problem (in fact, you could apply other algorithms, like SARSA), but you do need the RF to use Q-learning; so, yes, you need to define/have the RF before applying Q-learning. You can think of the RF as the learning signal that guides the agent towards the optimal policy, which is why it's specific to each environment/problem. Note that, in theory, there could be more than one RF that leads to the optimal policy for an environment (see potential-based reward shaping for more details). (This paragraph addresses what I originally thought was the question. I'm leaving it here because it may be relevant to readers.)
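To give an idea of what potential-based reward shaping looks like, the shaped reward is usually written as below, where $\Phi : S \rightarrow \mathbb{R}$ is a potential function that you choose (the symbols $r'$ and $\Phi$ are notation introduced here, not part of the question). Rewards shaped in this form are known to leave the optimal policy unchanged, which is why several different RFs can lead to the same optimal behaviour.

$$r'(s, a, s') = r(s, a, s') + \gamma \Phi(s') - \Phi(s)$$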

nbro