
I have read a lot about RL algorithms that update the action-value function at each step with the reward gained at that step. The requirement there is that a reward is obtained after every step.

I have a case with three steps that have to be passed in a specific order. At each step the agent has to choose from a range of actions, and the available actions are specific to each step.

To give an example of my problem: I want the algorithm to produce a sentence of three words. For the first word the agent may choose from ['I', 'Trees'], for the second word from ['am', 'are'], and for the last word from ['nice', 'high']. After the agent has made all its choices, a reward is obtained once for the whole sentence.
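To make this concrete, here is roughly how I would model the environment (the scoring function is just a stand-in for my real evaluation):

```python
# Rough sketch of my setup: the state is the tuple of words chosen so far,
# and the available actions depend on which step we are in.
WORD_CHOICES = [['I', 'Trees'], ['am', 'are'], ['nice', 'high']]

def available_actions(state):
    # state is a tuple of the words chosen so far
    return WORD_CHOICES[len(state)]

def step(state, word):
    next_state = state + (word,)
    done = len(next_state) == 3
    # The reward only exists once the whole sentence is finished;
    # score_sentence is a placeholder for my real evaluation.
    reward = score_sentence(' '.join(next_state)) if done else 0.0
    return next_state, reward, done

def score_sentence(sentence):
    return 0.9 if sentence == 'I am nice' else 0.1   # placeholder scores
```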

Does anyone know which algorithms to use for this kind of problem?


To give a bit more detail on what I already tried:

I thought that using value iteration would be a reasonable approach to test. My problem here is that I don't know how to assign the discounted reward to the chosen actions.

For example, after the last choice I get a reward of 0.9. But how do I update the value for the first action (choosing between 'I' and 'Trees' in my example)?

Neil Slater
Jan

2 Answers


You don't need to have a reward on every single timestep; a reward at the end is enough. Reinforcement learning can deal with the temporal credit assignment problem; all RL algorithms are designed to work with it. It's enough to define a reward at the end where you, for example, give a reward of 1 if the sentence is satisfactory and -1 if it isn't. Regular tabular Q-learning would easily solve the toy problem that you gave as an example.
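To illustrate, here is a minimal tabular Q-learning sketch for the toy sentence problem; the `reward` function is a stand-in for whatever criterion you actually use, and the hyperparameters are just reasonable defaults:

```python
import random
from collections import defaultdict

WORDS = [['I', 'Trees'], ['am', 'are'], ['nice', 'high']]

def reward(sentence):
    # Stand-in for your real scoring: +1 for a sentence we like, -1 otherwise.
    return 1.0 if sentence == ('I', 'am', 'nice') else -1.0

Q = defaultdict(float)          # Q[(state, action)], state = words chosen so far
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for episode in range(5000):
    state = ()
    while len(state) < 3:
        actions = WORDS[len(state)]
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda w: Q[(state, w)])
        next_state = state + (a,)
        done = len(next_state) == 3
        r = reward(next_state) if done else 0.0      # reward only at the end
        # Q-learning update; the max over next actions is 0 at the terminal state
        target = r if done else r + gamma * max(Q[(next_state, w)] for w in WORDS[len(next_state)])
        Q[(state, a)] += alpha * (target - Q[(state, a)])
        state = next_state

# Greedy rollout after training
state = ()
while len(state) < 3:
    state += (max(WORDS[len(state)], key=lambda w: Q[(state, w)]),)
print(' '.join(state))   # 'I am nice'
```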

Brale

You are describing a straightforward Markov Decision Process that could be solved by almost any Reinforcement Learning algorithm.

I have read a lot about RL algorithms that update the action-value function at each step with the reward gained at that step. The requirement there is that a reward is obtained after every step.

This is true. However, the reward value can be zero, and it is quite common to have most states and actions result in zero reward with only some specific ones returning a non-zero value that makes a difference to the goals of the agent.

When this is more extreme, e.g. only one non-zero reward per thousand steps, it is referred to as "sparse rewards" and can be hard to handle well. Your three-steps-then-reward situation is nowhere near this and is not an issue at all.

I thought that using value iteration would be a reasonable approach to test. My problem here is that I don't know how to assign the discounted reward to the chosen actions.

Value iteration should be fine for your test problem, provided the number of word choices is not too high.

The way you assign discounted rewards is to use the Bellman equation as an update:

$$v(s) \leftarrow \max_{a}\left[\sum_{s',r} p(s',r \mid s,a)\big(r + \gamma v(s')\big)\right]$$

For a deterministic environment you can simplify the sum, as $p(s',r \mid s,a)$ will be 1 for one specific combination of $s', r$ and 0 for all others.
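Written out, if $f(s,a)$ denotes the single successor state and $r(s,a)$ the reward for taking action $a$ in state $s$ (notation introduced here for convenience), the update becomes:

$$v(s) \leftarrow \max_{a}\left[r(s,a) + \gamma \, v\big(f(s,a)\big)\right]$$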

It doesn't matter that $r = 0$ for the first two steps. The Bellman equation links time steps by connecting $v(s)$ and $v(s')$. So, over many repetitions of value iteration's main loop, the episode-end rewards get copied, with the discount applied, to their predecessor states. Very quickly in your case you will end up with values for the start states.

For example, after the last choice I get a reward of 0.9. But how do I update the value for the first action (choosing between 'I' and 'Trees' in my example)?

You don't do it directly in a single step. What happens is that repeating the value iteration loop copies the best values to their predecessor states, one possible time step at a time.

On the first loop through all states, you will run an update something like:

$$v(\text{'I'}) \leftarrow 0 + \gamma v(\text{'I am'})$$

and $v(\text{'I am'}) = 0$ initially, so it will learn nothing useful in this first loop. However, it will also learn in the same loop:

$$v(\text{'I am'}) \leftarrow 0 + \gamma v(\text{'I am nice'})$$

So assuming $\gamma = 0.9$ and $v(\text{'I am nice'}) = 0.9$, and that "I am high" scores less than 0.9, then it will set $v(\text{'I am'}) = 0.81$.

On the next loop through states in value iteration, it will then set $v(\text{'I'}) = 0.729$ (assuming "I am" beats "I are" for maximum value).
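As a sanity check, here is a small value-iteration sketch that follows the same convention as the example above: the sentence score is stored as the value of the completed-sentence (terminal) state, per-step rewards are zero, and the scores for sentences other than "I am nice" are made up for illustration.

```python
from itertools import product

WORDS = [['I', 'Trees'], ['am', 'are'], ['nice', 'high']]
GAMMA = 0.9

def score(sentence):
    # Made-up scores: 'I am nice' gets the 0.9 from the example above.
    return 0.9 if sentence == ('I', 'am', 'nice') else 0.1

# All states: the empty prefix, every 1- and 2-word prefix, and every full sentence.
states = [()]
for n in range(1, 4):
    states += list(product(*WORDS[:n]))

# Terminal (3-word) states keep their sentence score as their value.
v = {s: (score(s) if len(s) == 3 else 0.0) for s in states}

for sweep in range(1, 4):
    for s in states:
        if len(s) < 3:   # only non-terminal states get updated; r = 0 on the way
            v[s] = max(0.0 + GAMMA * v[s + (w,)] for w in WORDS[len(s)])
    print(sweep, round(v[('I', 'am')], 4), round(v[('I',)], 4))
# sweep 1: v('I am') becomes 0.81, v('I') is still 0 (it was updated before 'I am')
# sweep 2: v('I') becomes 0.729, matching the value derived above
```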

Neil Slater