5

A deterministic policy in the rock-paper-scissors game can be easily exploited by the opponent - by doing just the right sequence of moves to defeat the agent. More often than not, I've heard that a random policy is the optimal policy in this case - but the argument seems a little informal.

Could someone please expound on this, possibly adding more mathematical details and intuition? I guess the case I'm referring to is that of a game between two RL agents, but I'd be happy to learn about other cases too. Thanks!

EDIT: When would a random policy be optimal in this case?

stoic-santiago
  • 1,201
  • 9
  • 22

1 Answers1

6

For this, we will need game theory.

In game theory, an optimal strategy is one that cannot be exploited by the opponent even if they know your strategy.

Let's say you want a strategy where your move selection is not based on what happened before (so you are not trying to model your opponent, or trick them into believing you will always play scissors and then throw them off, anything like that). A strategy will look like $(P, S, R)$, where $P, S, R \in [0, 1], P+S+R = 1$. You select paper with probability $P$, scissors with probability $S$, rock with probability $R$. Now, if your probabilities are a bit uneven (for example $(0.5, 0.2, 0.3)$) an opponent can abuse that strategy. If your opponent plays with probabilities $(p, s, r)$, their expected reward (counting +1 for win, -1 for loss, 0 for draw) would be $0.5(s - r) + 0.2(r - p) + 0.3(p - s) = 0.1p + 0.2s - 0.3r$. If they wish to maximize their wins, they would play scissors all the time against you, and expect to have a distinct advantage over you.

In general, for a strategy $(P, S, R)$ for you and $(p, s, r)$ for your opponent, your opponent's winnings would be $P(s - r) + S(r - p) + R(p - s) = p(R-S) + s(P-R) + r(S - P)$. If all the partial derivatives of this, with respect to $p$, $s$ and $r$ are 0, the opponent has no way to maximize his winnings; they would have no incentive to play a particular move over any other move. This occurs when $P = S = R = \frac13$.

That's basically how to approach game theory: find a strategy so your opponent has no incentive to choose one action over another. The approach seems a bit counter-intuitive at first (you're trying to find the optimal strategy for your opponent instead of for yourself) but it works for many similar problems.