In reinforcement learning, there are the concepts of stochastic (or probabilistic) and deterministic policies. What is the difference between them?
3 Answers
A deterministic policy is a function of the form $\pi_{\mathbb{d}}: S \rightarrow A$, that is, a function from the set of states of the environment, $S$, to the set of actions, $A$. The subscript $_{\mathbb{d}}$ only indicates that this is a ${\mathbb{d}}$eterministic policy.
For example, in a grid world, the set of states of the environment, $S$, is composed of the cells of the grid, and the set of actions, $A$, is composed of the actions "left", "right", "up" and "down". Given a state $s \in S$, $\pi_{\mathbb{d}}(s)$ is always the same action (e.g. "up"), unless the policy itself changes.
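As a minimal sketch of the grid-world example, a deterministic policy can be stored as a plain lookup table from states to actions. The 2x2 grid, the cell coordinates and the name `pi_d` are assumptions made for illustration only.

```python
# Hypothetical 2x2 grid world: states are (row, col) cells.
ACTIONS = ("left", "right", "up", "down")

# A deterministic policy is a plain mapping S -> A.
deterministic_policy = {
    (0, 0): "right",
    (0, 1): "down",
    (1, 0): "up",
    (1, 1): "left",
}

def pi_d(state):
    """Return the single action the policy prescribes in `state`."""
    return deterministic_policy[state]

# Querying the same state repeatedly always yields the same action.
repeated = {pi_d((0, 0)) for _ in range(1000)}
```

Here `repeated` ends up containing a single action, which is what "deterministic" means in this context.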
A stochastic policy can be represented as a family of conditional probability distributions, $\pi_{\mathbb{s}}(A \mid S)$, over the set of actions, $A$, with one distribution for each state in the set of states, $S$. A probability distribution is a function that assigns a probability to each event (in this case, the events are actions taken in a given state) such that the probabilities sum to $1$.
A stochastic policy is a family and not just one conditional probability distribution because, for a fixed state $s \in S$, $\pi_{\mathbb{s}}(A \mid S = s)$ is a possibly distinct conditional probability distribution. In other words, $\pi_{\mathbb{s}}(A \mid S) = \{ \pi_{\mathbb{s}}(A \mid S = s_1), \dots, \pi_{\mathbb{s}}(A \mid S = s_{|S|})\}$, where $\pi_{\mathbb{s}}(A \mid S = s)$ is a conditional probability distribution over actions given that the state is $s \in S$ and $|S|$ is the size of the set of states of the environment.
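A rough sketch of this "family of distributions" view, assuming a finite action set: each state maps to its own distribution over actions, and acting means sampling from the distribution of the current state. The state names `"s1"`, `"s2"`, the probabilities and the helper names are hypothetical.

```python
import random

# One conditional distribution over actions per state (a "family").
stochastic_policy = {
    "s1": {"left": 0.1, "right": 0.6, "up": 0.2, "down": 0.1},
    "s2": {"left": 0.25, "right": 0.25, "up": 0.25, "down": 0.25},
}

def action_distribution(state):
    """Return pi_s(A | S = state): a full distribution, not a number."""
    return stochastic_policy[state]

def sample_action(state, rng):
    """Draw one action from the distribution attached to `state`."""
    dist = action_distribution(state)
    actions = list(dist)
    weights = [dist[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded only to make the sketch reproducible
sampled = sample_action("s1", rng)
```

Note that `stochastic_policy["s1"]["right"]` is a single number, while `stochastic_policy["s1"]` is a whole distribution, which mirrors the distinction between $\pi_{\mathbb{s}}(a \mid s)$ and $\pi_{\mathbb{s}}(A \mid S = s)$.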
Often, in the reinforcement learning context, a stochastic policy is misleadingly (at least in my opinion) denoted by $\pi_{\mathbb{s}}(a \mid s)$, where $a \in A$ and $s \in S$ are respectively a specific action and state, so $\pi_{\mathbb{s}}(a \mid s)$ is just a number and not a conditional probability distribution. A single conditional probability distribution can be denoted by $\pi_{\mathbb{s}}(A \mid S = s)$, for some fixed state $s \in S$. However, $\pi_{\mathbb{s}}(a \mid s)$ can also denote a family of conditional probability distributions, that is, $\pi_{\mathbb{s}}(A \mid S) = \pi_{\mathbb{s}}(a \mid s)$, if $a$ and $s$ are arbitrary. Alternatively, $a$ and $s$ in $\pi_{\mathbb{s}}(a \mid s)$ are just (dummy or input) variables of the function $\pi_{\mathbb{s}}(a \mid s)$ (i.e. p.m.f. or p.d.f.): this is probably the most sensible way of interpreting $\pi_{\mathbb{s}}(a \mid s)$ when you see this notation (see also this answer). In this case, you could also think of a stochastic policy as a function $\pi_{\mathbb{s}} : S \times A \rightarrow [0, 1]$, but, in my view, although this may be the way you implement a stochastic policy in practice, this notation is misleading, as the action is not conceptually an input to the stochastic policy but rather an output (but in the end this is also just an interpretation).
In the particular case of games of chance (e.g. poker), where there are sources of randomness, a deterministic policy might not always be appropriate. For example, in poker, not all information (e.g. the cards of the other players) is available. In those circumstances, the agent might decide to play differently depending on the round (time step). More concretely, whenever it holds two aces and there are two face-up aces on the table, the agent could go "all-in" $\frac{2}{3}$ of the time and just "raise" the remaining $\frac{1}{3}$ of the time.
A deterministic policy can be interpreted as a stochastic policy that, for each state, assigns probability $1$ to one of the available actions (and $0$ to the remaining actions).
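This interpretation can be sketched as a conversion from a state-to-action table into one-hot distributions. The function name `as_stochastic` and the toy states are assumptions, not part of any library.

```python
ACTIONS = ("left", "right", "up", "down")

def as_stochastic(det_policy, actions):
    """Turn a mapping S -> A into one-hot distributions over `actions`."""
    return {
        state: {a: (1.0 if a == chosen else 0.0) for a in actions}
        for state, chosen in det_policy.items()
    }

# Hypothetical deterministic policy over two states.
det_policy = {"s1": "up", "s2": "left"}
one_hot_policy = as_stochastic(det_policy, ACTIONS)
```

Each resulting distribution puts all of its mass on the action the deterministic policy would have chosen.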
 
    
Deterministic Policy:
It means that for every state there is one clearly defined action that you will take.
For example, we know with 100% certainty that we will take action A from state X.
Stochastic Policy:
It means that for every state there is no single clearly defined action to take; instead, there is a probability distribution over the actions available in that state.
For example, there is a 10% chance of taking action A from state S, a 20% chance of taking action B from state S, and a 70% chance of taking action C from state S. That is, we do not have one fixed action to take, only a probability of taking each action.
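The 10% / 20% / 70% example can be checked empirically with a short sketch: sampling many times from the distribution for state S gives frequencies close to the stated probabilities. The variable names, seed and number of draws are arbitrary choices for illustration.

```python
import random
from collections import Counter

# Distribution over actions in state S, taken from the example above.
policy_at_S = {"A": 0.10, "B": 0.20, "C": 0.70}

rng = random.Random(42)  # seeded only for reproducibility
n_draws = 100_000
draws = rng.choices(
    list(policy_at_S), weights=list(policy_at_S.values()), k=n_draws
)

# Empirical frequency of each action over all draws.
counts = Counter(draws)
frequencies = {a: counts[a] / n_draws for a in policy_at_S}
```

Any single draw is unpredictable, but the long-run frequencies approach the probabilities that define the policy.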
 
    
Apart from the answers above,
Stochastic Policy function: $\pi (s_1s_2 \dots s_n, a_1 a_2 \dots a_n): \mathcal S^n \times \mathcal A^n \rightarrow [0,1]$ is a probability distribution function that gives the probability that the action sequence $a_1a_2 \dots a_n$ is chosen given the state sequence $s_1 s_2 \dots s_n$ [2][3].
In a Markov Decision Process (MDP), this reduces to $\pi (s, a)$ under the Markov assumption [1]: $$ \mathbb P(\omega_{t+1} \mid \omega_t, a_t) = \mathbb P(\omega_{t+1} \mid \omega_t, a_t, \dots, \omega_0, a_0)$$ where $\omega \in \Omega$, the set of observations, and $\mathcal A, \mathcal S$ denote the sets of actions and states respectively. Since the next observation depends only on the present observation and action, and not on the past, the policy function only needs the present state and action as parameters.
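A greedy choice of the next action picks the most probable one [2]: $$ a^* = \arg \max_{a \in \mathcal A} \pi(s_{t+1}, a) $$ More generally, the next action is sampled from the distribution $\pi(s_{t+1}, \cdot)$.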
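The difference between taking the $\arg\max$ of $\pi(s_{t+1}, \cdot)$ and sampling from it can be sketched as follows. The action names and probabilities are hypothetical, and `greedy_action` and `sampled_action` are illustrative helpers rather than any library's API.

```python
import random

# Hypothetical distribution pi(s_{t+1}, .) over three actions.
pi_next = {"a": 0.1, "b": 0.2, "c": 0.7}

def greedy_action(dist):
    """Arg max: always returns the most probable action."""
    return max(dist, key=dist.get)

def sampled_action(dist, rng):
    """Sampling: returns each action with its assigned probability."""
    actions = list(dist)
    return rng.choices(actions, weights=[dist[x] for x in actions], k=1)[0]

rng = random.Random(0)  # seeded only for reproducibility
greedy_set = {greedy_action(pi_next) for _ in range(1000)}
sampled_set = {sampled_action(pi_next, rng) for _ in range(1000)}
```

The greedy rule turns the stochastic policy into a deterministic one, whereas sampling keeps the randomness that the distribution describes.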
Deterministic Policy function [3]: a special case of the Stochastic Policy function where, for a particular $a_o \in \mathcal A$, $\pi(s, a_n) = \delta^o_n$ for all $a_n \in \mathcal A$. Here $\delta$ is the Kronecker delta, so $\pi(s, a_n)$ is $1$ if $n = o$ and $0$ otherwise. In other words, in state $s$ we are certain to choose the particular action $a_o$ and no other. Since the distribution puts all of its probability on a single action, it is often written in the form $\pi(s): \mathcal S \rightarrow \mathcal A$, where the function takes an arbitrary state $s$ and maps it to the action $a$ that is chosen with 100% probability.
IMPORTANT
The Stochastic Policy function should not be confused with the Transition Function [2] (which is also a probability distribution function), $T(s_t, a_t, s_{t+1}): \mathcal S \times \mathcal A \times \mathcal S \rightarrow [0, 1]$, which gives the probability that taking action $a_t$ in state $s_t$ leads to the next state $s_{t+1}$.
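To keep the two apart, here is a small sketch of one interaction step in a made-up two-state MDP: the policy is sampled to pick the action (the agent's randomness), and then the transition function is sampled to pick the next state (the environment's randomness). All states, actions and probabilities are hypothetical.

```python
import random

# Policy: state -> distribution over actions (belongs to the agent).
policy = {
    "s0": {"a": 0.5, "b": 0.5},
    "s1": {"a": 1.0, "b": 0.0},
}

# Transition function: (state, action) -> distribution over next states
# (belongs to the environment).
transition = {
    ("s0", "a"): {"s0": 0.2, "s1": 0.8},
    ("s0", "b"): {"s0": 0.9, "s1": 0.1},
    ("s1", "a"): {"s0": 0.0, "s1": 1.0},
    ("s1", "b"): {"s0": 1.0, "s1": 0.0},
}

def sample(dist, rng):
    """Draw one outcome from a dict mapping outcome -> probability."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes], k=1)[0]

def step(state, rng):
    """One step: sample the action from pi, then the next state from T."""
    action = sample(policy[state], rng)
    next_state = sample(transition[(state, action)], rng)
    return action, next_state

rng = random.Random(1)  # seeded only for reproducibility
action, next_state = step("s0", rng)
```

Even with a deterministic policy (as in state `"s1"` here), the outcome of a step can still be random if the transition function is.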
References:
- [1] MIT OpenCourseWare. 6.825 Techniques in Artificial Intelligence. https://ocw.mit.edu. Page 6. Web. 6 May 2020.
- [2] Simonini, Thomas. An introduction to Policy Gradients with Cartpole and Doom. https://www.freecodecamp.org. 9 May 2018. Web. 6 May 2020.
- [3] Reinforcement Learning. 2.1.1 Special case - Deterministic worlds. https://www.computing.dcu.ie/. Web. 6 May 2020.
 
    
