Questions tagged [stochastic-policy]

For questions related to the concept of a stochastic policy (as defined in reinforcement learning), which is a function from a state to a probability distribution over actions (from that state).
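In the standard notation used in reinforcement learning texts (e.g. Sutton & Barto), such a policy can be written as $$\pi(a \mid s) \doteq \Pr(A_t = a \mid S_t = s), \qquad \sum_{a} \pi(a \mid s) = 1 \text{ for every state } s,$$ whereas a deterministic policy instead maps each state to a single action, $a = \pi(s)$.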

13 questions
16
votes
3 answers

Is the optimal policy always stochastic if the environment is also stochastic?

Is the optimal policy always stochastic (that is, a map from states to a probability distribution over actions) if the environment is also stochastic? Intuitively, if the environment is deterministic (that is, if the agent is in a state $s$ and…
8
votes
3 answers

What is the difference between a stochastic and a deterministic policy?

In reinforcement learning, there are the concepts of stochastic (or probabilistic) and deterministic policies. What is the difference between them?
4
votes
2 answers

How do we estimate the value of a stochastic policy?

I'm learning about reinforcement learning, particularly policy gradient methods and actor-critic methods. I've noticed that many algorithms use stochastic policies during training (i.e. they select actions from a probability distribution). I…
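For reference, the quantity such methods estimate is, in the standard notation, the expected discounted return obtained when actions are sampled from the policy itself: $$v_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s\right],$$ which Monte Carlo rollouts approximate by averaging the returns of sampled trajectories.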
3
votes
1 answer

How is $v_*(s) = \max_{\pi} v_\pi(s)$ also applicable in the case of stochastic policies?

I am reading Sutton & Barto's book "Reinforcement Learning: An Introduction". In this book, they define the optimal value function as: $$v_*(s) = \max_{\pi} v_\pi(s),$$ for all $s \in \mathcal{S}$. Do we take the max over all deterministic policies,…
3
votes
1 answer

In the policy gradient equation, is $\pi(a_{t} | s_{t}, \theta)$ a distribution or a function?

I am learning about policy gradient methods from the Deep RL Bootcamp by Pieter Abbeel and I am a bit stumped by the math presented. In the lecture, he derives the gradient of the log-likelihood of a trajectory to be $$\nabla \log P(\tau^{i};\theta) =…
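In the usual derivation, the environment's transition probabilities do not depend on $\theta$, so they drop out of the gradient and only the policy terms remain: $$\nabla_\theta \log P(\tau^{i};\theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t).$$ Here $\pi_\theta(\cdot \mid s_t)$ is a distribution over actions, and $\pi_\theta(a_t \mid s_t)$ is the probability (or density) it assigns to the sampled action $a_t$.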
3
votes
1 answer

What's the value of making the RL agent's output stochastic opposed to deterministic?

I have a question about a reinforcement learning problem. I'm training an agent to add or delete pixels in a [12 x 12] 2D space (going to be 3D in the future). Its action space consists of two discrete outputs: x[0-12] and y[0-12]. What would be…
3
votes
1 answer

Is it possible for value-based methods to learn stochastic policies?

Is it possible for value-based methods to learn stochastic policies? I'm trying to get a clear picture of the different categories for RL algorithms, and while doing so I started to think about settings where the optimal policy is stochastic…
3
votes
1 answer

Can Q-learning be used to derive a stochastic policy?

In my understanding, Q-learning gives you a deterministic policy. However, can we use some technique to build a meaningful stochastic policy from the learned Q values? I think that simply using a softmax won't work.
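For concreteness, the softmax construction the question refers to is the Boltzmann policy, which turns the learned Q-values for a state into action probabilities. A minimal sketch (the temperature parameter tau here is illustrative, not taken from the question):

    import numpy as np

    def boltzmann_policy(q_values, tau=1.0):
        # Softmax over Q-values with temperature tau: higher tau gives a more
        # uniform distribution, lower tau approaches the greedy (deterministic) policy.
        prefs = np.asarray(q_values, dtype=float) / tau
        prefs -= prefs.max()  # subtract the max for numerical stability
        probs = np.exp(prefs)
        return probs / probs.sum()

    # Example: sample an action for one state from the induced stochastic policy
    q_s = np.array([1.0, 2.0, 0.5])
    action = np.random.choice(len(q_s), p=boltzmann_policy(q_s, tau=0.5))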
2
votes
1 answer

Can a policy with gaussian distribution allow two distinct optimal actions to have distinctively high probabilities?

As an example of the benefits of a stochastic policy, I have often seen the grid world example below. Five blocks in a row: the first, third, and fifth are white (distinguishable states), and the second and fourth are gray (for the agent, these two…
2
votes
1 answer

Is a learned policy, for a deterministic problem, trained in a supervised process, a stochastic policy?

Suppose I trained a neural network with 4 outputs (one for each action: move down, up, left, and right) to move an agent through a grid (a deterministic problem). The output of the neural network is a probability distribution over the 4 actions, due to the…
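As an illustration of the distinction the question is getting at, the same softmax output can induce either a stochastic or a deterministic policy, depending on how the action is chosen from it. A minimal sketch with made-up output probabilities:

    import numpy as np

    # Hypothetical softmax output of the network for one grid state:
    # a distribution over the 4 actions (down, up, left, right).
    probs = np.array([0.70, 0.10, 0.15, 0.05])

    # Stochastic policy: sample the action from the output distribution.
    stochastic_action = np.random.choice(4, p=probs)

    # Deterministic policy: always take the most probable action.
    deterministic_action = int(np.argmax(probs))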
2
votes
1 answer

Did AlphaGo Zero actually beat AlphaGo 100 games to 0?

tl;dr Did AlphaGo Zero and AlphaGo play 100 repetitions of the same sequence of boards, or were there 100 different games? Background: AlphaGo was the first superhuman Go player, but it had human tuning and training. AlphaGo Zero learned to be more…
0
votes
0 answers

Grayscale to RGB888 vs RGB332 to RGB888 in same colorization training between two universes

Suppose there are two parallel universes that train deep learning models for color resolution. The first universe uses grayscale images as input with dimension (640, 480, 1), while the second universe uses RGB332 images as input with the same dimension…
0
votes
1 answer

Consequence of Dvoretzky Stochastic Approximation Theorem

I am trying to understand all the steps needed to prove convergence of the TD(0) algorithm, and I am following a proof which uses a theorem of Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh, from the paper: On the Convergence of Stochastic Iterative Dynamic…