Questions tagged [policies]
For questions related to policies (as defined in reinforcement learning or other AI sub-fields).
76 questions
16 votes · 3 answers
Is the optimal policy always stochastic if the environment is also stochastic?
Is the optimal policy always stochastic (that is, a map from states to a probability distribution over actions) if the environment is also stochastic?
Intuitively, if the environment is deterministic (that is, if the agent is in a state $s$ and…
nbro (42,615)
15 votes · 4 answers
What does "stationary" mean in the context of reinforcement learning?
I think I've seen the expressions "stationary data", "stationary dynamics" and "stationary policy", among others, in the context of reinforcement learning. What do they mean? I think a stationary policy means that the policy does not depend on time,…
Paula Vega (438)
9 votes · 1 answer
What is the difference between a stationary and a non-stationary policy?
In reinforcement learning, there are deterministic and non-deterministic (or stochastic) policies, but there are also stationary and non-stationary policies.
What is the difference between a stationary and a non-stationary policy? How do you…
nbro (42,615)
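A minimal sketch of the distinction raised in the two questions above, assuming a toy tabular setting; the function names and the decaying exploration schedule are illustrative, not taken from either question:

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 5, 3
Q = rng.normal(size=(N_STATES, N_ACTIONS))   # toy action-value table

def stationary_policy(state):
    # Depends only on the state: the same mapping is used at every time step.
    return int(np.argmax(Q[state]))

def non_stationary_policy(state, t):
    # Also depends on the time step t: here the exploration rate decays,
    # so pi_t(a | s) changes as time goes on.
    eps = 1.0 / (1 + t)
    if rng.random() < eps:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(Q[state]))
```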
8 votes · 3 answers
What is the difference between a stochastic and a deterministic policy?
In reinforcement learning, there are the concepts of stochastic (or probabilistic) and deterministic policies. What is the difference between them?
nbro (42,615)
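A small illustration of the deterministic/stochastic distinction asked about above, assuming a single state with three actions; the tabular values and the softmax parameterization are just one common way to build each kind of policy:

```python
import numpy as np

rng = np.random.default_rng(0)
q_values = np.array([1.0, 0.2, -0.5])   # toy action values for one state

def deterministic_policy():
    # pi : S -> A, returns a single action for the state.
    return int(np.argmax(q_values))

def stochastic_policy():
    # pi(. | s) is a distribution over actions (here a softmax); acting
    # means sampling, so repeated calls can return different actions.
    probs = np.exp(q_values) / np.exp(q_values).sum()
    return int(rng.choice(len(q_values), p=probs))
```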
6 votes · 1 answer
Proof that there always exists a dominating policy in an MDP
I think that it is common knowledge that for any infinite-horizon discounted MDP $(S, A, P, r, \gamma)$, there always exists a dominating policy $\pi$, i.e. a policy $\pi$ such that for all policies $\pi'$: $$V_\pi (s) \geq V_{\pi'}(s) \quad…
MMM (185)
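As context for the question above, a sketch of the standard route to such a dominating policy in the discounted setting: the Bellman optimality operator,
$$ (T V)(s) = \max_{a \in A} \Big[ r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Big], $$
is a $\gamma$-contraction, so it has a unique fixed point $V_*$, and any policy that acts greedily with respect to $V_*$ attains it, giving $V_\pi(s) = V_*(s) \geq V_{\pi'}(s)$ for every policy $\pi'$ and every state $s$.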
6 votes · 1 answer
What is the relation between a policy which is the solution to an MDP and a policy like $\epsilon$-greedy?
In the context of reinforcement learning, a policy, $\pi$, is often defined as a function from the space of states, $\mathcal{S}$, to the space of actions, $\mathcal{A}$, that is, $\pi : \mathcal{S} \rightarrow \mathcal{A}$. This function is the…
nbro (42,615)
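To make the relationship asked about above concrete, here is a minimal sketch (the tabular Q and the value of eps are assumptions): an $\epsilon$-greedy policy is a stochastic policy wrapped around the deterministic greedy policy derived from the current action-value estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 4, 3
Q = rng.normal(size=(N_STATES, N_ACTIONS))   # e.g. learned action-value estimates

def greedy_policy(state):
    # The deterministic pi : S -> A of the definition quoted in the question.
    return int(np.argmax(Q[state]))

def epsilon_greedy_policy(state, eps=0.1):
    # A stochastic policy built on top of the greedy one: with probability
    # eps pick a uniformly random action, otherwise act greedily.
    if rng.random() < eps:
        return int(rng.integers(N_ACTIONS))
    return greedy_policy(state)
```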
5 votes · 2 answers
Why is the derivative of this objective function 0 if the policy is deterministic?
In the Berkeley RL class CS294-112 Fa18 9/5/18, they mention that the following gradient would be 0 if the policy is deterministic.
$$
\nabla_{\theta} J(\theta)=E_{\tau \sim \pi_{\theta}(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_{\theta} \log…
jonperl (153)
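The excerpt above truncates the gradient; for reference, a standard score-function (REINFORCE-style) policy gradient consistent with the visible prefix is
$$ \nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\!\left[ \left( \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \right) \left( \sum_{t=1}^{T} r(s_t, a_t) \right) \right], $$
though whether the question's full expression includes a baseline or reward-to-go terms cannot be recovered from the excerpt.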
5 votes · 2 answers
Given two optimal policies, is an affine combination of them also optimal?
If there are two different optimal policies $\pi_1, \pi_2$ in a reinforcement learning task, will the linear combination (or affine combination) of the two policies $\alpha \pi_1 + \beta \pi_2, \alpha + \beta = 1$ also be an optimal policy?
Here I…
yang liu (53)
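One standard fact relevant to the question above, stated as a sketch: in a discounted MDP, a policy $\pi$ is optimal iff in every state it places probability only on maximizing actions, i.e. $\pi(a \mid s) > 0 \Rightarrow a \in \arg\max_{a'} Q_*(s, a')$. If $\alpha, \beta \geq 0$ with $\alpha + \beta = 1$ (a convex rather than a general affine combination), then $\alpha \pi_1 + \beta \pi_2$ also puts mass only on such actions and so satisfies the same characterization; with negative coefficients, $\alpha \pi_1 + \beta \pi_2$ need not even be a valid probability distribution.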
5 votes · 1 answer
How do I compute the variance of the return of an evaluation policy using two behaviour policies?
Suppose there is an evaluation policy called $\pi_{e}$ and there are two behavior policies $\pi_{b1}$ and $\pi_{b2}$. I know that it is possible to estimate the return of policy $\pi_{e}$ through behavior policies via importance sampling, which is…
Amin (481)
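For the off-policy evaluation question above, a minimal sketch of ordinary (per-trajectory) importance sampling with a single behavior policy may help fix notation; the names and the toy data layout are assumptions, and combining the estimates from two behavior policies is exactly the part the question asks about.

```python
import numpy as np

def ois_estimate(trajectories, pi_e, pi_b):
    """Ordinary importance-sampling estimate of the return of pi_e
    from trajectories collected under pi_b.

    trajectories: list of trajectories, each a list of (state, action, reward)
    pi_e, pi_b:   functions (state, action) -> action probability
    """
    per_traj = []
    for traj in trajectories:
        rho, g = 1.0, 0.0
        for s, a, r in traj:
            rho *= pi_e(s, a) / pi_b(s, a)   # cumulative importance ratio
            g += r                           # undiscounted return, for brevity
        per_traj.append(rho * g)
    per_traj = np.asarray(per_traj)
    # The mean is the value estimate; the sample variance of the
    # per-trajectory terms is the quantity to reason about when data
    # from two behavior policies is combined.
    return per_traj.mean(), per_traj.var(ddof=1)
```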
4 votes · 1 answer
An example of a unique value function which is associated with multiple optimal policies
In the 4th paragraph of http://www.incompleteideas.net/book/ebook/node37.html it is mentioned:
Whereas the optimal value functions for states and state-action pairs are unique for a given MDP, there can be many optimal policies
Could you please…
Melanie A (143)
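A one-line example in the spirit of the question above: if in some state $s$ two actions $a_1 \neq a_2$ satisfy $Q_*(s, a_1) = Q_*(s, a_2) = \max_a Q_*(s, a)$, then the deterministic policy that picks $a_1$, the one that picks $a_2$, and any stochastic mixture of the two are all optimal, yet they all share the same (unique) $V_*$ and $Q_*$.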
4 votes · 1 answer
Why do we have two similar action selection strategies for UCB1?
In the literature, there are at least two action selection strategies associated with UCB1. For example, in the paper Algorithms for the multi-armed bandit problem (2000/2014), at time step $t$, an action is…
nbro (42,615)
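For reference on the question above, a sketch of one common form of the UCB1 rule (empirical mean plus $\sqrt{2 \ln t / n_a}$, as in Auer et al.); other presentations use a slightly different exploration term, which may be the source of the two variants the question refers to:

```python
import math

def ucb1_select(counts, means, t):
    """counts[a]: number of pulls of arm a; means[a]: its empirical mean reward."""
    for a, n in enumerate(counts):
        if n == 0:            # play every arm once before using the index
            return a
    scores = [m + math.sqrt(2.0 * math.log(t) / n)
              for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)
```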
4 votes · 1 answer
Why doesn't value iteration use $\pi(a \mid s)$ while policy evaluation does?
I was looking at the Bellman equation, and I noticed a difference between the equations used in policy evaluation and value iteration.
In policy evaluation, there was the presence of $\pi(a \mid s)$, which indicates the probability of choosing…
Chukwudi Ogbonna (125)
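Writing the two updates from the question above side by side (in Sutton and Barto's notation) makes the difference visible. Iterative policy evaluation averages over the policy's action distribution,
$$ v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[ r + \gamma\, v_k(s') \big], $$
whereas value iteration replaces that expectation over $\pi(a \mid s)$ with a maximization over actions,
$$ v_{k+1}(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\big[ r + \gamma\, v_k(s') \big], $$
so no explicit policy is needed until one is extracted greedily at the end.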
4 votes · 3 answers
Is the policy really invariant under affine transformations of the reward function?
In the context of a Markov decision process, this paper says
it is well-known that the optimal policy is invariant to positive affine transformation of the reward function
On the other hand, exercise 3.7 of Sutton and Barto gives an example of a…
IssaRice (181)
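A short worked step behind the claim quoted above, for the infinite-horizon discounted case: if $r'(s,a) = c\, r(s,a) + d$ with $c > 0$, then for every policy $\pi$
$$ V'_\pi(s) = \mathbb{E}_\pi\!\left[ \sum_{t=0}^{\infty} \gamma^t \big( c\, r_t + d \big) \right] = c\, V_\pi(s) + \frac{d}{1-\gamma}, $$
a positive rescaling plus a policy-independent constant, so the ordering of policies, and hence the optimal policy, is unchanged. The apparent tension with Sutton and Barto's exercise typically concerns episodic, undiscounted settings, where the added constant is multiplied by a policy-dependent episode length rather than by the fixed factor $1/(1-\gamma)$.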
4 votes · 1 answer
Why does having a fixed policy change a Markov Decision Process to a Markov Reward Process?
If a policy is fixed, it is said that a Markov Decision Process (MDP) becomes a Markov Reward Process (MRP).
Why is this so? Aren't the transitions and rewards still parameterized by the action and current state? In other words, aren't the…
Peter (43)
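A short sketch of the reduction asked about above: fixing a (possibly stochastic) policy $\pi$ averages the action out of the MDP's dynamics and rewards,
$$ P^\pi(s' \mid s) = \sum_a \pi(a \mid s)\, P(s' \mid s, a), \qquad r^\pi(s) = \sum_a \pi(a \mid s)\, r(s, a), $$
leaving a process whose transitions and rewards depend on the state alone, which is precisely a Markov reward process.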
4 votes · 2 answers
Why is having low variance important in offline policy evaluation of reinforcement learning?
Intuitively, I understand that having an unbiased estimate of a policy is important because being biased just means that our estimate is systematically off from the true value.
However, I don't understand clearly why having lower variance is important. Is…
Hunnam (227)
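One worked identity usually behind answers to the question above: for an estimator $\hat{V}$ of the true value $V$,
$$ \mathbb{E}\big[(\hat{V} - V)^2\big] = \underbrace{\big(\mathbb{E}[\hat{V}] - V\big)^2}_{\text{bias}^2} + \operatorname{Var}(\hat{V}), $$
so even an unbiased off-policy estimate can be far from the truth on any single evaluation if its variance is large; plain importance-sampling estimators are a standard example of this.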