Questions tagged [finite-markov-decision-process]
7 questions
3
votes
2 answers
Is Monte Carlo Tree Search appropriate for problems with large state and action spaces?
I'm doing research on a finite-horizon Markov decision process with $t=1, \dots, 40$ periods. In every time step $t$, the (only) agent has to choose an action $a(t) \in A(t)$, while the agent is in state $s(t) \in S(t)$. The chosen action $a(t)$ in…
D. B.
2
votes
1 answer
Recursive Least squares (RLS) for mini batch
For my application, I am considering a learning problem where I first simulate a batch of episodes, say $n$, and then carry out the recursive least squares update, similar to $TD(1)$.
I know that RLS can be used to update parameters being learned as…
Prakash Gawas
2
votes
1 answer
Continuous-state, continuous-action Markov decision process time complexity estimate: backward induction vs. policy gradient method (RL)
Model Description: Model-based (the entire model is assumed known) Markov decision process.
Time($t$): Finite-horizon discrete time with a discount factor
State($x_t$): Continuous multi-dimensional state
Action($a_t$): Continuous multi-dimensional…
leodongxu
1
vote
0 answers
How to generalize finite MDP to general MDP?
Suppose, for simplicity's sake, that we are in a discrete-time domain with the action set being the same for all states $S \in \mathcal{S}$. Thus, in a finite Markov Decision Process, the sets $\mathcal{A}$, $\mathcal{S}$, and $\mathcal{R}$ have a finite…
gvgramazio
1
vote
0 answers
Is this a bandit problem or an MDP?
I am trying to understand whether this problem can be cast both as a bandit problem and as an MDP.
Let's assume that we are trying to optimize sales $y_t$ based on investments $x_{1, t}, x_{2, t}$ over some horizon $H$. To model sales for timestep…
hugh
0
votes
1 answer
How to formulate discounted return in CartPole?
I am trying to formulate a problem that aims to prolong the lifetime of the simulation, as in the CartPole problem. I am aware that there are two types of return:
finite horizon undiscounted return (used for episodic problems)
$G = \sum_{t=0}^T…
Ngoc Bui
0
votes
1 answer
Converging to a wrong optimal policy if the agent is given more choices
I am a bit new to reinforcement learning, so I am sorry if I am asking something obvious. I have written a small piece of code to find the optimal policy for a 5x5 grid problem.
Scenario 1. The agent is only given two choices (Up,…
Tyrion