
I'm quite new to reinforcement learning, and I'm working on a communication problem with a large continuous action space for my final graduation project. I'm trying to use a Gaussian policy with policy gradient methods for the implementation. Let me explain the sequential logic of the task: from the current observation (o_t), the agent takes an action (a_t) and reaches a new state (s_t+1). This new state is a function of the previous observation (o_t) and the action (a_t) taken by the agent. However, s_t+1 is not the next observation the agent will use to choose its next action; it is only used to compute the reward for that iteration, which drives the policy optimization.

My question is whether RL can be applied to this type of problem, where the state reached by the agent's actions at a given time step is not the next observation the agent uses to choose its next actions.
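
To make the setup concrete, here is a minimal sketch of the loop I have in mind (Python/PyTorch; the calls env.next_observation(), env.transition(o, a) and env.reward(s) are hypothetical placeholders for my communication model, and all dimensions are arbitrary). The point is that s_t+1 only scores the step, while the next observation comes from elsewhere:

    # Minimal sketch, assuming PyTorch and a hypothetical environment object `env`
    # with next_observation(), transition(o, a) and reward(s); these names and the
    # dimensions below are placeholders, not part of any real library.
    import torch
    import torch.nn as nn

    class GaussianPolicy(nn.Module):
        def __init__(self, obs_dim, act_dim):
            super().__init__()
            self.mean = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                      nn.Linear(64, act_dim))
            self.log_std = nn.Parameter(torch.zeros(act_dim))

        def dist(self, obs):
            return torch.distributions.Normal(self.mean(obs), self.log_std.exp())

    policy = GaussianPolicy(obs_dim=8, act_dim=4)   # placeholder dimensions
    optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
    horizon = 200

    log_probs, rewards = [], []
    o_t = env.next_observation()                    # observation does NOT come from s_{t+1}
    for t in range(horizon):
        dist = policy.dist(torch.as_tensor(o_t, dtype=torch.float32))
        a_t = dist.sample()
        log_probs.append(dist.log_prob(a_t).sum())

        s_next = env.transition(o_t, a_t.numpy())   # s_{t+1} = f(o_t, a_t)
        rewards.append(env.reward(s_next))          # s_{t+1} is only used for the reward
        o_t = env.next_observation()                # next observation comes from elsewhere

    # REINFORCE-style update (undiscounted returns-to-go for simplicity)
    returns = torch.tensor(rewards).flip(0).cumsum(0).flip(0)
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()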

  • Not an expert on RL myself yet, but it seems you are describing a POMDP = Partially Observable MDP (as opposed to a vanilla MDP), which is used a lot in RL. So yes :) – Felix Goldberg Jun 25 '23 at 11:00

1 Answer


Yes. There are RL algorithms (Dreamer v1, v2, v3) that learn a world model to predict the next latent state (h_t+1) from the previous state (h_t) and the sampled action. The actor and critic are then trained on rollouts from this simulated world model, which makes the method much more data efficient. https://arxiv.org/abs/2301.04104v1
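
For intuition, here is a minimal sketch of the world-model idea (Python/PyTorch; this is not Dreamer's actual RSSM, just an illustrative transition-plus-reward model with made-up names and sizes): fit the model on real transitions, then roll it out in "imagination" so the policy can train on simulated trajectories.

    # Minimal sketch of the world-model idea (illustrative only, not Dreamer's RSSM):
    # learn h_{t+1} ~= f_theta(h_t, a_t) plus a reward head from real transitions,
    # then roll the model out in "imagination" to train the actor-critic.
    import torch
    import torch.nn as nn

    class WorldModel(nn.Module):
        def __init__(self, state_dim, act_dim):
            super().__init__()
            self.dynamics = nn.Sequential(nn.Linear(state_dim + act_dim, 128), nn.ELU(),
                                          nn.Linear(128, state_dim))   # predicts h_{t+1}
            self.reward_head = nn.Sequential(nn.Linear(state_dim, 64), nn.ELU(),
                                             nn.Linear(64, 1))

        def step(self, h, a):
            h_next = self.dynamics(torch.cat([h, a], dim=-1))
            return h_next, self.reward_head(h_next).squeeze(-1)

    model = WorldModel(state_dim=16, act_dim=4)      # placeholder dimensions
    model_opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    def train_model(h, a, r, h_next):
        """Fit the model on a batch of real transitions (h_t, a_t, r_t, h_{t+1})."""
        pred_next, pred_r = model.step(h, a)
        loss = ((pred_next - h_next) ** 2).mean() + ((pred_r - r) ** 2).mean()
        model_opt.zero_grad()
        loss.backward()
        model_opt.step()

    def imagine(policy, h0, horizon=15):
        """Roll the learned model forward; the actor-critic then trains on these
        imagined trajectories instead of only on real environment steps."""
        h, trajectory = h0, []
        for _ in range(horizon):
            a = policy(h)                            # any policy network acting on h
            h, r = model.step(h, a)
            trajectory.append((h, a, r))
        return trajectory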

ipoppo