
One of the main disadvantages of the Monte Carlo policy gradient algorithm (REINFORCE), as described, say, here, is its high variance: the returns we sample will vary significantly from episode to episode. It is therefore perfectly reasonable to use a critic to reduce this variance, and that is what, for example, Deep Deterministic Policy Gradient (DDPG) does.
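For reference, here is a rough sketch of the two gradient estimators in standard notation (not taken from the linked description): the Monte Carlo estimator weights the score function by the sampled return, while an actor-critic replaces that return with a learned critic, trading variance for bias.

```latex
% REINFORCE (Monte Carlo) estimate: weight by the sampled return G_t
\nabla_\theta J(\theta) \;\approx\; \sum_t G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)

% Actor-critic variant: replace G_t with a learned critic Q_w,
% lowering variance at the cost of some bias
\nabla_\theta J(\theta) \;\approx\; \sum_t Q_w(s_t, a_t) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)
```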

Now, let's assume that we're given an MDP with completely deterministic dynamics. In that case, if we start from a specific state and follow a fixed deterministic policy, we will always obtain exactly the same return (so we have zero bias and zero variance). If we follow a stochastic policy, the return will vary depending on how much we explore, but under an almost-deterministic policy our variance will be quite small. In any case, the deterministic MDP dynamics contribute nothing to the variance.
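As a toy illustration of this point (a hypothetical deterministic chain MDP, not anything from the question), the sketch below shows that a deterministic policy under deterministic dynamics yields exactly the same return every episode, while any return variance under a stochastic policy comes purely from the action sampling:

```python
import numpy as np

# Hypothetical deterministic chain MDP: the state advances by the chosen action,
# the reward equals the new state index. Any randomness can only come from the policy.
def run_episode(policy, horizon=10, seed=None):
    rng = np.random.default_rng(seed)
    state, ret = 0, 0.0
    for _ in range(horizon):
        action = policy(state, rng)          # 0 or 1
        state = state + action               # deterministic transition
        ret += float(state)                  # deterministic reward
    return ret

def deterministic_policy(state, rng):
    return 1                                 # always step forward

def stochastic_policy(state, rng):
    return int(rng.random() < 0.9)           # almost deterministic: step forward 90% of the time

det_returns  = [run_episode(deterministic_policy, seed=i) for i in range(1000)]
stoc_returns = [run_episode(stochastic_policy, seed=i) for i in range(1000)]

print("deterministic policy return variance:", np.var(det_returns))   # exactly 0
print("stochastic policy return variance:   ", np.var(stoc_returns))  # small, driven by the policy
```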

In deep reinforcement learning for portfolio optimization, many researchers (Xiong et al., for example) use historical market data for model training. The resulting MDP dynamics are, of course, completely deterministic (if historical prices are used as states), and there is no real sequentiality involved. Consequently, all return variance stems from the stochasticity of the policy itself. However, most researchers still use DDPG as a variance reduction mechanism.

What is the point of using DDPG for variance reduction when the underlying MDP used for training has deterministic dynamics? Why not simply use the REINFORCE algorithm?

BGa

1 Answer


In deep reinforcement learning for portfolio optimization, many researchers (Xiong et al., for example) use historical market data for model training. The resulting MDP dynamics are, of course, completely deterministic (if historical prices are used as states), and there is no real sequentiality involved.

Whilst I cannot comment on the specific financial model, I think it unlikely that these researchers would apply RL without there being a sequence of decisions.

More likely, in my opinion, the historic data feed forms a major part of the environment, but there are still time steps and a state that depends on the agent's actions. For instance, in a trading simulation, provided the values of trades are below levels that would significantly alter the market itself, it may be a reasonable approximation to let the history of prices and other factual information progress like a recording, while the state also includes the agent's current portfolio of investments and working funds.
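For concreteness, here is a minimal sketch of what such an environment might look like; the class and the reward definition are my own illustration, not taken from Xiong et al. Prices replay deterministically like a recording, but the state also carries the agent's cash and holdings, which depend on its past actions:

```python
import numpy as np

class PriceReplayEnv:
    """Hypothetical sketch: historical prices replay deterministically,
    but the state also tracks the agent's cash and holdings, which
    depend on its past actions."""

    def __init__(self, prices, cash=1000.0):
        self.prices = np.asarray(prices, dtype=float)   # fixed historical series
        self.initial_cash = cash

    def reset(self):
        self.t = 0
        self.cash = self.initial_cash
        self.holdings = 0.0
        return self._state()

    def _state(self):
        return np.array([self.prices[self.t], self.cash, self.holdings])

    def step(self, action):
        # action: fraction of current cash to move into the asset (negative = sell)
        price = self.prices[self.t]
        old_value = self.cash + self.holdings * price
        trade_value = float(np.clip(action, -1.0, 1.0)) * self.cash
        self.holdings += trade_value / price
        self.cash -= trade_value

        self.t += 1                                      # the "recording" advances
        new_value = self.cash + self.holdings * self.prices[self.t]
        reward = new_value - old_value                   # change in portfolio value
        done = self.t == len(self.prices) - 1
        return self._state(), reward, done

# Example: prices replay like a recording; only the policy introduces randomness
env = PriceReplayEnv(prices=[100.0, 101.0, 99.0, 102.0, 105.0])
state = env.reset()
state, reward, done = env.step(0.5)   # move half the cash into the asset
```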

What is the point of using DDPG for variance reduction when the underlying MDP used for training has deterministic dynamics? Why not simply use the REINFORCE algorithm?

Variance in returns occurs due to stochastic dynamics (if they are present) and the behaviour policy. You cannot use any RL control algorithm with a deterministic behaviour policy*. It would never gain any data that allowed it to assess alternative behaviour.

So in REINFORCE, which is on-policy (the behaviour policy and the target policy are the same) and typically starts with near-equiprobable action choices, there is high variance. It can be very high when measured over a long episode with many action choices. In basic REINFORCE, this variance is not controlled for, and training uses individual Monte Carlo returns.
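A bare-bones sketch of that update makes the point (a linear softmax policy over discrete actions; the hyperparameters are placeholders, and the environment is assumed to expose reset() and step() returning (state, reward, done)): each gradient step is weighted by a single sampled Monte Carlo return, so nothing damps the episode-to-episode variation:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_episode(theta, env, gamma=0.99, lr=1e-3):
    """One episode of basic REINFORCE: a linear softmax policy updated with
    raw Monte Carlo returns, no baseline or critic (hence the high variance)."""
    state = env.reset()
    trajectory, done = [], False
    while not done:
        probs = softmax(theta @ state)                    # theta: (n_actions, state_dim)
        action = np.random.choice(len(probs), p=probs)    # stochastic behaviour policy
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward, probs))
        state = next_state

    G = 0.0
    for state, action, reward, probs in reversed(trajectory):
        G = reward + gamma * G                            # sampled return: the noisy weight
        grad_log_pi = np.outer(np.eye(len(probs))[action] - probs, state)
        theta = theta + lr * G * grad_log_pi              # update scaled by the raw return
    return theta
```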

In DDPG, which is off-policy (the target policy is deterministic**, the behaviour policy is stochastic), there is still variance, but it is much reduced by the actor-critic mechanism, and it can be further constrained by the choice of noise function that relates the behaviour policy to the target policy. In addition, updates to the policy and value functions can be made independently of episode ends, which can significantly speed up learning.
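By contrast, a schematic DDPG-style loop (network sizes, noise scale and learning rates are placeholders, not values from the DDPG paper) shows both points: the behaviour policy is the deterministic actor plus a chosen noise process, and the actor and critic are updated every step from a replay batch rather than at episode ends:

```python
import copy
import numpy as np
import torch
import torch.nn as nn

state_dim, action_dim = 3, 1   # placeholder dimensions

actor = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 32), nn.ReLU(), nn.Linear(32, 1))
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def behaviour_action(state, noise_std=0.1):
    # Behaviour policy = deterministic actor output + chosen noise process
    with torch.no_grad():
        a = actor(torch.as_tensor(state, dtype=torch.float32))
    return (a + noise_std * torch.randn_like(a)).clamp(-1, 1).numpy()

def update(batch, gamma=0.99, tau=0.005):
    # Called every environment step from a replay batch, independent of episode ends
    s, a, r, s2, done = (torch.as_tensor(np.array(x), dtype=torch.float32) for x in zip(*batch))
    r, done = r.unsqueeze(1), done.unsqueeze(1)

    with torch.no_grad():                                 # critic target uses the target networks
        q_next = critic_target(torch.cat([s2, actor_target(s2)], dim=1))
        q_target = r + gamma * (1 - done) * q_next
    critic_loss = ((critic(torch.cat([s, a], dim=1)) - q_target) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()   # deterministic policy gradient
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    for net, target in ((actor, actor_target), (critic, critic_target)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)     # soft target-network update
```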

To determine what difference this makes for any given experiment, you would need to compare the two algorithms on the same task. In practice, DDPG will significantly outperform REINFORCE on many tasks, including those with deterministic environments. However, there might be specific combinations where the simplicity of REINFORCE wins out, if only because there are fewer hyperparameters to tune.

On one point:

under an almost-deterministic policy our variance will be quite small

That's true, but how do you get to that stage of training with REINFORCE? It is by testing and working through more stochastic policies, which is what takes the time. Your statement only applies to REINFORCE when the control problem is nearly solved, or if you take a short-cut and force the policy function into what you hope is a near-optimal policy. In that case you are engaging in a form of variance reduction; it may even work in some scenarios, but it is likely not as general as applying an actor-critic algorithm.


* Actually, technically you can, if the environment is stochastic in the right way, so that you effectively explore all state/action combinations. But we are talking about deterministic environments here, and obviously a stochastic environment would introduce variance into the returns.

** The target policy changes over time, and this introduces non-stationarity and bias for the critic component to deal with, but not technically variance.

Neil Slater