
I know that policy gradients used in an environment with a discrete action space are updated with $$ \Delta \theta_{t}=\alpha \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right) v_{t}, $$ where $v_t$ could be many things that represent how good the action was. I also know that this can be calculated by performing cross-entropy loss with the target being what the network would have output if it were completely confident in its action (zeros, with a one at the index of the action chosen). But I don't understand how to apply that to policy gradients that output the mean and variance of a Gaussian distribution for a continuous action space. What is the loss for these types of policy gradients?
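To spell out the cross-entropy connection I mean: with a one-hot target whose single 1 is at the index of the chosen action $a_t$, the cross-entropy weighted by $v_t$ collapses to $$ L(\theta) = -v_t \sum_{i} \mathbb{1}[i = a_t] \log \pi_{\theta}(i \mid s_t) = -v_t \log \pi_{\theta}(a_t \mid s_t), $$ so minimizing it by gradient descent reproduces the update above.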

I tried keeping the variance constant and updating the mean with mean squared error loss, with the target being the action that was taken. I thought this would end up pushing the mean towards actions with greater total rewards, but it got nowhere in OpenAI's Pendulum environment.

It would also be very helpful if it were described in terms of a loss function and a target, like how policy gradients with discrete action spaces can be updated with cross-entropy loss. That is how I understand it best, but it is okay if that is not possible.

Edit: for @Philipp. The way I understand it is that the loss function is the same with a continuous action space, and the only thing that changes is the distribution that we get the log-probs from. In PyTorch we can use a Normal distribution for a continuous action space and Categorical for a discrete action space. The answer from David Ireland goes into the math, but in PyTorch it looks like log_prob = distribution.log_prob(action_taken) for any type of distribution. It makes sense that for bad actions we would want to decrease the probability of taking the action. Below is working code for both types of action spaces so you can compare them. The continuous action space code should be correct, but the agent will not learn, because it is harder to learn the right actions with a continuous action space and our simple method isn't enough. Look into more advanced methods like PPO and DDPG.
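To make the comparison concrete, here is a minimal sketch (my own, with made-up numbers) showing that only the distribution object changes between the two cases; the full scripts follow below.

import torch
from torch.distributions import Categorical, Normal

# discrete: the network outputs one probability per action
probs = torch.tensor([0.25, 0.75])                # e.g. the output of the CartPole network
discrete_dist = Categorical(probs)
action = discrete_dist.sample()
log_prob = discrete_dist.log_prob(action)         # used as -log_prob * v_t in the loss

# continuous: the network outputs the mean (and optionally the std) of a Gaussian
mu, sigma = torch.tensor(0.3), torch.tensor(0.2)  # e.g. the output of the Pendulum network
continuous_dist = Normal(mu, sigma)
action = continuous_dist.sample()
log_prob = continuous_dist.log_prob(action)       # same loss term: -log_prob * v_t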

import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions.categorical import Categorical #discrete distribution
import numpy as np
import gym
import math
import matplotlib.pyplot as plt

class Agent(nn.Module):
    def __init__(self, lr):
        super(Agent, self).__init__()
        self.fc1 = nn.Linear(4, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 2)  # neural network with layers 4, 64, 32, 2

        self.optimizer = optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        x = torch.relu(self.fc1(x))  # relu for the hidden layers
        x = torch.relu(self.fc2(x))
        x = torch.sigmoid(self.fc3(x))  # sigmoid output; Categorical normalizes the probabilities
        return x

env = gym.make('CartPole-v0')
agent = Agent(0.001)

# hyperparameters
DISCOUNT = 0.99
total = []

for e in range(500):
    log_probs, rewards = [], []
    done = False
    state = env.reset()
    while not done:
        #mu = agent.forward(torch.from_numpy(state).float())
        #distribution = Normal(mu, SIGMA)
        distribution = Categorical(agent.forward(torch.from_numpy(state).float()))
        action = distribution.sample()
        log_probs.append(distribution.log_prob(action))
        state, reward, done, info = env.step(action.item())
        rewards.append(reward)

    total.append(sum(rewards))

    cumulative = 0
    d_rewards = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):  # get discounted rewards
        cumulative = cumulative * DISCOUNT + rewards[t]
        d_rewards[t] = cumulative
    d_rewards -= np.mean(d_rewards)  # normalize
    d_rewards /= np.std(d_rewards)

    loss = 0
    for t in range(len(rewards)):
        loss += -log_probs[t] * d_rewards[t]  # loss is -log prob * discounted return

    agent.optimizer.zero_grad()
    loss.backward()  # update
    agent.optimizer.step()

    if e % 10 == 0:
        print(e, sum(rewards))
        plt.plot(total, color='blue')  # plot
        plt.pause(0.0001)


def run(i):  # to visualize performance
    for _ in range(i):
        done = False
        state = env.reset()
        while not done:
            env.render()
            distribution = Categorical(agent.forward(torch.from_numpy(state).float()))
            action = distribution.sample()
            state, reward, done, info = env.step(action.item())
    env.close()

Above is the discrete action space code for CartPole and below is the continuous action space code for Pendulum. Sigma (the standard deviation) is constant here, but making it learnable is easy: just give the final layer two neurons and make sure sigma is not negative (a rough sketch of such a network follows the Pendulum code). Again, the Pendulum code won't work, because most environments with continuous action spaces are too complicated for such a simple method; making it work would probably require a lot of hyperparameter tuning.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions.normal import Normal #continuous distribution
import numpy as np
import gym
import math
import matplotlib.pyplot as plt
import keyboard

class Agent(nn.Module):
    def __init__(self, lr):
        super(Agent, self).__init__()
        self.fc1 = nn.Linear(3, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)  # neural network with layers 3, 64, 32, 1

        self.optimizer = optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        x = torch.relu(self.fc1(x))  # relu for the hidden layers, tanh for the output
        x = torch.relu(self.fc2(x))
        x = torch.tanh(self.fc3(x)) * 2  # scale the mean to Pendulum's action range [-2, 2]
        return x

env = gym.make('Pendulum-v0')
agent = Agent(0.01)

# hyperparameters
SIGMA = 0.2
DISCOUNT = 0.99
total = []

for e in range(1000):
    log_probs, rewards = [], []
    done = False
    state = env.reset()
    while not done:
        mu = agent.forward(torch.from_numpy(state).float())
        distribution = Normal(mu, SIGMA)
        action = distribution.sample().clamp(-2.0, 2.0)
        log_probs.append(distribution.log_prob(action))
        state, reward, done, info = env.step([action.item()])
        #reward = abs(state[1])
        rewards.append(reward)

    total.append(sum(rewards))

    cumulative = 0
    d_rewards = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):  # get discounted rewards
        cumulative = cumulative * DISCOUNT + rewards[t]
        d_rewards[t] = cumulative
    d_rewards -= np.mean(d_rewards)  # normalize
    d_rewards /= np.std(d_rewards)

    loss = 0
    for t in range(len(rewards)):
        loss += -log_probs[t] * d_rewards[t]  # loss is -log prob * discounted return

    agent.optimizer.zero_grad()
    loss.backward()  # update
    agent.optimizer.step()

    if e % 10 == 0:
        print(e, sum(rewards))
        plt.plot(total, color='blue')  # plot
        plt.pause(0.0001)
        if keyboard.is_pressed("space"):  # holding space exits training
            raise Exception("Exited")


def run(i):  # to visualize performance
    for _ in range(i):
        done = False
        state = env.reset()
        while not done:
            env.render()
            distribution = Normal(agent.forward(torch.from_numpy(state).float()), SIGMA)
            action = distribution.sample()
            state, reward, done, info = env.step([action.item()])
    env.close()
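For reference, here is a rough, untested sketch of the two-output network mentioned above. It reuses the imports from the Pendulum script; the class name GaussianAgent and the softplus choice are mine, not part of the scripts above.

class GaussianAgent(nn.Module):
    def __init__(self, lr):
        super(GaussianAgent, self).__init__()
        self.fc1 = nn.Linear(3, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 2)  # two outputs: mean and raw standard deviation
        self.optimizer = optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        out = self.fc3(x)
        mu = torch.tanh(out[0]) * 2                    # mean in Pendulum's range [-2, 2]
        sigma = nn.functional.softplus(out[1]) + 1e-5  # softplus keeps sigma positive
        return mu, sigma

# inside the training loop you would then use:
#     mu, sigma = agent.forward(torch.from_numpy(state).float())
#     distribution = Normal(mu, sigma)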

David Ireland also wrote this on a different question I had:

The algorithm doesn't change in this situation. Say your NN outputs the mean parameter of the Gaussian; then $\log \pi(a_t \mid s_t)$ is just the log of the normal density evaluated at the action you took, where the mean parameter in the density is the output of your NN. You are then able to backpropagate through this to update the weights of your network.

1 Answer


This update rule can still be applied in the continuous domain.

As pointed out in the comments, suppose we are parameterising our policy using a Gaussian distribution, where our neural network takes as input the state we are in and outputs the parameters of a Gaussian distribution: the mean and the standard deviation, which we will denote by $\mu(s, \theta)$ and $\sigma(s, \theta)$, where $s$ indicates the dependence on the state and $\theta$ are the parameters of our network.

I will assume a one-dimensional case for ease of notation but this can be extended to multi-variate cases. Our policy is now defined as $$\pi(a_t | s_t) = \frac{1}{\sqrt{2\pi \sigma(s_t, \theta)^2}} \exp\left(-\frac{1}{2}\left(\frac{a_t - \mu(s_t, \theta)}{\sigma(s_t, \theta)}\right)^2\right).$$

As you can see, we can easily take the logarithm of this and find its derivative with respect to $\theta$, so nothing changes: the loss you use is the same. You simply evaluate the derivative of the log of your policy with respect to the network parameters, multiply by $v_t$ and $\alpha$, and take a gradient step in this direction.
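Explicitly, taking the logarithm of the density above gives $$ \log \pi(a_t \mid s_t) = -\log \sigma(s_t, \theta) - \frac{\left(a_t - \mu(s_t, \theta)\right)^2}{2\sigma(s_t, \theta)^2} - \frac{1}{2}\log(2\pi), $$ which is an ordinary differentiable function of $\theta$ through $\mu(s_t, \theta)$ and $\sigma(s_t, \theta)$.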

To implement this (as I'm assuming you don't want to calculate the NN derivatives by hand), you could do something along the lines of the following in PyTorch.

First you want to pass your state through your NN to get the mean and standard deviation of the Gaussian distribution. Then you want to simulate $z \sim N(0,1)$ and calculate $a = \mu(s,\theta) + \sigma(s, \theta) \times z$ so that $a \sim N(\mu(s, \theta), \sigma(s, \theta))$; this is the reparameterisation trick, which makes backpropagation through the network easier because it takes the randomness from a source that doesn't depend on the parameters of the network. $a$ is the action that you execute in your environment and use to calculate the gradient, by simply writing the code torch.log(normal_pdf(a, mu, sigma)).backward(), where normal_pdf() is any function in Python that calculates the pdf of a normal distribution for a given point and parameters, and mu and sigma are the network outputs $\mu(s, \theta)$ and $\sigma(s, \theta)$.
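A minimal, untested sketch of that procedure, using torch.distributions.Normal in place of a hand-written normal_pdf(); here policy_net, optimizer, state and v_t are hypothetical placeholders, not objects from the code above:

import torch
from torch.distributions import Normal

def reinforce_step(policy_net, optimizer, state, v_t):
    mu, log_sigma = policy_net(state)    # network outputs the Gaussian parameters
    sigma = log_sigma.exp()              # exponentiate to keep the std positive
    dist = Normal(mu, sigma)
    # reparameterised sample a = mu + sigma * z with z ~ N(0, 1), as described above;
    # for the plain score-function (REINFORCE) update the sampled action is treated
    # as a constant when differentiating the log-density, hence the detach()
    action = dist.rsample().detach()
    loss = -dist.log_prob(action) * v_t  # minimizing this is a gradient step on v_t * log pi
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return action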
