I tried to use vanilla actor-critic (AC) to solve the classic control problems and couldn't get it to work on any of them, not even the simple CartPole environment. I use a shared layer of 128 neurons for both the actor and critic networks, followed by a ReLU, after which the actor and critic have their own heads. The optimizer is Adam with a learning rate of 1e-4, and the step size for the average-reward estimate is 0.0001. Instead of the discounted return, I estimate the average reward (differential return).
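For context, the setup looks roughly like the sketch below. This is only a minimal reconstruction to make the update code concrete: the class name Agent, the choice to include the shared layer in both optimizers, and the use of self.gamma as the step size for the average-reward estimate R are illustrative, chosen to match what learn() refers to.

import torch
import torch.nn as nn
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class Agent(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        # Shared 128-unit layer followed by a ReLU, then separate heads
        self.shared = nn.Linear(obs_dim, 128)
        self.actor_head = nn.Linear(128, n_actions)
        self.critic_head = nn.Linear(128, 1)
        # Separate Adam optimizers; the shared layer is included in both
        # (an assumption of this sketch)
        self.actor_optimizer = torch.optim.Adam(
            list(self.shared.parameters()) + list(self.actor_head.parameters()), lr=1e-4)
        self.critic_optimizer = torch.optim.Adam(
            list(self.shared.parameters()) + list(self.critic_head.parameters()), lr=1e-4)
        self.R = 0.0        # running estimate of the average reward
        self.gamma = 1e-4   # step size for updating R (0.0001), not a discount factor

    def pi(self, s):
        # Action probabilities from the actor head
        return F.softmax(self.actor_head(F.relu(self.shared(s))), dim=-1)

    def critic_value(self, s):
        # Scalar state value from the critic head
        return self.critic_head(F.relu(self.shared(s))).squeeze(-1)

This is how I update the parameters: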
def learn(self, s, a, r, s_):
    # Convert inputs to PyTorch tensors on the target device
    s = torch.tensor(s, dtype=torch.float32).to(device)
    s_ = torch.tensor(s_, dtype=torch.float32).to(device)
    a = torch.tensor(a, dtype=torch.int64).to(device)
    r = torch.tensor(r, dtype=torch.float32).to(device)

    # Differential TD error: delta = r - R + V(s') - V(s)
    critic_value = self.critic_value(s)
    with torch.no_grad():
        next_critic_value = self.critic_value(s_)  # no gradient through the bootstrap target
    delta = r - self.R + next_critic_value - critic_value

    # Critic loss: semi-gradient TD(0), so the target (r - R + V(s'))
    # is treated as a constant and only V(s) receives gradient
    td_target = r - self.R + next_critic_value
    critic_loss = F.mse_loss(critic_value, td_target)
    self.critic_optimizer.zero_grad()
    critic_loss.backward()
    self.critic_optimizer.step()

    # Actor loss: -log pi(a|s) * delta, with delta detached so the
    # policy gradient does not flow back into the critic
    prob_vec = self.pi(s)
    log_prob = torch.log(prob_vec[a])  # log probability of the taken action
    actor_loss = -log_prob * delta.detach()
    self.actor_optimizer.zero_grad()
    actor_loss.backward()
    self.actor_optimizer.step()

    # Update the average-reward estimate: R <- R + eta * delta
    # (self.gamma is the step size eta = 0.0001)
    self.R = self.R + self.gamma * delta.detach().item()
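The training loop is just the standard one-step update on every transition, roughly like the sketch below (assuming the Gymnasium CartPole-v1 API; here the action is sampled directly from pi(s) inline, and the episode count of 1000 is arbitrary):

import gymnasium as gym

env = gym.make("CartPole-v1")
agent = Agent(env.observation_space.shape[0], env.action_space.n).to(device)

for episode in range(1000):
    s, _ = env.reset()
    done = False
    while not done:
        # Sample an action from the current policy
        with torch.no_grad():
            probs = agent.pi(torch.tensor(s, dtype=torch.float32).to(device))
        a = torch.multinomial(probs, 1).item()
        s_, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # One-step actor-critic update on this transition
        agent.learn(s, a, r, s_)
        s = s_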
I couldn't find an existing implementation of vanilla AC online. Is that because it simply doesn't work? Has anyone successfully trained vanilla AC?