
I tried to use vanilla actor-critic (AC) to solve the classic control problems and none of them work, not even the simple CartPole environment. I used a shared layer of 128 neurons for both the actor and the critic, followed by a ReLU, and then the actor and critic each have their own head. The optimizer is Adam(1e-4), and the learning rate for the average reward is 0.0001. Instead of the discounted return, I estimate the average reward. The network is roughly the sketch below, and the learn() method after it is how I update the parameters.
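
Here is a simplified sketch of the network part (the class and variable names are illustrative, not my exact code; in the real agent these methods sit on the same class as the optimizers, self.R and self.gamma used in learn()):

import torch
import torch.nn as nn
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class ActorCriticNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        # shared 128-unit layer followed by a ReLU
        self.shared = nn.Linear(obs_dim, hidden)
        # separate heads: actor outputs action logits, critic outputs a state value
        self.actor_head = nn.Linear(hidden, n_actions)
        self.critic_head = nn.Linear(hidden, 1)

    def pi(self, s):
        # action probabilities from the actor head
        h = F.relu(self.shared(s))
        return F.softmax(self.actor_head(h), dim=-1)

    def critic_value(self, s):
        # scalar state-value estimate from the critic head
        h = F.relu(self.shared(s))
        return self.critic_head(h).squeeze(-1)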

def learn(self, s, a, r, s_):
    # Convert inputs to PyTorch tensors
    s = torch.tensor(s, dtype=torch.float32).to(device)
    s_ = torch.tensor(s_, dtype=torch.float32).to(device)
    a = torch.tensor(a, dtype=torch.int64).to(device)
    r = torch.tensor(r, dtype=torch.float32).to(device)

    # Average-reward TD error: delta = r - R + V(s') - V(s)
    critic_value = self.critic_value(s)
    next_critic_value = self.critic_value(s_)
    delta = r + next_critic_value - critic_value - self.R
    critic_loss = F.mse_loss(r + next_critic_value - self.R, critic_value)

    # Update critic (value function)
    self.critic_optimizer.zero_grad()
    critic_loss.backward()
    self.critic_optimizer.step()

    # Update actor (policy gradient)
    # Get log probability of the action taken
    prob_vec = self.pi(s)
    log_prob = torch.log(prob_vec[a])

    # Actor loss: negative log probability * advantage (delta)
    actor_loss = -log_prob * delta.detach()
    self.actor_optimizer.zero_grad()
    actor_loss.backward()
    self.actor_optimizer.step()

    # Update the running estimate of the average reward R
    # (self.gamma acts as the step size for R here)
    self.R = self.R + self.gamma * delta.detach()
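
The training loop is just the standard online loop, roughly like this (a sketch; Agent and choose_action are placeholders for my agent class and its action-sampling method, and the step API is the gymnasium 5-tuple version):

import gymnasium as gym

env = gym.make("CartPole-v1")
# Agent wraps the network, optimizers, R and gamma; constructor args are illustrative
agent = Agent(obs_dim=env.observation_space.shape[0],
              n_actions=env.action_space.n)

for episode in range(2000):
    s, _ = env.reset()
    done = False
    while not done:
        a = agent.choose_action(s)   # sample an action from pi(s) (assumed helper)
        s_, r, terminated, truncated, _ = env.step(a)
        agent.learn(s, a, r, s_)
        s = s_
        done = terminated or truncated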

I couldn't find an existing implementation of vanilla AC online. Is that because it doesn't work at all? Has anyone successfully trained vanilla AC?
