As I'm new to the AI/ML field, I'm still learning from various online materials. In this particular instance, I've been studying the Reinforcement Learning tutorial by deeplizard, specifically focusing on videos 8 through 10.
After attempting to replicate the example that demonstrates how to train an agent in the gym FrozenLake environment, I ran into some unexpected issues. Although I managed to debug the code and get the algorithm working, I can't confidently say that I fully understand whether this is the correct approach. I therefore have several questions about this exercise, which I'm posting together in this single thread. Please feel free to answer any of them at your convenience.
Q1
What is this gym environment warning all about, and how should it be properly addressed?
/home/boris/.local/lib/python3.11/site-packages/gym/utils/passive_env_checker.py:233: DeprecationWarning: `np.bool8` is a deprecated alias for `np.bool_`. (Deprecated NumPy 1.24)
if not isinstance(terminated, (bool, np.bool8)):
So far, I've found two ways to handle this warning. One is to simply suppress it:
from warnings import filterwarnings
filterwarnings(action='ignore', category=DeprecationWarning, message='`np.bool8` is a deprecated alias')
Another is to replace the gym environment with the gymnasium environment, which does not produce this warning.
#import gym
import gymnasium as gym
This brings me to my second question.
Q2
Why are there two packages, gym and gymnasium, that appear to do the same thing?
Most online examples use gym, but I believe gymnasium is the better choice. Am I mistaken?
Q3
What's the distinction between "terminated" and "truncated"? I've made numerous attempts, but I couldn't find a situation where they diverge. They consistently evaluate to either both True or both False.
In the current 'gym' version, I have to use "FrozenLake-v1" (originally, "FrozenLake-v0" was used in the tutorial), and this modification causes the code to break because env.step() now returns five values instead of four. Initially, I fixed this issue as follows:
new_state, reward, done, UNKNOWN, info = env.step(action)
I discovered later that the correct name for this variable is "truncated." Consequently, I adjusted the code as follows:
new_state, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated
Nonetheless, I remain perplexed regarding the usage of "terminated" and "truncated." While experimenting with various algorithm options, such as max_steps_per_episode, I was unable to get them to differ. In all cases, they either both evaluate to True or both to False.
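For reference, this is the kind of probe I've been using to compare the two flags. It's a throwaway sketch of my own (random actions, default time limit), not something from the tutorial:

import gymnasium as gym

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True)
state, info = env.reset(seed=42)
while True:
    # Take random actions and print both flags at every step.
    state, reward, terminated, truncated, info = env.step(env.action_space.sample())
    print("terminated =", terminated, "| truncated =", truncated)
    if terminated or truncated:
        break
env.close()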
Q4
If I want to continue training the agent in stages and test its intermediate performance, how can I do that? I was thinking it's just a matter of saving the q_table, then loading the saved q_table and replacing the current q_table initialization with the new one before resuming training. Or is it not that simple?
with open(filename, 'rb') as f:
    saved_q_table = f.read()
#q_table = np.zeros((state_space_size, action_space_size))
q_table = pickle.loads(saved_q_table)
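Or, since q_table is just a NumPy array, perhaps np.save and np.load are all that's needed? This is only a rough sketch of what I have in mind, and the checkpoint filename is a placeholder of my own:

import numpy as np

Q_PATH = "q_table_checkpoint.npy"   # placeholder filename, not from the tutorial

# After a training stage: save the current table to disk.
np.save(Q_PATH, q_table)

# Before the next stage: load the saved table instead of re-initializing with zeros.
try:
    q_table = np.load(Q_PATH)
except FileNotFoundError:
    q_table = np.zeros((state_space_size, action_space_size))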
Q5
Is there a way to get some documentation or help about the environment?
I'm seeking a resource that I can print and reference while coding to gain a better understanding of the environment's parameters. For instance, in scikit-learn, you can achieve this with the following code:
from sklearn.neighbors import KNeighborsRegressor
from pprint import pprint
pprint(KNeighborsRegressor().get_params())
Resulting in the following output:
{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}
It would be great if one could get relevant agent/states parameters from the gym environments in a similar way.
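The closest I've found so far is inspecting the attributes that the environment object itself exposes, though I don't know whether this is the intended way; this is just my own poking around:

from pprint import pprint
import gymnasium as gym

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True)

print(env.spec)               # registration info: id, kwargs, max_episode_steps, ...
print(env.observation_space)  # Discrete(16) for the 4x4 map
print(env.action_space)       # Discrete(4)
pprint(env.metadata)          # e.g. the supported render modes
help(env.unwrapped)           # docstring of the underlying FrozenLakeEnv class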
Q6
How do I use render_mode="human" correctly? Currently, I'm using render_mode="ansi" and rendering the environment as follows:
env = gym.make("FrozenLake-v1", map_name="8x8", is_slippery=True, render_mode="ansi", desc=None)
print("\033c")
print(env.render())
However, when I switch to render_mode="human", the environment automatically displays without the need for env.render(). Consequently, the environment renders during training as well, leading to extremely slow training. How can I disable this automatic rendering when using render_mode="human"?
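The workaround I'm currently considering (I don't know if it's the intended pattern) is to train on an environment without human rendering and create a second environment with render_mode="human" only for watching the trained agent. A rough sketch, assuming q_table comes from the training loop:

import numpy as np
import gymnasium as gym

# Train on an environment without on-screen rendering (fast).
train_env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True)
# ... run the Q-learning loop on train_env as usual, filling q_table ...

# Watch the trained agent in a separate, human-rendered environment.
watch_env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True, render_mode="human")
state, info = watch_env.reset()
done = False
while not done:
    action = np.argmax(q_table[state, :])
    state, reward, terminated, truncated, info = watch_env.step(action)
    done = terminated or truncated
watch_env.close()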
Q7
Why doesn't this algorithm work for an 8x8 map size?
env = gym.make("FrozenLake-v1", map_name="8x8", is_slippery=True, render_mode="ansi", desc=None)
It appears that the agent is being trained, but it fails to learn anything.
*** Average reward per thousand episodes ***
1000 : 0.0
2000 : 0.0
3000 : 0.0
4000 : 0.0
5000 : 0.0
6000 : 0.0
7000 : 0.0
8000 : 0.0
9000 : 0.0
10000 : 0.0
I can't seem to figure out what needs to be modified. The q_table is the correct size:
q_table = np.zeros((state_space_size, action_space_size))
I've also increased max_steps_per_episode to 3000. Is it possible that num_episodes = 10000 isn't sufficient?
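For what it's worth, these are the settings I'm planning to try next for the 8x8 map. This is pure guesswork on my part (more episodes and a much slower exploration decay, since the 8x8 grid has 64 states instead of 16):

num_episodes = 100000            # guess: 10000 episodes may simply be too few for 64 states
exploration_decay_rate = 0.0001  # guess: keep exploring much longer before acting greedily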
By the way, here is the full code I have so far:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import numpy as np
#import gym
import gymnasium as gym
import random
import time
from warnings import filterwarnings
#filterwarnings(action='ignore', category=DeprecationWarning, message='np.bool8 is a deprecated alias')
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True, render_mode="ansi", desc=None)
#env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True, render_mode="human", desc=None)
action_space_size = env.action_space.n
state_space_size = env.observation_space.n
#print(action_space_size)
#print(state_space_size)
q_table = np.zeros((state_space_size, action_space_size))
num_episodes = 10000
max_steps_per_episode = 3000
learning_rate = 0.11
discount_rate = 0.99
exploration_rate = 1
max_exploration_rate = 0.99
min_exploration_rate = 0.01
exploration_decay_rate = 0.002
rewards_all_episodes = []
# Q-Learning Algorithm
for episode in range(num_episodes):
    print("*** Episode ", episode+1, " ***", end="\r")
    state = env.reset()[0]
    done = False
    rewards_current_episode = 0

    for step in range(max_steps_per_episode):
        # Exploration-Exploitation Trade-off
        exploration_rate_threshold = random.uniform(0, 1)
        if exploration_rate_threshold > exploration_rate:
            action = np.argmax(q_table[state, :])
        else:
            action = env.action_space.sample()

        new_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

        # Update Q-Table for Q(s,a)
        q_table[state, action] = q_table[state, action] * (1 - learning_rate) + learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))

        state = new_state
        rewards_current_episode += reward

        if done:
            break

    # Exploration rate decay
    exploration_rate = min_exploration_rate + (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)

    rewards_all_episodes.append(rewards_current_episode)
# Calculate and print the average reward per thousand episodes
rewards_per_thousand_episodes = np.split(np.array(rewards_all_episodes), num_episodes // 1000)
count = 1000
print("*** Average reward per thousand episodes ***\n")
for r in rewards_per_thousand_episodes:
    print(count, ": ", str(sum(r/1000)))
    count += 1000
# Print updated Q-Table
print("\n\n*** Updated Q-Table ***\n")
print(q_table)
input("Press Enter to continue...")
num_episodes = 5
max_steps_per_episode = 99
for episode in range(num_episodes):
    state = env.reset()[0]
    done = False

    for step in range(max_steps_per_episode):
        print("\033c")
        print("*** Episode ", episode+1, " ***\n", step)
        print(env.render())
        time.sleep(0.1)

        action = np.argmax(q_table[state, :])
        new_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

        # Update Q-Table for Q(s,a)
        q_table[state, action] = q_table[state, action] * (1 - learning_rate) + learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))

        state = new_state
        rewards_current_episode += reward

        if done:
            print("\033c")
            print("*** Episode ", episode+1, " ***\n")
            print(env.render())
            if reward == 1:
                print("*** You reached the goal! ***")
                time.sleep(3)
            else:
                print("*** You fell through a hole! ***")
                time.sleep(3)
            #clear_output(wait=True)
            break

        if step == max_steps_per_episode - 1:
            print("*** You failed to complete the task! ***")
            time.sleep(3)