
As I'm new to the AI/ML field, I'm still learning from various online materials. In this particular instance, I've been studying the Reinforcement Learning tutorial by deeplizard, specifically focusing on videos 8 through 10.

After attempting to replicate the example that demonstrates how to train an agent in the gym's FrozenLake environment, I encountered some unexpected issues. Although I successfully debugged the code and got the algorithm to work, I can't confidently say that I fully understand whether this is the correct approach. Therefore, I have specific questions related to this exercise that I intend to post within this single thread. Please feel free to answer any of the questions at your convenience.

Q1

What is this gym environment warning all about, and how should it be properly addressed?

/home/boris/.local/lib/python3.11/site-packages/gym/utils/passive_env_checker.py:233: DeprecationWarning: `np.bool8` is a deprecated alias for `np.bool_`.  (Deprecated NumPy 1.24)
  if not isinstance(terminated, (bool, np.bool8)):

So far, I've discovered two methods to handle this warning: One is to simply suppress the warning as follows:

from warnings import filterwarnings
filterwarnings(action='ignore', category=DeprecationWarning, message='`np.bool8` is a deprecated alias')

Another is to replace the gym environment with the gymnasium environment, which does not produce this warning.

#import gym
import gymnasium as gym

This brings me to my second question.

Q2

Why are there two environments, gym and gymnasium, that do the same thing? Most online examples use gym, but I believe gymnasium is a better choice. Am I mistaken?

Q3

What's the distinction between "terminated" and "truncated"? I've made numerous attempts, but I couldn't find a situation where they diverge. They consistently evaluate to either both True or both False.

In the current 'gym' version, I have to use "FrozenLake-v1" (originally, "FrozenLake-v0" was used in the tutorial), and this modification causes the code to break because env.step() now returns 5 values instead of 4. Initially, I fixed this issue as follows:

new_state, reward, done, UNKNOWN, info = env.step(action)

I discovered later that the correct name for this variable is "truncated." Consequently, I adjusted the code as follows:

new_state, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated

Nonetheless, I remain perplexed regarding the usage of "terminated" and "truncated." While experimenting with various algorithm options, such as max_steps_per_episode, I was unable to get them to differ. In all cases, they either both evaluate to True or both to False.

Q4

If I want to continue training the agent in stages and test its intermediate performance, how can I do that? I was thinking it's just a matter of saving the q_table, then loading the saved q_table and replacing the current q_table initialization with the new one before resuming training. Or is it not that simple?

saved_q_table = anydbm.open(filename, 'r')

#q_table = np.zeros((state_space_size, action_space_size))
q_table = pickle.loads(saved_q_table)

Q5

Is there a way to get some documentation or help about the environment?

I'm seeking a resource that I can print and reference while coding to gain a better understanding of the environment's parameters. For instance, in scikit-learn, you can achieve this with the following code:

from sklearn.neighbors import KNeighborsRegressor
from pprint import pprint
pprint(KNeighborsRegressor().get_params())

Resulting in the following output:

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}

It would be great if one could get relevant agent/states parameters from the gym environments in a similar way.

Q6

How to use render_mode="human" correctly? Currently, I'm using render_mode="ansi" and rendering the environment as follows:

env = gym.make("FrozenLake-v1", map_name="8x8", is_slippery=True, render_mode="ansi", desc=None)

print("\033c") env.render() print(env.render())

However, when I switch to render_mode="human", the environment automatically displays without the need for env.render(). Consequently, the environment renders during training as well, leading to extremely slow training. How can I disable this automatic rendering when using render_mode="human"?

Q7

Why doesn't this algorithm work for an 8x8 map size?

env = gym.make("FrozenLake-v1", map_name="8x8", is_slippery=True, render_mode="ansi", desc=None)

It appears that the agent is being trained, but it fails to learn anything.

*** Average reward per thousand episodes ***

1000 :  0.0
2000 :  0.0
3000 :  0.0
4000 :  0.0
5000 :  0.0
6000 :  0.0
7000 :  0.0
8000 :  0.0
9000 :  0.0
10000 :  0.0

I can't seem to figure out what needs to be modified. The q_table is the correct size:

q_table = np.zeros((state_space_size, action_space_size))

I've also increased max_steps_per_episode to 3000. Is it possible that num_episodes = 10000 isn't sufficient?

By the way, here is the full code I have so far:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import numpy as np
#import gym
import gymnasium as gym
import random
import time
from warnings import filterwarnings

#filterwarnings(action='ignore', category=DeprecationWarning, message='`np.bool8` is a deprecated alias')

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True, render_mode="ansi", desc=None)
#env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True, render_mode="human", desc=None)

action_space_size = env.action_space.n
state_space_size = env.observation_space.n
#print(action_space_size)
#print(state_space_size)

q_table = np.zeros((state_space_size, action_space_size))

num_episodes = 10000
max_steps_per_episode = 3000

learning_rate = 0.11
discount_rate = 0.99

exploration_rate = 1
max_exploration_rate = 0.99
min_exploration_rate = 0.01
exploration_decay_rate = 0.002

rewards_all_episodes = []

# Q-Learning Algorithm
for episode in range(num_episodes):
    print("*** Episode ", episode+1, " ***", end="\r")
    state = env.reset()[0]

    done = False
    rewards_current_episode = 0

    for step in range(max_steps_per_episode):

        # Exploration-Exploitation Trade-off
        exploration_rate_threshold = random.uniform(0, 1)
        if exploration_rate_threshold > exploration_rate:
            action = np.argmax(q_table[state, :])
        else:
            action = env.action_space.sample()
        new_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

        # Update Q-Table for Q(s,a)
        q_table[state, action] = q_table[state, action] * (1 - learning_rate) + learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))

        state = new_state
        rewards_current_episode += reward

        if done == True:
            break

    # Exploration rate decay
    exploration_rate = min_exploration_rate + (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)

    rewards_all_episodes.append(rewards_current_episode)

# Calculate and print the average reward per thousand episodes
rewards_per_thousand_episodes = np.split(np.array(rewards_all_episodes), num_episodes/1000)
count = 1000
print("*** Average reward per thousand episodes ***\n")
for r in rewards_per_thousand_episodes:
    print(count, ": ", str(sum(r/1000)))
    count += 1000

# Print updated Q-Table
print("\n\n*** Updated Q-Table ***\n")
print(q_table)
input("Press Enter to continue...")

num_episodes = 5
max_steps_per_episode = 99

for episode in range(num_episodes):
    state = env.reset()[0]

    done = False

    for step in range(max_steps_per_episode):
        print("\033c")
        print("*** Episode ",  episode+1, " ***\n", step)
        env.render()
        print(env.render())
        time.sleep(0.1)

        action = np.argmax(q_table[state, :])
        new_state, reward, done, unknown, info = env.step(action)

        # Update Q-Table for Q(s,a)
        q_table[state, action] = q_table[state, action] * (1 - learning_rate) + learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))

        state = new_state
        rewards_current_episode += reward

        if done == True:
            print("\033c")
            print("*** Episode ",  episode+1, " ***\n")
            env.render()
            print(env.render())
            if reward == 1:
                print("*** You reached the goal! ***")
                time.sleep(3)
            else:
                print("*** You fell through a hole! ***")
                time.sleep(3)
            #clear_output(wait=True)
            break

        if step == max_steps_per_episode - 1:
            print("*** You failed to complete the task! ***")
            time.sleep(3)
Boris L.

1 Answer


Q1. What is this gym environment warning all about, and how should it be properly addressed?

The development of gym has ended, I think. Anyway, the warning comes from an old usage of NumPy inside gym (its passive_env_checker still touches np.bool8), so you probably have a newer version of NumPy installed than the one gym was written against.
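As a trivial check (nothing gym-specific, just confirming the cause described above), you can print the installed NumPy version:

import numpy as np

# NumPy 1.24+ flags np.bool8 as a deprecated alias for np.bool_,
# which is exactly the line in gym's passive_env_checker that triggers the warning.
print(np.__version__)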

Q2. Why are there two environments, gym and gymnasium, that do the same thing?

In principle, they do the same thing, but gym was developed by OpenAI and gymnasium by the Farama Foundation, which basically inherited gym. You should use gymnasium from now on, since it is the one in active development. This means that initially gym and gymnasium were almost identical, but now they are diverging, even if the underlying concepts are still the same.

Q3. What's the distinction between "terminated" and "truncated"?

truncated indicates that the episode ended due to a time limit, i.e. the current timestep has exceeded the time limit, so the episode is cut off. This also indicates that you have to bootstrap the Q-value estimate.

terminated, instead, is when the agent has reached a terminal state, so the episode naturally ends. You should use both to understand when an episode is over, e.g. if truncated or terminated: ..., but the two flags are useful to determine whether or not to bootstrap: you don't do that on termination, for example.
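To actually see the two flags diverge, one trick (just a sketch, using gymnasium's max_episode_steps argument, which installs a TimeLimit wrapper) is to give the environment a very small step budget, so the episode is usually cut off before a hole or the goal is reached:

import gymnasium as gym

# Tiny time limit: the TimeLimit wrapper sets truncated=True after 3 steps,
# unless the agent happens to reach a terminal state (hole or goal) first.
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True, max_episode_steps=3)

state, info = env.reset(seed=0)
terminated = truncated = False
while not (terminated or truncated):
    state, reward, terminated, truncated, info = env.step(env.action_space.sample())

print("terminated:", terminated, "truncated:", truncated)
env.close()

With the default time limit for FrozenLake-v1 (100 steps, if I recall correctly) and a slippery floor, episodes almost always end in a hole or at the goal first, which is why you kept seeing the two flags agree.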

Q4. If I want to continue training the agent in stages and test its intermediate performance, how can I do that?

You need to save the Q-table (or, in general, the weights/parameters of a function approximator like a neural net), but also the current value of epsilon (if you use a schedule, i.e. if it changes across episodes) and similar state. Then you instantiate the agent, load the various parameters, and continue training where you left off.
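A minimal sketch of that idea, assuming plain NumPy files (the file names are made up); note that if you keep the exponential decay formula from your code, you also need to remember the episode index you stopped at, since the schedule depends on it:

import numpy as np

# At the end of a training stage: persist the Q-table and the schedule state.
np.save("q_table.npy", q_table)
np.save("train_state.npy", np.array([exploration_rate, episode]))

# Later, restore instead of re-initialising with np.zeros, then keep training.
q_table = np.load("q_table.npy")
exploration_rate, last_episode = np.load("train_state.npy")   # last_episode: where the decay left off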

Q5. Is there a way to get some documentation or help about the environment?

Try help(env), the official website (https://gymnasium.farama.org/index.html), or read the env's source code.
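A few things worth printing, in the same spirit as the sklearn get_params() example (just what I would check, not an exhaustive list):

import gymnasium as gym
from pprint import pprint

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True)

print(env.action_space)        # Discrete(4)
print(env.observation_space)   # Discrete(16)
pprint(env.metadata)           # e.g. the supported render modes
print(env.spec)                # the id and kwargs this env was created with
help(env.unwrapped)            # docstring of the underlying FrozenLake environment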

Q6. How to use render_mode="human" correctly?

If you set render_mode="human", gymnasium will render at each step() and even at reset(): this is something that gym did not use to do.

I'm not sure, but you may need to wrap the env with gymnasium.wrappers.RenderCollection (docs) to trigger the rendering only when render() is called - I haven't tested that, so I could be wrong.
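An alternative that sidesteps the problem entirely (a common pattern rather than an official API, so treat it as a sketch): train on an environment created without rendering, and build a second, human-rendered environment only to watch the learned policy, reusing the q_table from your code:

import gymnasium as gym
import numpy as np

train_env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True)   # no window, fast training
watch_env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True,
                     render_mode="human")                                  # renders on every step()

# ... train on train_env as usual, then watch the greedy policy:
state, info = watch_env.reset()
done = False
while not done:
    action = int(np.argmax(q_table[state, :]))          # greedy action from the trained Q-table
    state, reward, terminated, truncated, info = watch_env.step(action)
    done = terminated or truncated
watch_env.close()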

Luca Anzalone