Migration Guide - v0.21 to v1.0.0

Who Should Read This Guide?

If you are new to Gymnasium: you can skip this page! This guide is for users migrating from older versions of OpenAI Gym. If you are just getting started with reinforcement learning, head to Basic Usage instead.

If you are migrating from OpenAI Gym: this guide will help you update your code to work with Gymnasium. The changes are significant, but once you understand the reasoning behind them, they are straightforward.

If you are updating old tutorials: many online reinforcement learning tutorials use the old v0.21 API. This guide shows you how to modernize that code.

Why Did the API Change?

Gymnasium is a fork of OpenAI Gym v0.26, which introduced breaking changes that are incompatible with Gym v0.21. These changes were not made lightly; they fix significant problems that made reinforcement learning research and development harder.

The main problems with the old API included:

  • Ambiguous episode endings: a single done flag could not distinguish "task completed" from "time limit reached"

  • Inconsistent seeding: random number generation was unreliable and hard to reproduce

  • Rendering complexity: switching between different visual modes was needlessly complicated

  • Reproducibility problems: subtle bugs made research results difficult to replicate

For environments that still use the v0.21 API, see the compatibility guide.

Quick Reference: Complete Change Table

| Component | v0.21 (old) | v0.26+ (new) | Impact |
| --- | --- | --- | --- |
| Package import | `import gym` | `import gymnasium as gym` | All code |
| Environment reset | `obs = env.reset()` | `obs, info = env.reset()` | Training loops |
| Random seeding | `env.seed(42)` | `env.reset(seed=42)` | Reproducibility |
| Step function | `obs, reward, done, info = env.step(action)` | `obs, reward, terminated, truncated, info = env.step(action)` | RL algorithms |
| Episode end | `while not done:` | `while not (terminated or truncated):` | Training loops |
| Render mode | `env.render(mode="human")` | `gym.make(env_id, render_mode="human")` | Visualization |
| Time limit detection | `info.get('TimeLimit.truncated')` | `truncated` return value | RL algorithms |
| Value bootstrapping | `target = reward + (1-done) * gamma * next_value` | `target = reward + (1-terminated) * gamma * next_value` | RL correctness |

Side-by-Side Code Comparison

Old v0.21 code

import gym

# Environment creation and seeding
env = gym.make("LunarLander-v3")
env.seed(123)
observation = env.reset()

# Training loop
done = False
while not done:
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)
    env.render(mode="human")

env.close()

New v0.26+ code (including v1.0.0)

import gymnasium as gym  # Note: 'gymnasium' not 'gym'

# Environment creation with render mode specified upfront
env = gym.make("LunarLander-v3", render_mode="human")

# Reset with seed parameter
observation, info = env.reset(seed=123, options={})

# Training loop with terminated/truncated distinction
done = False
while not done:
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)

    # Episode ends if either terminated OR truncated
    done = terminated or truncated

env.close()

Key Changes in Detail

1. Package Name Change

Old: `import gym`  New: `import gymnasium as gym`

Why: Gymnasium is an independent project that maintains and improves the original Gym codebase.

# Update your imports
# OLD
import gym

# NEW
import gymnasium as gym

2. Seeding and Random Number Generation

The biggest conceptual change is how randomness is handled.

Old v0.21: separate seed() method

env = gym.make("CartPole-v1")
env.seed(42)  # Set random seed
obs = env.reset()  # Reset environment

New v0.26+: seed passed through reset()

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)  # Seed and reset together

Why it changed: some environments (especially emulated games) can only set their random state at the start of an episode, not mid-episode. The old approach could lead to inconsistent behavior.

Practical impact:

# OLD: Seeding applied to all future episodes
env.seed(42)
for episode in range(10):
    obs = env.reset()

# NEW: Each episode can have its own seed
for episode in range(10):
    obs, info = env.reset(seed=42 + episode)  # Each episode gets unique seed
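
Because seeding and resetting now happen in one call, reproducibility is straightforward to verify. A minimal sketch (CartPole-v1 is used here only as an example; any environment with a deterministic seeded reset behaves the same way):

import gymnasium as gym
import numpy as np

env = gym.make("CartPole-v1")

# Resetting twice with the same seed yields identical initial observations
obs_a, _ = env.reset(seed=42)
obs_b, _ = env.reset(seed=42)
assert np.array_equal(obs_a, obs_b)

env.close()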

3. Environment Reset Changes

Old v0.21: returns only the observation

observation = env.reset()

New v0.26+: returns the observation and an info dictionary

observation, info = env.reset()

Why it changed:

  • info provides consistent access to debugging information

  • The seed parameter enables reproducible episodes

  • The options parameter allows episode-specific configuration (see the sketch after the patterns below)

Common migration patterns:

# If you don't need the new features, just unpack the tuple
obs, _ = env.reset()  # Ignore info with underscore

# If you want to maintain the same random behavior as v0.21
env.reset(seed=42)  # Set seed once
# Then for subsequent resets:
obs, info = env.reset()  # Uses internal random state
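
To illustrate the options parameter: what an environment accepts through it is environment-specific, so check the documentation of your target environment first. As an example, Gymnasium's CartPole-v1 accepts bounds for its uniformly sampled initial state:

import gymnasium as gym

env = gym.make("CartPole-v1")

# Episode-specific configuration via `options`
# (here: narrower bounds for the initial state sampling)
obs, info = env.reset(seed=7, options={"low": -0.01, "high": 0.01})

env.close()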

4. Step Function: Splitting done into terminated/truncated

This is the most important change for training algorithms.

Old v0.21: a single done flag

obs, reward, done, info = env.step(action)

New v0.26+: separate terminated and truncated flags

obs, reward, terminated, truncated, info = env.step(action)

Why this matters:

  • terminated: the episode ended because the task was completed or failed (the agent reached the goal, died, etc.)

  • truncated: the episode ended because of an external limit (time limit, step limit, etc.)

This distinction is critical for value function bootstrapping in reinforcement learning algorithms:

# OLD (ambiguous)
if done:
    # Should we bootstrap? We don't know if this was natural termination or time limit!
    next_value = 0  # Assumption that may be wrong

# NEW (clear)
if terminated:
    next_value = 0      # Natural ending - no future value
elif truncated:
    next_value = value_function(next_obs)  # Time limit - estimate future value

Migration strategies:

# Simple migration (works for many cases)
obs, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated

# Better migration (preserves RL algorithm correctness)
obs, reward, terminated, truncated, info = env.step(action)
if terminated:
    # Episode naturally ended - use reward as-is
    target = reward
elif truncated:
    # Episode cut short - may need to estimate remaining value
    target = reward + discount * estimate_value(obs)
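
One place the distinction matters in practice is experience replay: store terminated for bootstrapping, and use terminated or truncated only to decide when to reset. A minimal sketch with a random policy and a plain deque standing in for a real replay buffer:

import gymnasium as gym
from collections import deque

env = gym.make("CartPole-v1")
buffer = deque(maxlen=10_000)  # stand-in for a real replay buffer

obs, info = env.reset(seed=0)
for _ in range(1_000):
    action = env.action_space.sample()
    next_obs, reward, terminated, truncated, info = env.step(action)

    # Store `terminated`, not `terminated or truncated`, so the learner
    # can still bootstrap from states that were merely cut off by a limit
    buffer.append((obs, action, reward, next_obs, terminated))

    if terminated or truncated:
        obs, info = env.reset()
    else:
        obs = next_obs

env.close()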

For more information, see our blog post on this topic.

5. Render Mode Changes

Old v0.21: specify the render mode on every call

env = gym.make("CartPole-v1")
env.render(mode="human")     # Visual window
env.render(mode="rgb_array") # Get pixel array

New v0.26+: render mode fixed at creation time

env = gym.make("CartPole-v1", render_mode="human")     # For visual display
env = gym.make("CartPole-v1", render_mode="rgb_array") # For recording
env.render()  # Uses the mode specified at creation

Why it changed: some environments cannot switch render modes at runtime. Fixing the mode at creation time enables better optimization and prevents errors.

Practical impact:

# OLD: Could switch modes dynamically
env = gym.make("CartPole-v1")
for episode in range(10):
    # ... episode code ...
    if episode % 10 == 0:
        env.render(mode="human")  # Show every 10th episode

# NEW: Create separate environments for different purposes
training_env = gym.make("CartPole-v1")  # No rendering for speed
eval_env = gym.make("CartPole-v1", render_mode="human")  # Visual for evaluation

# Or use None for no rendering, then create visual env when needed
env = gym.make("CartPole-v1", render_mode=None)  # Fast training
if need_visualization:
    visual_env = gym.make("CartPole-v1", render_mode="human")
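
Because the mode is fixed at creation, recording videos is typically done by creating an rgb_array environment and wrapping it. A sketch using Gymnasium's RecordVideo wrapper (the video_folder path and the record-every-episode trigger are arbitrary choices):

import gymnasium as gym
from gymnasium.wrappers import RecordVideo

# rgb_array mode is required so the wrapper can capture frames
env = gym.make("CartPole-v1", render_mode="rgb_array")
env = RecordVideo(env, video_folder="videos", episode_trigger=lambda ep: True)

obs, info = env.reset(seed=0)
terminated = truncated = False
while not (terminated or truncated):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())

env.close()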

TimeLimit Wrapper Changes

The behavior of the TimeLimit wrapper has also changed to match the new termination model.

Old v0.21: added TimeLimit.truncated to the info dictionary

obs, reward, done, info = env.step(action)
if done and info.get('TimeLimit.truncated', False):
    # Episode ended due to the time limit
    pass

New v0.26+: use the truncated return value

obs, reward, terminated, truncated, info = env.step(action)
if truncated:
    # Episode ended due to time limit (or other truncation)
    pass
if terminated:
    # Episode ended naturally (success/failure)
    pass

This makes time limit detection clearer and more explicit.
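
If you need a different limit than an environment's registered default, it can be overridden at creation time; the wrapper then reports truncated=True once the limit is hit. A short sketch (the limit of 100 steps is arbitrary):

import gymnasium as gym

# Override the registered time limit when creating the environment
env = gym.make("CartPole-v1", max_episode_steps=100)

obs, info = env.reset(seed=0)
steps = 0
terminated = truncated = False
while not (terminated or truncated):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    steps += 1

# At most 100 steps; `truncated` is True only if the limit ended the episode
print(steps, terminated, truncated)

env.close()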


Updating Your Training Code

Basic Training Loop Migration

Old v0.21 pattern:
for episode in range(num_episodes):
    obs = env.reset()
    done = False

    while not done:
        action = agent.get_action(obs)
        next_obs, reward, done, info = env.step(action)

        # Train agent (this may have bugs due to ambiguous 'done')
        agent.learn(obs, action, reward, next_obs, done)
        obs = next_obs

New v0.26+ pattern:

for episode in range(num_episodes):
    obs, info = env.reset(seed=episode)  # Optional: unique seed per episode
    terminated, truncated = False, False

    while not (terminated or truncated):
        action = agent.get_action(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)

        # Train agent with proper termination handling
        agent.learn(obs, action, reward, next_obs, terminated)
        obs = next_obs

Q-Learning Update Migration

Old v0.21 (potentially incorrect):

def update_q_value(obs, action, reward, next_obs, done):
    if done:
        target = reward  # Assumes all episode endings are natural terminations
    else:
        target = reward + gamma * max(q_table[next_obs])

    q_table[obs][action] += lr * (target - q_table[obs][action])

New v0.26+ (correct):

def update_q_value(obs, action, reward, next_obs, terminated):
    if terminated:
        # Natural termination - no future value
        target = reward
    else:
        # Episode continues - truncation has no impact on the possible future value
        target = reward + gamma * max(q_table[next_obs])

    q_table[obs][action] += lr * (target - q_table[obs][action])

Deep RL Framework Migration

Most libraries have already been updated; consult their documentation for details.

环境特定更改

已移除的环境

部分环境已被移动或移除

# OLD: Robotics environments in main gym
import gym
env = gym.make("FetchReach-v1")  # No longer available

# NEW: Moved to separate package
import gymnasium

import gymnasium_robotics
import ale_py

gymnasium.register_envs(gymnasium_robotics)
gymnasium.register_envs(ale_py)

env = gymnasium.make("FetchReach-v1")
env = gymnasium.make("ALE/Pong-v5")

Compatibility Helpers

Using Legacy Environments

If you need to use an environment that has not yet been updated to the new API:

# For environments that still use the legacy gym API
env = gym.make("GymV21Environment-v0", env_id="OldEnv-v0")

# This wrapper automatically converts the old API to the new one

For more details, see the compatibility guide.
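
A slightly fuller sketch of the same idea (this assumes the optional shimmy package, which provides these compatibility environments, is installed; "OldEnv-v0" is a placeholder id, and a GymV26Environment-v0 counterpart exists for v0.26-era environments):

import gymnasium as gym

# Wrap a legacy v0.21-era environment behind the modern API
env = gym.make("GymV21Environment-v0", env_id="OldEnv-v0")

# The wrapped environment now follows the v0.26+ conventions
obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
env.close()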

Testing Your Migration

After migrating, verify the following:

  • [ ] Import statements use gymnasium instead of gym

  • [ ] Reset calls handle the (obs, info) return format

  • [ ] Step calls handle terminated and truncated separately

  • [ ] The render mode is specified at environment creation

  • [ ] Seeding uses the seed parameter of reset()

  • [ ] Training algorithms correctly distinguish between the termination types

For custom environments, use from gymnasium.utils.env_checker import check_env to validate the implementation, as sketched below.
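
A minimal sketch of that check (CartPole-v1 stands in for your own environment; .unwrapped is used so the raw environment is checked rather than the wrappers gym.make applies):

import gymnasium as gym
from gymnasium.utils.env_checker import check_env

env = gym.make("CartPole-v1")
check_env(env.unwrapped)  # raises an error if the environment violates the API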

Getting Help

If you run into problems during migration:

  1. Check the compatibility guide: some old environments can be used through compatibility wrappers

  2. Check the environment documentation: individual environments may have specific migration notes

  3. Test with simple environments first: start with CartPole before moving on to complex environments

  4. Compare old and new behavior: run the same code under both APIs to understand the differences

Common resources: