Note

This example is compatible with Gymnasium version 1.2.0.

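Create a Custom Environment¶

This tutorial shows how to create a new environment and links to relevant wrappers, utilities and tests included in Gymnasium.

Setup¶

Alternative solution¶

Install Copier with Pip or Conda: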

pip install copier

or

conda install -c conda-forge copier

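Generate your environment¶

You can check that Copier has been installed correctly by running the following command, which should output a version number: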

copier --version

Then run the following command, replacing the string path/to/directory with the path to the directory where you want to create your new project:

copier copy https://github.com/Farama-Foundation/gymnasium-env-template.git "path/to/directory"

Answer the questions. When the command finishes, you should have a project structure like the following:

.
├── gymnasium_env
│   ├── envs
│   │   ├── grid_world.py
│   │   └── __init__.py
│   ├── __init__.py
│   └── wrappers
│       ├── clip_reward.py
│       ├── discrete_actions.py
│       ├── __init__.py
│       ├── reacher_weighted_reward.py
│       └── relative_position.py
├── LICENSE
├── pyproject.toml
└── README.md

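Subclassing gymnasium.Env¶

Before learning how to create your own environment, you should check out the documentation of Gymnasium's API.

To illustrate the process of subclassing gymnasium.Env, we will implement a very simple game called GridWorldEnv. We will write the code for our custom environment in gymnasium_env/envs/grid_world.py. The environment consists of a two-dimensional square grid of fixed size (specified via the size parameter during construction). In each timestep, the agent can move vertically or horizontally between grid cells. The goal of the agent is to navigate to a target that is placed randomly on the grid at the beginning of the episode.

Observations provide the location of the target and the agent.
There are 4 actions in our environment, corresponding to the movements "right", "up", "left", and "down".
A termination signal is issued as soon as the agent has navigated to the grid cell where the target is located.
Rewards are binary and sparse, meaning that the immediate reward is always zero unless the agent has reached the target, in which case it is 1.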
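The rules listed above can be sketched in a few lines of plain Python before any Gymnasium machinery is involved. This is only an illustration: the names MOVES and apply_action are hypothetical and not part of the template, and clamping at the border is an assumption taken from the np.clip call in the step method shown later in this tutorial.

```python
# Hypothetical stand-alone encoding of the GridWorld rules (illustrative names only).
MOVES = {0: (1, 0), 1: (0, 1), 2: (-1, 0), 3: (0, -1)}  # right, up, left, down


def apply_action(agent, target, action, size):
    """Return (new_agent, reward, terminated) after one move on a size x size grid."""
    dx, dy = MOVES[action]
    # Clamp each coordinate to [0, size - 1] so the agent cannot leave the grid.
    x = min(max(agent[0] + dx, 0), size - 1)
    y = min(max(agent[1] + dy, 0), size - 1)
    new_agent = (x, y)
    terminated = new_agent == target  # the episode ends on the target cell
    reward = 1 if terminated else 0   # binary, sparse reward
    return new_agent, reward, terminated


agent, target = (0, 0), (1, 1)
for action in (0, 1):  # move right, then up
    agent, reward, terminated = apply_action(agent, target, action, size=5)
    print(agent, reward, terminated)
# (1, 0) 0 False
# (1, 1) 1 True
```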

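In a rendered episode of this environment (with size=5), the blue dot is the agent and the red square represents the target.

Let us look at the source code of GridWorldEnv piece by piece.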

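Declaration and Initialization¶

Our custom environment will inherit from the abstract class gymnasium.Env. You shouldn't forget to add the metadata attribute to your class. There, you should specify the render modes supported by your environment (e.g., "human", "rgb_array", "ansi") and the framerate at which your environment should be rendered. Every environment should support None as render mode; you don't need to add it to the metadata. In GridWorldEnv, we will support the modes "rgb_array" and "human" and render at 4 FPS.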
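As a small illustration of how the metadata entry and the implicit None mode interact, the check could be written as a stand-alone helper. The function name check_render_mode is hypothetical; the constructor shown in this tutorial performs the same test with a plain assert.

```python
# Same metadata as GridWorldEnv; the helper itself is an illustrative assumption.
METADATA = {"render_modes": ["human", "rgb_array"], "render_fps": 4}


def check_render_mode(render_mode, metadata=METADATA):
    """Accept None or any mode listed in metadata["render_modes"]; reject the rest."""
    if render_mode is not None and render_mode not in metadata["render_modes"]:
        raise ValueError(
            f"Unsupported render_mode {render_mode!r}; "
            f"expected None or one of {metadata['render_modes']}"
        )
    return render_mode


check_render_mode(None)         # always allowed, even though it is not listed
check_render_mode("rgb_array")  # listed in the metadata
```

Raising ValueError instead of asserting keeps the check active when Python runs with optimizations enabled, which is one reason a helper like this might be preferred.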

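The __init__ method of our environment will accept the integer size, which determines the size of the square grid. We will set up some variables for rendering and define self.observation_space and self.action_space. In our case, observations should provide information about the location of the agent and the target on the two-dimensional grid. We will choose to represent observations as dictionaries with the keys "agent" and "target". An observation may look like {"agent": array([1, 0]), "target": array([0, 3])}. Since we have 4 actions in our environment ("right", "up", "left", "down"), we will use Discrete(4) as the action space. Here is the declaration of GridWorldEnv and the implementation of __init__: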

# gymnasium_env/envs/grid_world.py
from enum import Enum

import numpy as np
import pygame

import gymnasium as gym
from gymnasium import spaces


class Actions(Enum):
    RIGHT = 0
    UP = 1
    LEFT = 2
    DOWN = 3


class GridWorldEnv(gym.Env):
    metadata = {"render_modes": ["human", "rgb_array"], "render_fps": 4}

    def __init__(self, render_mode=None, size=5):
        self.size = size  # The size of the square grid
        self.window_size = 512  # The size of the PyGame window

        # Observations are dictionaries with the agent's and the target's location.
        # Each location is a pair of integer coordinates in {0, ..., size - 1}^2, encoded as a Box of shape (2,).
        self.observation_space = spaces.Dict(
            {
                "agent": spaces.Box(0, size - 1, shape=(2,), dtype=int),
                "target": spaces.Box(0, size - 1, shape=(2,), dtype=int),
            }
        )
        self._agent_location = np.array([-1, -1], dtype=int)
        self._target_location = np.array([-1, -1], dtype=int)

        # We have 4 actions, corresponding to "right", "up", "left", "down"
        self.action_space = spaces.Discrete(4)

        """
        The following dictionary maps abstract actions from `self.action_space` to
        the direction we will walk in if that action is taken.
        i.e. 0 corresponds to "right", 1 to "up" etc.
        """
        self._action_to_direction = {
            Actions.RIGHT.value: np.array([1, 0]),
            Actions.UP.value: np.array([0, 1]),
            Actions.LEFT.value: np.array([-1, 0]),
            Actions.DOWN.value: np.array([0, -1]),
        }

        assert render_mode is None or render_mode in self.metadata["render_modes"]
        self.render_mode = render_mode

        """
        If human-rendering is used, `self.window` will be a reference
        to the window that we draw to. `self.clock` will be a clock that is used
        to ensure that the environment is rendered at the correct framerate in
        human-mode. They will remain `None` until human-mode is used for the
        first time.
        """
        self.window = None
        self.clock = None

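Constructing Observations From Environment States¶

Since we will need to compute observations both in reset and step, it is often convenient to have a (private) method _get_obs that translates the environment's state into an observation. However, this is not mandatory, and you may as well compute observations in reset and step separately.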

def _get_obs(self):
    return {"agent": self._agent_location, "target": self._target_location}

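We can also implement a similar method for the auxiliary information that is returned by step and reset. In our case, we would like to provide the Manhattan distance between the agent and the target: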

def _get_info(self):
    return {
        "distance": np.linalg.norm(
            self._agent_location - self._target_location, ord=1
        )
    }
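As a quick check of what ord=1 computes, the sketch below compares the norm with the sum of absolute coordinate differences, using the example locations from earlier in this tutorial. The variable names are illustrative.

```python
import numpy as np

agent = np.array([1, 0])
target = np.array([0, 3])

by_norm = np.linalg.norm(agent - target, ord=1)  # 1-norm of the difference vector
by_hand = np.abs(agent - target).sum()           # |1 - 0| + |0 - 3|

print(by_norm, by_hand)  # 4.0 4
```

Note that np.linalg.norm returns a float, so the "distance" entry is a floating-point value even though the locations are integers.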

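Oftentimes, info will also contain some data that is only available inside the step method (e.g., individual reward terms). In that case, we would have to update the dictionary that is returned by _get_info in step.

Reset¶

The reset method will be called to initiate a new episode. You may assume that the step method will not be called before reset has been called. Moreover, reset should be called whenever an episode has ended. Users may pass the seed keyword to reset to initialize any random number generator used by the environment to a deterministic state. It is recommended to use the random number generator self.np_random that is provided by the environment's base class, gymnasium.Env. If you only use this RNG, you do not need to worry much about seeding, but you need to remember to call super().reset(seed=seed) to make sure that gymnasium.Env correctly seeds the RNG. Once this is done, we can randomly set the state of our environment. In our case, we randomly choose the agent's location, then sample the target's location repeatedly until it does not coincide with the agent's location.

The reset method should return a tuple of the initial observation and some auxiliary information. We can use the methods _get_obs and _get_info that we implemented earlier for that: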

def reset(self, seed=None, options=None):
    # We need the following line to seed self.np_random
    super().reset(seed=seed)

    # Choose the agent's location uniformly at random
    self._agent_location = self.np_random.integers(0, self.size, size=2, dtype=int)

    # We will sample the target's location randomly until it does not coincide with the agent's location
    self._target_location = self._agent_location
    while np.array_equal(self._target_location, self._agent_location):
        self._target_location = self.np_random.integers(
            0, self.size, size=2, dtype=int
        )

    observation = self._get_obs()
    info = self._get_info()

    if self.render_mode == "human":
        self._render_frame()

    return observation, info
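The effect of seeding can be tried out without the environment. The sketch below uses np.random.default_rng as a stand-in for self.np_random (which is a NumPy Generator) and repeats the sampling logic of reset; the function name sample_locations is hypothetical.

```python
import numpy as np


def sample_locations(rng, size):
    """Sample an agent cell, then resample the target until it differs from it."""
    agent = rng.integers(0, size, size=2, dtype=int)
    target = agent
    while np.array_equal(target, agent):
        target = rng.integers(0, size, size=2, dtype=int)
    return agent, target


# Two generators created with the same seed produce the same initial state.
first = sample_locations(np.random.default_rng(42), size=5)
second = sample_locations(np.random.default_rng(42), size=5)
print(first, second)
```

Because the upper bound of integers is exclusive, every sampled coordinate lies in {0, ..., size - 1}.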

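Step¶

The step method usually contains most of the logic of your environment. It accepts an action, computes the state of the environment after applying that action, and returns the 5-tuple (observation, reward, terminated, truncated, info). See gymnasium.Env.step(). Once the new state of the environment has been computed, we can check whether it is a terminal state and set terminated accordingly. Since we are using sparse binary rewards in GridWorldEnv, computing reward is trivial once we know terminated. The environment itself never truncates an episode, so truncated is always False. To gather observation and info, we can again make use of _get_obs and _get_info: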

def step(self, action):
    # Map the action (element of {0,1,2,3}) to the direction we walk in
    direction = self._action_to_direction[action]
    # We use `np.clip` to make sure we don't leave the grid
    self._agent_location = np.clip(
        self._agent_location + direction, 0, self.size - 1
    )
    # An episode is done iff the agent has reached the target
    terminated = np.array_equal(self._agent_location, self._target_location)
    reward = 1 if terminated else 0  # Binary sparse rewards
    observation = self._get_obs()
    info = self._get_info()

    if self.render_mode == "human":
        self._render_frame()

    return observation, reward, terminated, False, info
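The role of np.clip is easiest to see at the border of the grid. The following sketch applies the same direction vectors to an agent standing in a corner; the variable names are illustrative.

```python
import numpy as np

size = 5
directions = {
    0: np.array([1, 0]),   # right
    1: np.array([0, 1]),   # up
    2: np.array([-1, 0]),  # left
    3: np.array([0, -1]),  # down
}

corner = np.array([4, 0])
moved_right = np.clip(corner + directions[0], 0, size - 1)  # [5, 0] is clipped
moved_down = np.clip(corner + directions[3], 0, size - 1)   # [4, -1] is clipped
moved_left = np.clip(corner + directions[2], 0, size - 1)   # stays inside the grid

print(moved_right, moved_down, moved_left)  # [4 0] [4 0] [3 0]
```

An action that points into a wall therefore leaves the agent where it is, but it still counts as a step of the episode.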

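Rendering¶

Here, we are using PyGame for rendering. A similar approach to rendering is used in many environments that are included with Gymnasium, and you can use it as a skeleton for your own environments: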

def render(self):
    if self.render_mode == "rgb_array":
        return self._render_frame()

def _render_frame(self):
    if self.window is None and self.render_mode == "human":
        pygame.init()
        pygame.display.init()
        self.window = pygame.display.set_mode(
            (self.window_size, self.window_size)
        )
    if self.clock is None and self.render_mode == "human":
        self.clock = pygame.time.Clock()

    canvas = pygame.Surface((self.window_size, self.window_size))
    canvas.fill((255, 255, 255))
    pix_square_size = (
        self.window_size / self.size
    )  # The size of a single grid square in pixels

    # First we draw the target
    pygame.draw.rect(
        canvas,
        (255, 0, 0),
        pygame.Rect(
            pix_square_size * self._target_location,
            (pix_square_size, pix_square_size),
        ),
    )
    # Now we draw the agent
    pygame.draw.circle(
        canvas,
        (0, 0, 255),
        (self._agent_location + 0.5) * pix_square_size,
        pix_square_size / 3,
    )

    # Finally, add some gridlines
    for x in range(self.size + 1):
        pygame.draw.line(
            canvas,
            0,
            (0, pix_square_size * x),
            (self.window_size, pix_square_size * x),
            width=3,
        )
        pygame.draw.line(
            canvas,
            0,
            (pix_square_size * x, 0),
            (pix_square_size * x, self.window_size),
            width=3,
        )

    if self.render_mode == "human":
        # The following line copies our drawings from `canvas` to the visible window
        self.window.blit(canvas, canvas.get_rect())
        pygame.event.pump()
        pygame.display.update()

        # We need to ensure that human-rendering occurs at the predefined framerate.
        # The following line will automatically add a delay to keep the framerate stable.
        self.clock.tick(self.metadata["render_fps"])
    else:  # rgb_array
        return np.transpose(
            np.array(pygame.surfarray.pixels3d(canvas)), axes=(1, 0, 2)
        )
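The final np.transpose is needed because pygame.surfarray.pixels3d indexes pixels as [x, y], while image arrays are conventionally indexed as [row, column], that is [y, x]. The sketch below reproduces the axis swap on a plain NumPy array with a deliberately non-square shape; the array is a stand-in for the surface data, not real PyGame output.

```python
import numpy as np

width, height = 6, 4
pixels = np.zeros((width, height, 3), dtype=np.uint8)  # indexed as [x, y, channel]
pixels[5, 0] = (255, 0, 0)  # red pixel at x = 5, y = 0

frame = np.transpose(pixels, axes=(1, 0, 2))  # now indexed as [y, x, channel]

print(frame.shape)  # (4, 6, 3)
print(frame[0, 5])  # [255   0   0]
```

In GridWorldEnv the window is square, so the shape stays the same, but without the swap the image would be mirrored along the diagonal.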

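Close¶

The close method should close any open resources that were used by the environment. In many cases, you don't actually have to bother to implement this method. However, in our example render_mode may be "human", and we might need to close the window that has been opened: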

def close(self):
    if self.window is not None:
        pygame.display.quit()
        pygame.quit()

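In other environments, close might also close files that were opened or release other resources. You shouldn't interact with the environment after having called close.

Registering Envs¶

In order for custom environments to be detected by Gymnasium, they must be registered as follows. We will choose to put this code in gymnasium_env/__init__.py.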

from gymnasium.envs.registration import register

register(
    id="gymnasium_env/GridWorld-v0",
    entry_point="gymnasium_env.envs:GridWorldEnv",
)

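The environment ID consists of three components, two of which are optional: an optional namespace (here: gymnasium_env), a mandatory name (here: GridWorld), and an optional but recommended version (here: v0). It could also have been registered as GridWorld-v0 (the recommended approach), GridWorld, or gymnasium_env/GridWorld, and the corresponding ID should then be used when creating the environment.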
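The three components can be made concrete with a small parser. This is an illustrative sketch only: the pattern and the function split_env_id are hypothetical and simplified, and they are not the parser Gymnasium uses internally.

```python
import re

# Simplified pattern: optional "namespace/", mandatory name, optional "-v<number>".
ENV_ID = re.compile(
    r"^(?:(?P<namespace>[\w-]+)/)?(?P<name>[\w.-]+?)(?:-v(?P<version>\d+))?$"
)


def split_env_id(env_id):
    """Split an environment ID into (namespace, name, version)."""
    match = ENV_ID.match(env_id)
    if match is None:
        raise ValueError(f"Malformed environment ID: {env_id!r}")
    version = match["version"]
    return match["namespace"], match["name"], None if version is None else int(version)


print(split_env_id("gymnasium_env/GridWorld-v0"))  # ('gymnasium_env', 'GridWorld', 0)
print(split_env_id("GridWorld-v0"))                # (None, 'GridWorld', 0)
print(split_env_id("GridWorld"))                   # (None, 'GridWorld', None)
```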

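If you additionally pass the keyword argument max_episode_steps=300 to register (the call above does not), GridWorld environments instantiated via gymnasium.make are wrapped in a TimeLimit wrapper (see the wrapper documentation for more information). An episode then ends if the agent has reached the target or if 300 steps have been executed in the current episode. To distinguish truncation from termination, check the terminated and truncated values returned by step: reaching the target sets terminated, while hitting the step limit sets truncated.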
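The two ways an episode can end under a step limit can be illustrated with a toy loop. This is a simplified stand-in written for this tutorial, not Gymnasium's TimeLimit implementation; the function run_episode and its parameter steps_to_goal are hypothetical.

```python
def run_episode(steps_to_goal, max_episode_steps):
    """Simulate an agent that would reach the target after `steps_to_goal` steps."""
    terminated = truncated = False
    elapsed = 0
    while not (terminated or truncated):
        elapsed += 1
        terminated = elapsed >= steps_to_goal     # decided by the environment
        truncated = elapsed >= max_episode_steps  # decided by the step limit
    return elapsed, terminated, truncated


print(run_episode(steps_to_goal=12, max_episode_steps=300))    # (12, True, False)
print(run_episode(steps_to_goal=1000, max_episode_steps=300))  # (300, False, True)
```

A training loop should stop the episode when either flag is set, but only terminated indicates that the task itself has ended.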

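Apart from id and entry_point, you may pass the following additional keyword arguments to register:

Name	Type	Default	Description
`reward_threshold`	`float`	`None`	The reward threshold before the task is considered solved
`nondeterministic`	`bool`	`False`	Whether this environment is non-deterministic even after seeding
`max_episode_steps`	`int`	`None`	The maximum number of steps that an episode can consist of. If not `None`, a `TimeLimit` wrapper is added
`order_enforce`	`bool`	`True`	Whether to wrap the environment in an `OrderEnforcing` wrapper
`kwargs`	`dict`	`{}`	The default keyword arguments to pass to the environment class

Most of these keywords (except for max_episode_steps, order_enforce and kwargs) do not alter the behavior of environment instances but merely provide some extra information about your environment. After registration, our custom GridWorldEnv environment can be created with env = gymnasium.make('gymnasium_env/GridWorld-v0').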

gymnasium_env/envs/__init__.py should contain:

from gymnasium_env.envs.grid_world import GridWorldEnv

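If your environment is not registered, you may optionally pass a module to import that registers your environment before creating it, like this: env = gymnasium.make('module:Env-v0'), where module contains the registration code. For the GridWorld environment, the registration code is run by importing gymnasium_env, so if it is not possible to import gymnasium_env explicitly, you can register it while making the environment with env = gymnasium.make('gymnasium_env:gymnasium_env/GridWorld-v0'). This is especially useful when you are only allowed to pass the environment ID into a third-party codebase (e.g., a learning library), because it lets you register the environment without editing the library's source code.

Creating a Package¶

The last step is to structure our code as a Python package. This involves configuring pyproject.toml. A minimal example of how to do so is as follows: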

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "gymnasium_env"
version = "0.0.1"
dependencies = [
  "gymnasium",
  "pygame==2.1.3",
  "pre-commit",
]

Creating Environment Instances¶

Now you can install your package locally with:

pip install -e .

You can then create an instance of the environment via:

# run_gymnasium_env.py

import gymnasium
import gymnasium_env
env = gymnasium.make('gymnasium_env/GridWorld-v0')

You can also pass keyword arguments of your environment's constructor to gymnasium.make to customize the environment. In our case, we could do:

env = gymnasium.make('gymnasium_env/GridWorld-v0', size=10)

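Sometimes, you may find it more convenient to skip registration and call the environment's constructor yourself. Some may find this approach more Pythonic, and environments that are instantiated like this are also perfectly fine (but remember to add wrappers as well!).

Using Wrappers¶

Oftentimes, we want to use different variants of a custom environment, or we want to modify the behavior of an environment that is provided by Gymnasium or some other party. Wrappers allow us to do this without changing the environment implementation or adding any boilerplate code. Check out the wrapper documentation for details on how to use wrappers and instructions for implementing your own. In our example, observations cannot be used directly in learning code because they are dictionaries. However, we don't actually need to touch our environment implementation to fix this! We can simply add a wrapper on top of environment instances to flatten observations into a single array: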

import gymnasium
import gymnasium_env
from gymnasium.wrappers import FlattenObservation

env = gymnasium.make('gymnasium_env/GridWorld-v0')
wrapped_env = FlattenObservation(env)
print(wrapped_env.reset())     # E.g.  (array([3, 0, 3, 3]), {'distance': 3.0})
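For this particular observation space, the result of flattening can be approximated by hand: the "agent" and "target" arrays are concatenated in sorted key order. The function below is a hypothetical illustration of that outcome only; the real wrapper handles arbitrary spaces and should be used in practice.

```python
import numpy as np


def flatten_grid_observation(observation):
    """Concatenate the dictionary entries in sorted key order ("agent", then "target")."""
    return np.concatenate([np.ravel(observation[key]) for key in sorted(observation)])


obs = {"target": np.array([3, 3]), "agent": np.array([3, 0])}
print(flatten_grid_observation(obs))  # [3 0 3 3]
```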

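Wrappers have the big advantage that they make environments highly modular. For instance, instead of flattening the observations from GridWorld, you might only want to look at the relative position of the target and the agent. In the section on observation wrappers, we have implemented a wrapper that does this job. This wrapper is also available in gymnasium_env/wrappers/relative_position.py: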

import gymnasium
import gymnasium_env
from gymnasium_env.wrappers import RelativePosition

env = gymnasium.make('gymnasium_env/GridWorld-v0')
wrapped_env = RelativePosition(env)
print(wrapped_env.reset())     # E.g.  (array([-3,  3]), {'distance': 6.0})
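The computation behind such a wrapper is a single subtraction. The sketch below isolates it as a plain function, assuming the relative position is defined as target minus agent, which matches the example output above; the function name relative_position is illustrative and not the template's wrapper class.

```python
import numpy as np


def relative_position(observation):
    """Return the vector pointing from the agent to the target."""
    return observation["target"] - observation["agent"]


obs = {"agent": np.array([3, 0]), "target": np.array([0, 3])}
print(relative_position(obs))  # [-3  3]
```

The agent has reached the target exactly when this vector is zero, so the relative position still carries everything the agent needs to solve the task.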
