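Note

This tutorial is compatible with Gymnasium version 1.3.0.

Action Masking in the Taxi Environment

This tutorial demonstrates how to use action masking in the Taxi environment to improve reinforcement learning performance by preventing invalid actions.

The Taxi environment is a classic grid-world problem in which a taxi must pick up passengers and drop them off at their destinations. In this environment, not every action is valid in every state: for example, the taxi cannot drive through walls, and it cannot pick up a passenger unless it is at the passenger's location.

Action masking is a technique that helps a reinforcement learning agent avoid invalid actions by supplying a binary mask that marks which actions are valid in the current state. This can significantly improve learning efficiency and performance.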

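Understanding the Taxi Environment

The Taxi environment has 6 possible actions:

  • 0: Move south (down)

  • 1: Move north (up)

  • 2: Move east (right)

  • 3: Move west (left)

  • 4: Pick up the passenger

  • 5: Drop off the passenger

The environment provides an action_mask in the info dictionary returned by both reset() and step(). The mask is a binary array in which 1 marks a valid action and 0 marks an invalid one.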
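To make the mask format concrete, here is a minimal sketch of how such a mask can be read. The mask values and the ACTION_NAMES list are made up for illustration; in the tutorial the real mask comes from info["action_mask"].

```python
import numpy as np

# Action indices in the order listed above (names are illustrative labels).
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]

# Hypothetical mask for illustration: only "south" and "east" are valid here.
action_mask = np.array([1, 0, 1, 0, 0, 0], dtype=np.int8)

# Indices of the actions whose mask entry is 1.
valid_actions = np.nonzero(action_mask == 1)[0]
valid_names = [ACTION_NAMES[a] for a in valid_actions]

print(valid_actions)  # [0 2]
print(valid_names)    # ['south', 'east']
```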

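How Action Masking Works

Action masking works by restricting the agent's action selection to valid actions only:

  1. During exploration: when selecting a random action, we only choose from the set of valid actions

  2. During exploitation: when selecting the best action based on Q-values, we only consider the Q-values of valid actions

  3. During the Q-learning update: we compute the maximum future Q-value over the valid actions of the next state only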
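The three steps above can be sketched as two small helpers. This is an illustrative sketch, not the tutorial's own code: the names select_action and masked_max, the local random generator, and the Q-values and mask are all assumptions chosen for the example.

```python
import numpy as np


def select_action(q_row, action_mask, epsilon, rng):
    """Epsilon-greedy choice restricted to actions whose mask entry is 1."""
    valid_actions = np.nonzero(action_mask == 1)[0]
    if len(valid_actions) == 0:
        # No valid action reported: fall back to a uniform choice over all actions.
        return int(rng.integers(len(q_row)))
    if rng.random() < epsilon:
        # Exploration: uniform choice over valid actions only.
        return int(rng.choice(valid_actions))
    # Exploitation: best Q-value among valid actions only.
    return int(valid_actions[np.argmax(q_row[valid_actions])])


def masked_max(q_row, action_mask):
    """Largest Q-value among valid actions (0.0 if none are valid)."""
    valid_actions = np.nonzero(action_mask == 1)[0]
    if len(valid_actions) == 0:
        return 0.0
    return float(np.max(q_row[valid_actions]))


rng = np.random.default_rng(0)
q_row = np.array([0.5, 2.0, -1.0, 0.0, 9.0, 9.0])   # hypothetical Q-values
mask = np.array([1, 1, 1, 1, 0, 0], dtype=np.int8)  # hypothetical mask

# With epsilon=0.0 the choice is purely greedy over the valid actions.
greedy_action = select_action(q_row, mask, epsilon=0.0, rng=rng)
print(greedy_action)            # 1
print(masked_max(q_row, mask))  # 2.0
```

Actions 4 and 5 hold the largest Q-values in this made-up row, but both helpers ignore them because the mask marks them as invalid.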

Let's implement this step by step:

import random
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np

import gymnasium as gym

# Base random seed for reproducibility
BASE_RANDOM_SEED = 58922320


def train_q_learning(
    env,
    use_action_mask: bool = True,
    episodes: int = 5000,
    seed: int = BASE_RANDOM_SEED,
    learning_rate: float = 0.1,
    discount_factor: float = 0.95,
    epsilon: float = 0.1,
) -> dict:
    """Train a Q-learning agent with or without action masking."""
    # Set random seeds for reproducibility
    np.random.seed(seed)
    random.seed(seed)

    # Initialize Q-table
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    q_table = np.zeros((n_states, n_actions))

    # Track episode rewards for analysis
    episode_rewards = []

    for episode in range(episodes):
        # Reset environment
        state, info = env.reset(seed=seed + episode)
        total_reward = 0
        done = False
        truncated = False

        while not (done or truncated):
            # Get action mask if using it
            action_mask = info["action_mask"] if use_action_mask else None

            # Epsilon-greedy action selection with masking
            if np.random.random() < epsilon:
                # Random action selection
                if use_action_mask:
                    # Only select from valid actions
                    valid_actions = np.nonzero(action_mask == 1)[0]
                    action = np.random.choice(valid_actions)
                else:
                    # Select from all actions
                    action = np.random.randint(0, n_actions)
            else:
                # Greedy action selection
                if use_action_mask:
                    # Only consider valid actions for exploitation
                    valid_actions = np.nonzero(action_mask == 1)[0]
                    if len(valid_actions) > 0:
                        action = valid_actions[np.argmax(q_table[state, valid_actions])]
                    else:
                        action = np.random.randint(0, n_actions)
                else:
                    # Consider all actions
                    action = np.argmax(q_table[state])

            # Take action and observe result
            next_state, reward, done, truncated, info = env.step(action)
            total_reward += reward

            # Q-learning update
            if done:
                # Terminal transition: no future value to bootstrap from,
                # but the final reward must still be learned
                next_max = 0.0
            elif use_action_mask:
                # Only consider valid next actions for bootstrapping
                next_mask = info["action_mask"]
                valid_next_actions = np.nonzero(next_mask == 1)[0]
                if len(valid_next_actions) > 0:
                    next_max = np.max(q_table[next_state, valid_next_actions])
                else:
                    next_max = 0.0
            else:
                # Consider all next actions
                next_max = np.max(q_table[next_state])

            # Update Q-value
            q_table[state, action] = q_table[state, action] + learning_rate * (
                reward + discount_factor * next_max - q_table[state, action]
            )

            state = next_state

        episode_rewards.append(total_reward)

    return {
        "episode_rewards": episode_rewards,
        "mean_reward": np.mean(episode_rewards),
        "std_reward": np.std(episode_rewards),
    }
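To see what masking changes in the update itself, the following sketch works through a single Q-value update by hand. All numbers are hypothetical: the next-state Q-values imitate a masked agent whose invalid actions were never tried and therefore still hold their initial value of 0.

```python
import numpy as np

learning_rate = 0.1
discount_factor = 0.95

# Hypothetical Q-values of the next state: valid actions have negative values
# from step costs, invalid actions (4 and 5) still hold the initial 0.0.
next_q = np.array([-3.0, -2.0, -4.0, -2.5, 0.0, 0.0])
next_mask = np.array([1, 1, 1, 1, 0, 0], dtype=np.int8)

current_q = 0.0  # hypothetical Q-value of the (state, action) pair being updated
reward = -1.0    # hypothetical step reward

masked_next_max = np.max(next_q[next_mask == 1])  # -2.0
unmasked_next_max = np.max(next_q)                # 0.0

masked_update = current_q + learning_rate * (
    reward + discount_factor * masked_next_max - current_q
)
unmasked_update = current_q + learning_rate * (
    reward + discount_factor * unmasked_next_max - current_q
)

print(masked_update)    # approximately -0.29
print(unmasked_update)  # approximately -0.1
```

In this made-up case the unmasked maximum is taken from actions the agent can never use, so the bootstrapped target is more optimistic than the masked one.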

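Running the Experiment

Now we run experiments to compare the performance of Q-learning agents with and without action masking. We use multiple random seeds so that the comparison is statistically robust.

Experiment setup:

  • 12 independent runs with different random seeds

  • 5000 episodes per run

  • Standard Q-learning hyperparameters (α=0.1, γ=0.95, ε=0.1)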

# Experiment parameters
n_runs = 12
episodes = 5000
learning_rate = 0.1
discount_factor = 0.95
epsilon = 0.1

# Generate different seeds for each run
seeds = [BASE_RANDOM_SEED + i for i in range(n_runs)]

# Store results for comparison
masked_results_list = []
unmasked_results_list = []

# Run experiments with different seeds
for i, seed in enumerate(seeds):
    print(f"Run {i + 1}/{n_runs} with seed {seed}")

    # Train agent WITH action masking
    env_masked = gym.make("Taxi-v4")
    masked_results = train_q_learning(
        env_masked,
        use_action_mask=True,
        seed=seed,
        learning_rate=learning_rate,
        discount_factor=discount_factor,
        epsilon=epsilon,
        episodes=episodes,
    )
    env_masked.close()
    masked_results_list.append(masked_results)

    # Train agent WITHOUT action masking
    env_unmasked = gym.make("Taxi-v4")
    unmasked_results = train_q_learning(
        env_unmasked,
        use_action_mask=False,
        seed=seed,
        learning_rate=learning_rate,
        discount_factor=discount_factor,
        epsilon=epsilon,
        episodes=episodes,
    )
    env_unmasked.close()
    unmasked_results_list.append(unmasked_results)

Visualizing the Results

After running all experiments, we compute statistics and plot the results to compare the two approaches.

# Calculate statistics across runs
masked_mean_rewards = [r["mean_reward"] for r in masked_results_list]
unmasked_mean_rewards = [r["mean_reward"] for r in unmasked_results_list]

masked_overall_mean = np.mean(masked_mean_rewards)
masked_overall_std = np.std(masked_mean_rewards)
unmasked_overall_mean = np.mean(unmasked_mean_rewards)
unmasked_overall_std = np.std(unmasked_mean_rewards)

# Create visualization
plt.figure(figsize=(12, 8), dpi=100)

# Plot individual runs with low alpha
for i, (masked_results, unmasked_results) in enumerate(
    zip(masked_results_list, unmasked_results_list, strict=True)
):
    plt.plot(
        masked_results["episode_rewards"],
        label="With Action Masking" if i == 0 else None,
        color="blue",
        alpha=0.1,
    )
    plt.plot(
        unmasked_results["episode_rewards"],
        label="Without Action Masking" if i == 0 else None,
        color="red",
        alpha=0.1,
    )

# Calculate and plot mean curves across all runs
masked_mean_curve = np.mean([r["episode_rewards"] for r in masked_results_list], axis=0)
unmasked_mean_curve = np.mean(
    [r["episode_rewards"] for r in unmasked_results_list], axis=0
)

plt.plot(
    masked_mean_curve, label="With Action Masking (Mean)", color="blue", linewidth=2
)
plt.plot(
    unmasked_mean_curve,
    label="Without Action Masking (Mean)",
    color="red",
    linewidth=2,
)

plt.xlabel("Episode")
plt.ylabel("Total Reward")
plt.title("Training Performance: Q-Learning with vs without Action Masking")
plt.legend()
plt.grid(True, alpha=0.3)

# Save the figure
savefig_folder = Path("_static/img/tutorials/")
savefig_folder.mkdir(parents=True, exist_ok=True)
plt.savefig(
    savefig_folder / "taxi_v3_action_masking_comparison.png",
    bbox_inches="tight",
    dpi=150,
)
plt.show()
[Figure: Training Performance: Q-Learning with vs without Action Masking (total reward per episode)]
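Per-episode rewards are noisy, so a numeric summary and a smoothed curve can make the comparison easier to read. The sketch below shows one possible way to do this. The helpers summarize and moving_average are hypothetical names, and the data is a made-up stand-in for results lists such as masked_results_list.

```python
import numpy as np


def moving_average(values, window):
    """Mean of each consecutive window-sized slice of values."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")


def summarize(results_list):
    """Mean and (population) standard deviation of the per-run mean rewards."""
    means = np.array([r["mean_reward"] for r in results_list])
    return float(means.mean()), float(means.std())


# Hypothetical stand-in data; real entries come from train_q_learning.
demo_results = [{"mean_reward": -10.0}, {"mean_reward": -14.0}]
demo_mean, demo_std = summarize(demo_results)
print(f"mean reward: {demo_mean:.2f} ± {demo_std:.2f}")  # mean reward: -12.00 ± 2.00

# Smoothing a short made-up reward sequence with a window of 3.
smoothed = moving_average(np.array([1.0, 2.0, 3.0, 4.0, 5.0]), window=3)
```

With mode="valid" the smoothed curve is shorter than the input by window - 1 points, which matters when plotting it against episode numbers.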

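Results Analysis

The comparison highlights several important advantages of action masking:

Key Benefits of Action Masking

  1. Faster convergence: agents that use action masking typically learn faster because they do not waste time exploring invalid actions.

  2. Better performance: by focusing only on valid actions, the agent achieves higher rewards more consistently.

  3. More stable learning: action masking removes the randomness caused by selecting invalid actions, which reduces variance during learning.

  4. Practical applicability: in real-world settings, action masking can prevent the agent from taking actions that are dangerous or impossible to execute.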

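Reminder of Key Implementation Details

  • Action selection: we use np.nonzero(action_mask == 1)[0] to filter the available actions and keep only the valid ones

  • Q-value update: when computing the maximum future Q-value, we only consider the valid actions of the next state

  • Exploration: random action selection is restricted to the set of valid actions

This approach ensures that the agent never selects an invalid action, while still maintaining the balance between exploration and exploitation that effective learning requires.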
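As an aside, the index-based filtering used in this tutorial is not the only way to apply a mask. A common alternative, sketched below under assumed names and made-up values, replaces the Q-values of invalid actions with negative infinity before taking the argmax. It is not the tutorial's method, only an equivalent formulation for the greedy step.

```python
import numpy as np


def masked_argmax(q_row, action_mask):
    """Greedy action after replacing invalid actions' Q-values with -inf."""
    masked_q = np.where(action_mask == 1, q_row, -np.inf)
    # Caveat: if every action is masked out, argmax returns index 0.
    return int(np.argmax(masked_q))


q_row = np.array([0.5, 2.0, -1.0, 0.0, 9.0, 9.0])   # hypothetical Q-values
mask = np.array([1, 1, 1, 1, 0, 0], dtype=np.int8)  # hypothetical mask

# Index-based filtering, as used in the tutorial code.
valid_actions = np.nonzero(mask == 1)[0]
index_based = int(valid_actions[np.argmax(q_row[valid_actions])])

print(masked_argmax(q_row, mask))  # 1
print(index_based)                 # 1
```

Both forms pick the same greedy action whenever at least one action is valid. The -inf form keeps the array shape fixed, which can be convenient for batched or vectorized code.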