Recording Agents

Why record your agents?

Recording agent behavior serves several important purposes in reinforcement learning development:

🎥 Visual understanding: see exactly what your agent is doing. Sometimes a 10-second video reveals problems that hours of staring at reward plots cannot.

📊 Performance tracking: collect systematic data on episode rewards, lengths, and timing to understand training progress.

🐛 Debugging: identify specific failure modes, unusual behaviors, or environments where your agent struggles.

📈 Evaluation: objectively compare different training runs, algorithms, or hyperparameters.

🎓 Communication: share results with collaborators, include them in papers, or create educational content.

When to record

During evaluation (record every episode):

  • Testing a trained agent's final performance

  • Creating demonstration videos

  • Analyzing specific behaviors in detail

During training (record periodically):

  • Monitoring learning progress over time

  • Catching training problems early

  • Creating time-lapse videos of learning

Gymnasium provides two essential wrappers for recording: RecordEpisodeStatistics for numerical data and RecordVideo for visual recording. The former tracks episode metrics such as total reward, episode length, and elapsed time. The latter generates MP4 videos of agent behavior using environment renderings.

We will show how to use these wrappers in two common scenarios: recording data for every episode (typical during evaluation) and recording data periodically (during training).

Recording every episode (evaluation)

When evaluating a trained agent, you typically want to record several episodes to understand average performance and consistency. Here is how to set that up with RecordEpisodeStatistics and RecordVideo.

import gymnasium as gym
from gymnasium.wrappers import RecordEpisodeStatistics, RecordVideo
import numpy as np

# Configuration
num_eval_episodes = 4
env_name = "CartPole-v1"  # Replace with your environment

# Create environment with recording capabilities
env = gym.make(env_name, render_mode="rgb_array")  # rgb_array needed for video recording

# Add video recording for every episode
env = RecordVideo(
    env,
    video_folder="cartpole-agent",    # Folder to save videos
    name_prefix="eval",               # Prefix for video filenames
    episode_trigger=lambda x: True    # Record every episode
)

# Add episode statistics tracking
env = RecordEpisodeStatistics(env, buffer_length=num_eval_episodes)

print(f"Starting evaluation for {num_eval_episodes} episodes...")
print(f"Videos will be saved to: cartpole-agent/")

for episode_num in range(num_eval_episodes):
    obs, info = env.reset()
    episode_reward = 0
    step_count = 0

    episode_over = False
    while not episode_over:
        # Replace this with your trained agent's policy
        action = env.action_space.sample()  # Random policy for demonstration

        obs, reward, terminated, truncated, info = env.step(action)
        episode_reward += reward
        step_count += 1

        episode_over = terminated or truncated

    print(f"Episode {episode_num + 1}: {step_count} steps, reward = {episode_reward}")

env.close()

# Print summary statistics
print(f'\nEvaluation Summary:')
print(f'Episode durations: {list(env.time_queue)}')
print(f'Episode rewards: {list(env.return_queue)}')
print(f'Episode lengths: {list(env.length_queue)}')

# Calculate some useful metrics
avg_reward = np.mean(env.return_queue)
avg_length = np.mean(env.length_queue)
std_reward = np.std(env.return_queue)

print(f'\nAverage reward: {avg_reward:.2f} ± {std_reward:.2f}')
print(f'Average episode length: {avg_length:.1f} steps')
# "Success" is environment-specific; r > 0 is trivially true for CartPole, so pick a threshold that fits your task
print(f'Success rate: {sum(1 for r in env.return_queue if r > 0) / len(env.return_queue):.1%}')

Understanding the output

After running this code, you will find:

Video files: cartpole-agent/eval-episode-0.mp4, eval-episode-1.mp4, and so on.

  • Each file shows one complete episode from start to finish

  • Useful for seeing exactly how your agent behaves

  • Can be shared, embedded in presentations, or analyzed frame by frame

Console output: per-episode performance plus summary statistics

Episode 1: 23 steps, reward = 23.0
Episode 2: 15 steps, reward = 15.0
Episode 3: 200 steps, reward = 200.0
Episode 4: 67 steps, reward = 67.0

Average reward: 76.25 ± 74.14
Average episode length: 76.2 steps
Success rate: 100.0%

Statistics queues: timing, reward, and length data for each episode

  • env.time_queue: wall-clock time taken per episode

  • env.return_queue: total reward per episode

  • env.length_queue: number of steps per episode

In the script above, the RecordVideo wrapper saves videos to the specified folder with filenames like "eval-episode-0.mp4". The episode_trigger=lambda x: True ensures that every episode is recorded.

The RecordEpisodeStatistics wrapper tracks performance metrics in internal queues, which we can access after the evaluation to compute averages and other statistics.

For computational efficiency during evaluation, this can be implemented with vectorized environments to evaluate N episodes in parallel rather than sequentially, as sketched below.
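
A minimal sketch of that idea, assuming Gymnasium 1.0+ (where gym.make_vec and the gymnasium.wrappers.vector module are available); the vector version of RecordEpisodeStatistics aggregates statistics across all sub-environments:

import numpy as np
import gymnasium as gym
from gymnasium.wrappers.vector import RecordEpisodeStatistics

num_envs = 4

# make_vec steps N copies of the environment in lockstep
envs = gym.make_vec("CartPole-v1", num_envs=num_envs, vectorization_mode="sync")
envs = RecordEpisodeStatistics(envs, buffer_length=num_envs)

obs, info = envs.reset(seed=42)
episodes_finished = 0
while episodes_finished < num_envs:
    actions = envs.action_space.sample()  # replace with your trained policy
    obs, rewards, terminations, truncations, infos = envs.step(actions)
    # sub-environments auto-reset, so we just count episodes as they end
    episodes_finished += int(np.sum(terminations | truncations))

envs.close()
print(f'Episode rewards: {list(envs.return_queue)}')

Note that RecordVideo is omitted here: rendering N environments at once is exactly the overhead that vectorized evaluation tries to avoid.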

Recording during training (periodic)

During training you will run hundreds or thousands of episodes, so recording every one is impractical. Instead, record periodically to track learning progress.

import logging
import gymnasium as gym
from gymnasium.wrappers import RecordEpisodeStatistics, RecordVideo

# Training configuration
training_period = 250           # Record video every 250 episodes
num_training_episodes = 10_000  # Total training episodes
env_name = "CartPole-v1"

# Set up logging for episode statistics
logging.basicConfig(level=logging.INFO, format='%(message)s')

# Create environment with periodic video recording
env = gym.make(env_name, render_mode="rgb_array")

# Record videos periodically (every 250 episodes)
env = RecordVideo(
    env,
    video_folder="cartpole-training",
    name_prefix="training",
    episode_trigger=lambda x: x % training_period == 0  # Only record every 250th episode
)

# Track statistics for every episode (lightweight)
env = RecordEpisodeStatistics(env)

print(f"Starting training for {num_training_episodes} episodes")
print(f"Videos will be recorded every {training_period} episodes")
print(f"Videos saved to: cartpole-training/")

for episode_num in range(num_training_episodes):
    obs, info = env.reset()
    episode_over = False

    while not episode_over:
        # Replace with your actual training agent
        action = env.action_space.sample()  # Random policy for demonstration
        obs, reward, terminated, truncated, info = env.step(action)
        episode_over = terminated or truncated

    # Log episode statistics (available in info after episode ends)
    if "episode" in info:
        episode_data = info["episode"]
        logging.info(f"Episode {episode_num}: "
                    f"reward={episode_data['r']:.1f}, "
                    f"length={episode_data['l']}, "
                    f"time={episode_data['t']:.2f}s")

        # Additional analysis for milestone episodes
        if episode_num % 1000 == 0:
            # Look at recent performance (last 100 episodes)
            recent_rewards = list(env.return_queue)[-100:]
            if recent_rewards:
                avg_recent = sum(recent_rewards) / len(recent_rewards)
                print(f"  -> Average reward over last 100 episodes: {avg_recent:.1f}")

env.close()

Benefits of recording during training

Progress videos: watch your agent improve over time

  • training-episode-0.mp4: random initial behavior

  • training-episode-250.mp4: some patterns starting to emerge

  • training-episode-500.mp4: clear improvement

  • training-episode-1000.mp4: competent performance

Learning curves: plot per-episode statistics over time

import matplotlib.pyplot as plt

# Plot learning progress
# Note: RecordEpisodeStatistics keeps only the most recent buffer_length
# episodes (100 by default); pass a larger buffer_length to plot a full run
episodes = range(len(env.return_queue))
rewards = list(env.return_queue)

plt.figure(figsize=(10, 6))
plt.plot(episodes, rewards, alpha=0.3, label='Episode Rewards')

# Add moving average for clearer trend
window = 100
if len(rewards) > window:
    moving_avg = [sum(rewards[i:i+window])/window
                  for i in range(len(rewards)-window+1)]
    plt.plot(range(window-1, len(rewards)), moving_avg,
             label=f'{window}-Episode Moving Average', linewidth=2)

plt.xlabel('Episode')
plt.ylabel('Reward')
plt.title('Learning Progress')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Integration with experiment tracking

For more sophisticated projects, integrate with experiment tracking tools:

# Example with Weights & Biases (wandb)
import os  # needed below to check that the video file exists
import wandb

# Initialize experiment tracking
wandb.init(project="cartpole-training", name="q-learning-run-1")

# Log episode statistics
for episode_num in range(num_training_episodes):
    # ... training code ...

    if "episode" in info:
        episode_data = info["episode"]
        wandb.log({
            "episode": episode_num,
            "reward": episode_data['r'],
            "length": episode_data['l'],
            "episode_time": episode_data['t']
        })

        # Upload videos periodically
        if episode_num % training_period == 0:
            video_path = f"cartpole-training/training-episode-{episode_num}.mp4"
            if os.path.exists(video_path):
                wandb.log({"training_video": wandb.Video(video_path)})

Best practices summary

For evaluation:

  • Record every episode for a complete picture of performance

  • Use multiple seeds for statistical significance

  • Save both videos and numerical data

  • Compute confidence intervals for your metrics (sketched below)
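
For the last point, a minimal sketch assuming env is still wrapped in RecordEpisodeStatistics; it uses a normal approximation, which is reasonable for dozens of episodes or more (with very few episodes, prefer a t-interval or bootstrap):

import numpy as np

returns = np.asarray(env.return_queue, dtype=np.float64)
mean = returns.mean()
sem = returns.std(ddof=1) / np.sqrt(len(returns))  # standard error of the mean
print(f"Mean return: {mean:.2f}, 95% CI: [{mean - 1.96*sem:.2f}, {mean + 1.96*sem:.2f}]")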

For training:

  • Record periodically (every 100-1000 episodes)

  • Focus on episode statistics rather than videos during training

  • Use adaptive recording triggers for interesting episodes (see the sketch after this list)

  • Monitor memory usage for long training runs
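
An adaptive trigger is just a function from episode index to bool. Here is a sketch that records densely early in training, when behavior changes quickly, and sparsely later (the thresholds are illustrative, not a Gymnasium convention):

from gymnasium.wrappers import RecordVideo

def adaptive_trigger(episode_id: int) -> bool:
    if episode_id < 1_000:
        return episode_id % 100 == 0   # early training: every 100th episode
    return episode_id % 1_000 == 0     # later: every 1000th episode

env = RecordVideo(env, video_folder="cartpole-training",
                  name_prefix="training", episode_trigger=adaptive_trigger)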

For analysis:

  • Use moving averages to smooth noisy learning curves

  • Look for patterns in both successful and failed episodes

  • Compare agent behavior at different stages of training

  • Save raw data for later analysis and comparison (a sketch follows this list)
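
A sketch of dumping the raw per-episode statistics to disk; the file name and layout here are arbitrary choices, not a Gymnasium convention:

import json
import time

run_data = {
    "env": "CartPole-v1",
    "recorded_at": time.time(),
    "returns": [float(r) for r in env.return_queue],
    "lengths": [int(l) for l in env.length_queue],
    "times": [float(t) for t in env.time_queue],
}
with open("run_stats.json", "w") as f:
    json.dump(run_data, f, indent=2)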

More information

Recording agent behavior is an essential skill for reinforcement learning practitioners. It helps you understand what your agent has actually learned, debug training problems, and communicate results effectively. Start with a simple recording setup, and add more sophisticated analysis as your project grows in complexity!