Recording Agents

Why record your agents?

Recording agent behavior serves several important purposes in reinforcement learning development:

🎥 Visual understanding: see exactly what your agent is doing. Sometimes a 10-second video reveals problems that hours of staring at reward plots cannot.

📊 Performance tracking: collect systematic data on episode rewards, lengths, and timing to understand training progress.

🐛 Debugging: identify specific failure modes, unusual behaviors, or environments where your agent struggles.

📈 Evaluation: objectively compare different training runs, algorithms, or hyperparameters.

🎓 Communication: share results with collaborators, include them in papers, or create educational content.

When to record

During evaluation (record every episode):

  • Testing a trained agent's final performance

  • Creating demonstration videos

  • Analyzing specific behaviors in detail

During training (record periodically):

  • Monitoring learning progress over time

  • Catching training problems early

  • Creating time-lapse videos of learning

Gymnasium provides two essential wrappers for recording: RecordEpisodeStatistics for numerical data and RecordVideo for visual recording. The former tracks episode metrics such as total reward, episode length, and elapsed time. The latter generates MP4 videos of agent behavior using environment renderings.

We will show how to use these wrappers in two common scenarios: recording data for every episode (typical during evaluation) and recording data periodically (during training).

Recording every episode (evaluation)

When evaluating a trained agent, you typically want to record several episodes to understand average performance and consistency. Here is how to set that up with RecordEpisodeStatistics and RecordVideo.

import gymnasium as gym
from gymnasium.wrappers import RecordEpisodeStatistics, RecordVideo
import numpy as np

# Configuration
num_eval_episodes = 4
env_name = "CartPole-v1"  # Replace with your environment

# Create environment with recording capabilities
env = gym.make(env_name, render_mode="rgb_array")  # rgb_array needed for video recording

# Add video recording for every episode
env = RecordVideo(
    env,
    video_folder="cartpole-agent",    # Folder to save videos
    name_prefix="eval",               # Prefix for video filenames
    episode_trigger=lambda x: True    # Record every episode
)

# Add episode statistics tracking
env = RecordEpisodeStatistics(env, buffer_length=num_eval_episodes)

print(f"Starting evaluation for {num_eval_episodes} episodes...")
print(f"Videos will be saved to: cartpole-agent/")

for episode_num in range(num_eval_episodes):
    obs, info = env.reset()
    episode_reward = 0
    step_count = 0

    episode_over = False
    while not episode_over:
        # Replace this with your trained agent's policy
        action = env.action_space.sample()  # Random policy for demonstration

        obs, reward, terminated, truncated, info = env.step(action)
        episode_reward += reward
        step_count += 1

        episode_over = terminated or truncated

    print(f"Episode {episode_num + 1}: {step_count} steps, reward = {episode_reward}")

env.close()

# Print summary statistics
print(f'\nEvaluation Summary:')
print(f'Episode durations: {list(env.time_queue)}')
print(f'Episode rewards: {list(env.return_queue)}')
print(f'Episode lengths: {list(env.length_queue)}')

# Calculate some useful metrics
avg_reward = np.mean(env.return_queue)
avg_length = np.mean(env.length_queue)
std_reward = np.std(env.return_queue)

print(f'\nAverage reward: {avg_reward:.2f} ± {std_reward:.2f}')
print(f'Average episode length: {avg_length:.1f} steps')
# "Success" is environment-specific; r > 0 is trivially true for CartPole, so pick a threshold that fits your task
print(f'Success rate: {sum(1 for r in env.return_queue if r > 0) / len(env.return_queue):.1%}')

Understanding the output

After running this code, you will find:

Video files: cartpole-agent/eval-episode-0.mp4, eval-episode-1.mp4, and so on.

  • Each file shows one complete episode from start to finish

  • Useful for seeing exactly how your agent behaves

  • Can be shared, embedded in presentations, or analyzed frame by frame

Console output: per-episode performance plus summary statistics

Episode 1: 23 steps, reward = 23.0
Episode 2: 15 steps, reward = 15.0
Episode 3: 200 steps, reward = 200.0
Episode 4: 67 steps, reward = 67.0

Average reward: 76.25 ± 74.14
Average episode length: 76.2 steps
Success rate: 100.0%

Statistics queues: timing, reward, and length data for each episode

  • env.time_queue: wall-clock time taken per episode

  • env.return_queue: total reward per episode

  • env.length_queue: number of steps per episode

In the script above, the RecordVideo wrapper saves videos to the specified folder with filenames like "eval-episode-0.mp4". The episode_trigger=lambda x: True ensures that every episode is recorded.

The RecordEpisodeStatistics wrapper tracks performance metrics in internal queues, which we can access after the evaluation to compute averages and other statistics.

For computational efficiency during evaluation, this can be implemented with vectorized environments to evaluate N episodes in parallel rather than sequentially, as sketched below.
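
A minimal sketch of that idea, assuming Gymnasium 1.0+ (where gym.make_vec and the gymnasium.wrappers.vector module are available); the vector version of RecordEpisodeStatistics aggregates statistics across all sub-environments:

import numpy as np
import gymnasium as gym
from gymnasium.wrappers.vector import RecordEpisodeStatistics

num_envs = 4

# make_vec steps N copies of the environment in lockstep
envs = gym.make_vec("CartPole-v1", num_envs=num_envs, vectorization_mode="sync")
envs = RecordEpisodeStatistics(envs, buffer_length=num_envs)

obs, info = envs.reset(seed=42)
episodes_finished = 0
while episodes_finished < num_envs:
    actions = envs.action_space.sample()  # replace with your trained policy
    obs, rewards, terminations, truncations, infos = envs.step(actions)
    # sub-environments auto-reset, so we just count episodes as they end
    episodes_finished += int(np.sum(terminations | truncations))

envs.close()
print(f'Episode rewards: {list(envs.return_queue)}')

Note that RecordVideo is omitted here: rendering N environments at once is exactly the overhead that vectorized evaluation tries to avoid.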

Recording during training (periodic)

During training you will run hundreds or thousands of episodes, so recording every one is impractical. Instead, record periodically to track learning progress.

import logging
import gymnasium as gym
from gymnasium.wrappers import RecordEpisodeStatistics, RecordVideo

# Training configuration
training_period = 250           # Record video every 250 episodes
num_training_episodes = 10_000  # Total training episodes
env_name = "CartPole-v1"

# Set up logging for episode statistics
logging.basicConfig(level=logging.INFO, format='%(message)s')

# Create environment with periodic video recording
env = gym.make(env_name, render_mode="rgb_array")

# Record videos periodically (every 250 episodes)
env = RecordVideo(
    env,
    video_folder="cartpole-training",
    name_prefix="training",
    episode_trigger=lambda x: x % training_period == 0  # Only record every 250th episode
)

# Track statistics for every episode (lightweight)
env = RecordEpisodeStatistics(env)

print(f"Starting training for {num_training_episodes} episodes")
print(f"Videos will be recorded every {training_period} episodes")
print(f"Videos saved to: cartpole-training/")

for episode_num in range(num_training_episodes):
    obs, info = env.reset()
    episode_over = False

    while not episode_over:
        # Replace with your actual training agent
        action = env.action_space.sample()  # Random policy for demonstration
        obs, reward, terminated, truncated, info = env.step(action)
        episode_over = terminated or truncated

    # Log episode statistics (available in info after episode ends)
    if "episode" in info:
        episode_data = info["episode"]
        logging.info(f"Episode {episode_num}: "
                    f"reward={episode_data['r']:.1f}, "
                    f"length={episode_data['l']}, "
                    f"time={episode_data['t']:.2f}s")

        # Additional analysis for milestone episodes
        if episode_num % 1000 == 0:
            # Look at recent performance (last 100 episodes)
            recent_rewards = list(env.return_queue)[-100:]
            if recent_rewards:
                avg_recent = sum(recent_rewards) / len(recent_rewards)
                print(f"  -> Average reward over last 100 episodes: {avg_recent:.1f}")

env.close()

Benefits of recording during training

Progress videos: watch your agent improve over time

  • training-episode-0.mp4: random initial behavior

  • training-episode-250.mp4: some patterns starting to emerge

  • training-episode-500.mp4: clear improvement

  • training-episode-1000.mp4: competent performance

Learning curves: plot per-episode statistics over time

import matplotlib.pyplot as plt

# Plot learning progress
# Note: RecordEpisodeStatistics keeps only the most recent buffer_length
# episodes (100 by default); pass a larger buffer_length to plot a full run
episodes = range(len(env.return_queue))
rewards = list(env.return_queue)

plt.figure(figsize=(10, 6))
plt.plot(episodes, rewards, alpha=0.3, label='Episode Rewards')

# Add moving average for clearer trend
window = 100
if len(rewards) > window:
    moving_avg = [sum(rewards[i:i+window])/window
                  for i in range(len(rewards)-window+1)]
    plt.plot(range(window-1, len(rewards)), moving_avg,
             label=f'{window}-Episode Moving Average', linewidth=2)

plt.xlabel('Episode')
plt.ylabel('Reward')
plt.title('Learning Progress')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Integration with experiment tracking

For more sophisticated projects, integrate with experiment tracking tools:

# Example with Weights & Biases (wandb)
import os  # needed below to check that the video file exists
import wandb

# Initialize experiment tracking
wandb.init(project="cartpole-training", name="q-learning-run-1")

# Log episode statistics
for episode_num in range(num_training_episodes):
    # ... training code ...

    if "episode" in info:
        episode_data = info["episode"]
        wandb.log({
            "episode": episode_num,
            "reward": episode_data['r'],
            "length": episode_data['l'],
            "episode_time": episode_data['t']
        })

        # Upload videos periodically
        if episode_num % training_period == 0:
            video_path = f"cartpole-training/training-episode-{episode_num}.mp4"
            if os.path.exists(video_path):
                wandb.log({"training_video": wandb.Video(video_path)})

Best practices summary

For evaluation:

  • Record every episode for a complete picture of performance

  • Use multiple seeds for statistical significance

  • Save both videos and numerical data

  • Compute confidence intervals for your metrics (sketched below)
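
For the last point, a minimal sketch assuming env is still wrapped in RecordEpisodeStatistics; it uses a normal approximation, which is reasonable for dozens of episodes or more (with very few episodes, prefer a t-interval or bootstrap):

import numpy as np

returns = np.asarray(env.return_queue, dtype=np.float64)
mean = returns.mean()
sem = returns.std(ddof=1) / np.sqrt(len(returns))  # standard error of the mean
print(f"Mean return: {mean:.2f}, 95% CI: [{mean - 1.96*sem:.2f}, {mean + 1.96*sem:.2f}]")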

For training:

  • Record periodically (every 100-1000 episodes)

  • Focus on episode statistics rather than videos during training

  • Use adaptive recording triggers for interesting episodes (see the sketch after this list)

  • Monitor memory usage for long training runs
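
An adaptive trigger is just a function from episode index to bool. Here is a sketch that records densely early in training, when behavior changes quickly, and sparsely later (the thresholds are illustrative, not a Gymnasium convention):

from gymnasium.wrappers import RecordVideo

def adaptive_trigger(episode_id: int) -> bool:
    if episode_id < 1_000:
        return episode_id % 100 == 0   # early training: every 100th episode
    return episode_id % 1_000 == 0     # later: every 1000th episode

env = RecordVideo(env, video_folder="cartpole-training",
                  name_prefix="training", episode_trigger=adaptive_trigger)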

For analysis:

  • Use moving averages to smooth noisy learning curves

  • Look for patterns in both successful and failed episodes

  • Compare agent behavior at different stages of training

  • Save raw data for later analysis and comparison (a sketch follows this list)
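
A sketch of dumping the raw per-episode statistics to disk; the file name and layout here are arbitrary choices, not a Gymnasium convention:

import json
import time

run_data = {
    "env": "CartPole-v1",
    "recorded_at": time.time(),
    "returns": [float(r) for r in env.return_queue],
    "lengths": [int(l) for l in env.length_queue],
    "times": [float(t) for t in env.time_queue],
}
with open("run_stats.json", "w") as f:
    json.dump(run_data, f, indent=2)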

More information

Recording agent behavior is an essential skill for reinforcement learning practitioners. It helps you understand what your agent has actually learned, debug training problems, and communicate results effectively. Start with a simple recording setup, and add more sophisticated analysis as your project grows in complexity!