Recording Agents¶
Why Record Your Agents?¶
Recording agent behavior serves several important purposes in reinforcement learning development:
🎥 Visual understanding: See exactly what your agent is doing; sometimes a 10-second video reveals problems that hours of staring at reward plots cannot.
📊 Performance tracking: Collect systematic data on episode rewards, durations, and timing to understand training progress.
🐛 Debugging: Identify specific failure modes, unusual behaviors, or environments where your agent struggles.
📈 Evaluation: Objectively compare different training runs, algorithms, or hyperparameters.
🎓 Communication: Share results with collaborators, include them in papers, or create educational content.
When to Record¶
During evaluation (record every episode):
Testing the final performance of a trained agent
Creating demonstration videos
Detailed analysis of specific behaviors
During training (record periodically):
Monitoring learning progress over time
Catching training problems early
Creating time-lapse videos of learning
Gymnasium provides two essential wrappers for recording: RecordEpisodeStatistics for numerical data and RecordVideo for video files. The former tracks episode metrics such as total reward, episode length, and elapsed time. The latter uses the environment's rendering to generate MP4 videos of agent behavior.
We will show how to use these wrappers in two common scenarios: recording data for every episode (typically during evaluation), and recording data periodically (during training).
Recording Every Episode (Evaluation)¶
When evaluating a trained agent, you typically want to record several episodes to understand average performance and consistency. Here is how to set that up with RecordEpisodeStatistics and RecordVideo.
import gymnasium as gym
from gymnasium.wrappers import RecordEpisodeStatistics, RecordVideo
import numpy as np
# Configuration
num_eval_episodes = 4
env_name = "CartPole-v1" # Replace with your environment
# Create environment with recording capabilities
env = gym.make(env_name, render_mode="rgb_array") # rgb_array needed for video recording
# Add video recording for every episode
env = RecordVideo(
    env,
    video_folder="cartpole-agent",   # Folder to save videos
    name_prefix="eval",              # Prefix for video filenames
    episode_trigger=lambda x: True,  # Record every episode
)
# Add episode statistics tracking
env = RecordEpisodeStatistics(env, buffer_length=num_eval_episodes)
print(f"Starting evaluation for {num_eval_episodes} episodes...")
print(f"Videos will be saved to: cartpole-agent/")
for episode_num in range(num_eval_episodes):
    obs, info = env.reset()
    episode_reward = 0
    step_count = 0
    episode_over = False

    while not episode_over:
        # Replace this with your trained agent's policy
        action = env.action_space.sample()  # Random policy for demonstration

        obs, reward, terminated, truncated, info = env.step(action)
        episode_reward += reward
        step_count += 1
        episode_over = terminated or truncated

    print(f"Episode {episode_num + 1}: {step_count} steps, reward = {episode_reward}")
env.close()
# Print summary statistics
print(f'\nEvaluation Summary:')
print(f'Episode durations: {list(env.time_queue)}')
print(f'Episode rewards: {list(env.return_queue)}')
print(f'Episode lengths: {list(env.length_queue)}')
# Calculate some useful metrics
avg_reward = np.mean(env.return_queue)
avg_length = np.mean(env.length_queue)
std_reward = np.std(env.return_queue)

print(f'\nAverage reward: {avg_reward:.2f} ± {std_reward:.2f}')
print(f'Average episode length: {avg_length:.1f} steps')
# Note: every CartPole return is positive, so this "success rate" is always
# 100% here; substitute a threshold that is meaningful for your environment
print(f'Success rate: {sum(1 for r in env.return_queue if r > 0) / len(env.return_queue):.1%}')
Understanding the Output¶
After running this code, you will find:
Video files: cartpole-agent/eval-episode-0.mp4, eval-episode-1.mp4, and so on.
Each file shows one complete episode from start to finish
Helps you see exactly how your agent behaves
Can be shared, embedded in presentations, or analyzed frame by frame
Console output: per-episode performance plus summary statistics
Episode 1: 23 steps, reward = 23.0
Episode 2: 15 steps, reward = 15.0
Episode 3: 200 steps, reward = 200.0
Episode 4: 67 steps, reward = 67.0
Average reward: 76.25 ± 74.14
Average episode length: 76.2 steps
Success rate: 100.0%
Statistics queues: per-episode timing, reward, and length data
env.time_queue: wall-clock duration of each episode
env.return_queue: total reward of each episode
env.length_queue: number of steps in each episode
In the script above, the RecordVideo wrapper saves videos to the specified folder with filenames like "eval-episode-0.mp4". The episode_trigger=lambda x: True ensures that every episode is recorded.
The RecordEpisodeStatistics wrapper tracks performance metrics in internal queues, which we can access after evaluation to compute averages and other statistics.
For computational efficiency during evaluation, this can be implemented with vector environments so that N episodes are evaluated in parallel rather than sequentially, as in the sketch below.
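A minimal sketch of that idea, assuming Gymnasium 1.0+ where gym.make_vec and the vector-environment variant of the statistics wrapper (gymnasium.wrappers.vector.RecordEpisodeStatistics) are available; the episode count and seed here are arbitrary choices:

import gymnasium as gym
import numpy as np
from gymnasium.wrappers.vector import RecordEpisodeStatistics

num_envs = 4             # episodes run in parallel
target_episodes = 100    # stop once this many episodes have finished

# Vectorized CartPole with per-episode statistics tracking
envs = gym.make_vec("CartPole-v1", num_envs=num_envs)
envs = RecordEpisodeStatistics(envs, buffer_length=target_episodes)

obs, info = envs.reset(seed=42)
finished = 0
while finished < target_episodes:
    actions = envs.action_space.sample()  # replace with your trained policy
    obs, rewards, terminations, truncations, infos = envs.step(actions)
    # Sub-environments reset automatically, so we only count finished episodes
    finished += int(np.sum(terminations | truncations))
envs.close()

print(f"Mean return over {len(envs.return_queue)} episodes: "
      f"{np.mean(envs.return_queue):.2f}")

Note that RecordVideo is deliberately omitted here: rendering many parallel environments is rarely worth the overhead, so video recording is best kept in the single-environment evaluation loop above.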
Recording During Training (Periodic)¶
During training you will run hundreds or thousands of episodes, so recording every one is impractical. Instead, record periodically to track learning progress.
import logging
import gymnasium as gym
from gymnasium.wrappers import RecordEpisodeStatistics, RecordVideo
# Training configuration
training_period = 250 # Record video every 250 episodes
num_training_episodes = 10_000 # Total training episodes
env_name = "CartPole-v1"
# Set up logging for episode statistics
logging.basicConfig(level=logging.INFO, format='%(message)s')
# Create environment with periodic video recording
env = gym.make(env_name, render_mode="rgb_array")
# Record videos periodically (every 250 episodes)
env = RecordVideo(
    env,
    video_folder="cartpole-training",
    name_prefix="training",
    episode_trigger=lambda x: x % training_period == 0,  # Only record every 250th episode
)
# Track statistics for every episode (lightweight)
env = RecordEpisodeStatistics(env)
print(f"Starting training for {num_training_episodes} episodes")
print(f"Videos will be recorded every {training_period} episodes")
print(f"Videos saved to: cartpole-training/")
for episode_num in range(num_training_episodes):
    obs, info = env.reset()
    episode_over = False

    while not episode_over:
        # Replace with your actual training agent
        action = env.action_space.sample()  # Random policy for demonstration
        obs, reward, terminated, truncated, info = env.step(action)
        episode_over = terminated or truncated

    # Log episode statistics (available in info after the episode ends)
    if "episode" in info:
        episode_data = info["episode"]
        logging.info(f"Episode {episode_num}: "
                     f"reward={episode_data['r']:.1f}, "
                     f"length={episode_data['l']}, "
                     f"time={episode_data['t']:.2f}s")

    # Additional analysis for milestone episodes
    if episode_num % 1000 == 0:
        # Look at recent performance (last 100 episodes)
        recent_rewards = list(env.return_queue)[-100:]
        if recent_rewards:
            avg_recent = sum(recent_rewards) / len(recent_rewards)
            print(f"  -> Average reward over last 100 episodes: {avg_recent:.1f}")
env.close()
Benefits of Recording During Training¶
Progress videos: watch your agent improve over time
training-episode-0.mp4: random initial behavior
training-episode-250.mp4: some patterns start to emerge
training-episode-500.mp4: clear improvement
training-episode-1000.mp4: strong performance
Learning curves: plot per-episode statistics over time
import matplotlib.pyplot as plt
# Plot learning progress
episodes = range(len(env.return_queue))
rewards = list(env.return_queue)
plt.figure(figsize=(10, 6))
plt.plot(episodes, rewards, alpha=0.3, label='Episode Rewards')
# Add moving average for clearer trend
window = 100
if len(rewards) > window:
    moving_avg = [sum(rewards[i:i+window])/window
                  for i in range(len(rewards)-window+1)]
    plt.plot(range(window-1, len(rewards)), moving_avg,
             label=f'{window}-Episode Moving Average', linewidth=2)
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.title('Learning Progress')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
Integration with Experiment Tracking¶
For more complex projects, integrate with experiment tracking tools.
# Example with Weights & Biases (wandb)
import os
import wandb

# Initialize experiment tracking
wandb.init(project="cartpole-training", name="q-learning-run-1")

# Log episode statistics
for episode_num in range(num_training_episodes):
    # ... training code ...

    if "episode" in info:
        episode_data = info["episode"]
        wandb.log({
            "episode": episode_num,
            "reward": episode_data['r'],
            "length": episode_data['l'],
            "episode_time": episode_data['t'],
        })

    # Upload videos periodically
    if episode_num % training_period == 0:
        video_path = f"cartpole-training/training-episode-{episode_num}.mp4"
        if os.path.exists(video_path):  # requires the os import added above
            wandb.log({"training_video": wandb.Video(video_path)})
Best Practices Summary¶
For evaluation:
Record every episode for a complete picture of performance
Use multiple seeds for statistical significance
Save both videos and numerical data
Compute confidence intervals for your metrics (see the sketch after this list)
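For those confidence intervals, here is a small hypothetical helper (the function name and the normal-approximation formula are choices made for this sketch, not part of Gymnasium) that turns env.return_queue into a mean plus a 95% interval:

import numpy as np

def mean_with_ci(returns, z=1.96):
    # Mean and half-width of a normal-approximation confidence interval
    returns = np.asarray(returns, dtype=np.float64)
    mean = returns.mean()
    sem = returns.std(ddof=1) / np.sqrt(len(returns))  # standard error of the mean
    return mean, z * sem

# Example usage after the evaluation loop above:
# mean, ci = mean_with_ci(env.return_queue)
# print(f"Average reward: {mean:.2f} ± {ci:.2f} (95% CI)")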
For training:
Record periodically (every 100-1000 episodes)
Focus on episode statistics rather than videos during training
Use adaptive recording triggers for interesting episodes (see the sketch after this list)
Monitor memory usage during long training runs
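One way to build such adaptive triggers, sketched here as a pattern rather than a built-in feature: RecordVideo accepts any callable that takes the episode number and returns a bool, so a small stateful object can combine periodic recording with on-demand recording:

import gymnasium as gym
from gymnasium.wrappers import RecordVideo

class AdaptiveTrigger:
    # Record every `period`-th episode, plus any episode explicitly flagged
    def __init__(self, period=250):
        self.period = period
        self.record_next = False  # set this from the training loop

    def __call__(self, episode_id):
        if self.record_next:
            self.record_next = False
            return True
        return episode_id % self.period == 0

trigger = AdaptiveTrigger(period=250)
env = gym.make("CartPole-v1", render_mode="rgb_array")
env = RecordVideo(env, video_folder="cartpole-training",
                  name_prefix="training", episode_trigger=trigger)
# In the training loop, set trigger.record_next = True after an interesting
# episode (e.g. a new best return) to record the one that follows it.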
For analysis:
Create moving averages to smooth out noisy learning curves
Look for patterns in both successful and failed episodes
Compare agent behavior at different stages of training
Save raw data for later analysis and comparison (see the sketch after this list)
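For saving raw data, one minimal option (the file name and the .npz format are arbitrary choices for this sketch) is to dump the statistics queues with NumPy once a run finishes:

import numpy as np

# Persist the per-episode statistics collected by RecordEpisodeStatistics
np.savez(
    "cartpole-run1-stats.npz",
    returns=np.array(env.return_queue),
    lengths=np.array(env.length_queue),
    times=np.array(env.time_queue),
)

# Later, reload and compare across runs
stats = np.load("cartpole-run1-stats.npz")
print(stats["returns"].mean(), stats["lengths"].mean())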
More Information¶
Recording agent behavior is an essential skill for RL practitioners. It helps you understand what your agent has actually learned, debug training problems, and communicate results effectively. Start with a simple recording setup, and add more sophisticated analysis as your project grows in complexity!