# PPO Agent – LunarLander-v3
A Proximal Policy Optimization (PPO) agent trained to land a spacecraft on the Moon using Stable-Baselines3 and Gymnasium.
**Mean reward: 269.06 ± 19.17** over 10 evaluation episodes, which exceeds the 200-point solve threshold.
## Environment
| Property | Value |
|---|---|
| Environment | LunarLander-v3 |
| Observation space | Box(8,) – position, velocity, angle, angular velocity, leg contacts |
| Action space | Discrete(4) – do nothing, fire left, fire main, fire right |
| Solved threshold | ≥ 200 mean reward |
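You can confirm these spaces directly from a freshly created environment (a minimal sketch; requires the `box2d` extra listed under Dependencies):

```python
import gymnasium as gym

env = gym.make("LunarLander-v3")
print(env.observation_space)  # Box of shape (8,): x, y, vx, vy, angle, angular velocity, left/right leg contact
print(env.action_space)       # Discrete(4)

obs, info = env.reset(seed=0)
print(obs.shape)  # (8,)
env.close()
```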
### Reward breakdown
- Moving closer to the landing pad → higher reward
- Moving slower → higher reward
- Tilting away from horizontal → penalty
- Each leg touching the ground → +10
- Side engine firing → −0.03/frame
- Main engine firing → −0.3/frame
- Crash → −100
- Safe landing → +100
## Training
| Hyperparameter | Value |
|---|---|
| Algorithm | PPO |
| Policy | MlpPolicy (2 × 64 Tanh layers) |
| Total timesteps | 1,000,000 |
| Parallel envs | 16 (vectorized) |
| n_steps | 1024 |
| batch_size | 64 |
| n_epochs | 4 |
| gamma | 0.999 |
| gae_lambda | 0.98 |
| ent_coef | 0.01 |
| learning_rate | 3e-4 (default) |
| Device | CPU |
Training time: ~8 minutes on an Apple M-series CPU.
## Results
| Metric | Value |
|---|---|
| Mean reward (10 episodes) | 269.06 |
| Std reward | ±19.17 |
| Training timesteps | 1,000,000 |
| Final ep_rew_mean (training) | ~268 |
## Usage
### Load and run
```python
from stable_baselines3 import PPO
from huggingface_sb3 import load_from_hub
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
import gymnasium as gym

# Load from Hub
checkpoint = load_from_hub(
    repo_id="shivam3002/ppo-LunarLander-v3",
    filename="ppo-LunarLander-v3.zip",
)
model = PPO.load(checkpoint)

# Evaluate
eval_env = Monitor(gym.make("LunarLander-v3", render_mode="human"))
mean_reward, std_reward = evaluate_policy(
    model, eval_env, n_eval_episodes=10, deterministic=True
)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
eval_env.close()
```
### Render a single episode
```python
import gymnasium as gym
from stable_baselines3 import PPO
from huggingface_sb3 import load_from_hub

checkpoint = load_from_hub("shivam3002/ppo-LunarLander-v3", "ppo-LunarLander-v3.zip")
model = PPO.load(checkpoint)

env = gym.make("LunarLander-v3", render_mode="human")
obs, _ = env.reset()
done = False
total_reward = 0
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Episode reward: {total_reward:.2f}")
env.close()
```
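To save a rollout as a video instead of rendering live, the same loop works with Gymnasium's `RecordVideo` wrapper. A sketch, assuming `moviepy` is available for encoding; the `videos/` folder name is arbitrary:

```python
import gymnasium as gym
from stable_baselines3 import PPO
from huggingface_sb3 import load_from_hub

checkpoint = load_from_hub("shivam3002/ppo-LunarLander-v3", "ppo-LunarLander-v3.zip")
model = PPO.load(checkpoint)

# RecordVideo needs rgb_array frames; it writes one .mp4 per recorded episode
env = gym.wrappers.RecordVideo(
    gym.make("LunarLander-v3", render_mode="rgb_array"),
    video_folder="videos",
    episode_trigger=lambda episode_id: True,  # record every episode
)
obs, _ = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
env.close()  # finalizes the video file in videos/
```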
## Training code
```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
import gymnasium as gym

# Vectorized training env: 16 parallel copies of LunarLander-v3
env = make_vec_env("LunarLander-v3", n_envs=16)

model = PPO(
    policy="MlpPolicy",
    env=env,
    n_steps=1024,
    batch_size=64,
    n_epochs=4,
    gamma=0.999,
    gae_lambda=0.98,
    ent_coef=0.01,
    # learning_rate is left at the SB3 default of 3e-4
    verbose=1,
)
model.learn(total_timesteps=1_000_000)
model.save("ppo-LunarLander-v3")

# Evaluate
eval_env = Monitor(gym.make("LunarLander-v3", render_mode="rgb_array"))
mean_reward, std_reward = evaluate_policy(
    model, eval_env, n_eval_episodes=10, deterministic=True
)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
```
## Dependencies
```text
gymnasium[box2d]>=1.0
stable-baselines3>=2.0
huggingface_sb3
torch
```
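All of these install with pip, e.g. `pip install "gymnasium[box2d]>=1.0" "stable-baselines3>=2.0" huggingface_sb3 torch` (the `box2d` extra may need `swig` available at build time).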