PPO Agent β€” LunarLander-v3 πŸŒ•

A Proximal Policy Optimization (PPO) agent trained to land a spacecraft on the Moon using Stable-Baselines3 and Gymnasium.

Mean Reward: 269.06 Β± 19.17 over 10 evaluation episodes β€” exceeds the 200-point solve threshold.


Environment

Property            Value
------------------  -------------------------------------------------------------------------------
Environment         LunarLander-v3
Observation space   Box(8,) — position, velocity, angle, angular velocity, leg contacts
Action space        Discrete(4) — do nothing, fire left engine, fire main engine, fire right engine
Solved threshold    ≥ 200 mean reward
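For reference, the eight observation components are ordered as below (per the Gymnasium LunarLander documentation); a minimal sketch that pairs each index with a name:

```python
# Index meaning of the 8-dimensional LunarLander observation vector,
# following the Gymnasium documentation's ordering.
OBS_FIELDS = [
    "x_position",        # horizontal coordinate (landing pad at x = 0)
    "y_position",        # vertical coordinate
    "x_velocity",
    "y_velocity",
    "angle",             # lander tilt, in radians
    "angular_velocity",
    "left_leg_contact",  # 1.0 if the leg touches the ground, else 0.0
    "right_leg_contact",
]

def describe(obs):
    """Pair each observation component with its field name."""
    return dict(zip(OBS_FIELDS, obs))

# Example with a dummy observation vector (illustrative, not a real rollout):
print(describe([0.0, 1.4, 0.0, -0.5, 0.0, 0.0, 0.0, 0.0]))
```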

Reward breakdown

  • Closer to landing pad β†’ higher reward
  • Slower movement β†’ higher reward
  • Tilted angle β†’ penalty
  • Each leg touching ground β†’ +10
  • Side engine firing β†’ βˆ’0.03/frame
  • Main engine firing β†’ βˆ’0.3/frame
  • Crash β†’ βˆ’100 | Safe landing β†’ +100

Training

Hyperparameter    Value
----------------  ------------------------------
Algorithm         PPO
Policy            MlpPolicy (2 Γ— 64 Tanh layers)
Total timesteps   1,000,000
Parallel envs     16 (vectorized)
n_steps           1024
batch_size        64
n_epochs          4
gamma             0.999
gae_lambda        0.98
ent_coef          0.01
learning_rate     3e-4 (default)
Device            CPU

Training time: ~8 minutes on Apple M-series CPU.
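With these settings, each PPO update consumes n_envs Γ— n_steps transitions. A quick sanity check of the rollout math (my arithmetic from the table above, not taken from SB3 logs):

```python
import math

n_envs, n_steps = 16, 1024
batch_size, n_epochs = 64, 4
total_timesteps = 1_000_000

transitions_per_update = n_envs * n_steps                      # 16,384
minibatches_per_epoch = transitions_per_update // batch_size   # 256
gradient_steps_per_update = minibatches_per_epoch * n_epochs   # 1,024

# SB3 keeps collecting full rollouts until total_timesteps is reached,
# so the run performs ceil(1e6 / 16384) updates.
n_updates = math.ceil(total_timesteps / transitions_per_update)

print(transitions_per_update, gradient_steps_per_update, n_updates)   # β†’ 16384 1024 62
```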


Results

Metric                        Value
----------------------------  ----------
Mean reward (10 episodes)     269.06
Std reward                    Β±19.17
Training timesteps            1,000,000
Final ep_rew_mean (training)  ~268
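A quick check that the result clears the solve threshold even pessimistically (simple arithmetic on the reported numbers):

```python
mean_reward, std_reward = 269.06, 19.17
threshold = 200

# Even one standard deviation below the mean stays above the threshold.
pessimistic = mean_reward - std_reward   # 249.89

print(pessimistic > threshold)   # β†’ True
```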

Usage

Load and run

from stable_baselines3 import PPO
from huggingface_sb3 import load_from_hub
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
import gymnasium as gym

# Load from Hub
checkpoint = load_from_hub(
    repo_id="shivam3002/ppo-LunarLander-v3",
    filename="ppo-LunarLander-v3.zip",
)
model = PPO.load(checkpoint)

# Evaluate
eval_env = Monitor(gym.make("LunarLander-v3", render_mode="human"))
mean_reward, std_reward = evaluate_policy(
    model, eval_env, n_eval_episodes=10, deterministic=True
)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
eval_env.close()

Render a single episode

import gymnasium as gym
from stable_baselines3 import PPO
from huggingface_sb3 import load_from_hub

checkpoint = load_from_hub("shivam3002/ppo-LunarLander-v3", "ppo-LunarLander-v3.zip")
model = PPO.load(checkpoint)

env = gym.make("LunarLander-v3", render_mode="human")
obs, _ = env.reset()
done = False
total_reward = 0

while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Episode reward: {total_reward:.2f}")
env.close()

Training code

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
import gymnasium as gym

# Vectorized training env
env = make_vec_env("LunarLander-v3", n_envs=16)

model = PPO(
    policy="MlpPolicy",
    env=env,
    n_steps=1024,
    batch_size=64,
    n_epochs=4,
    gamma=0.999,
    gae_lambda=0.98,
    ent_coef=0.01,
    verbose=1,
)

model.learn(total_timesteps=1_000_000)
model.save("ppo-LunarLander-v3")

# Evaluate
eval_env = Monitor(gym.make("LunarLander-v3", render_mode="rgb_array"))
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
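This run used SB3's default constant learning rate (3e-4). Stable-Baselines3 also accepts a callable mapping remaining progress (1.0 at the start of training, 0.0 at the end) to a learning rate, so a linearly decayed variant could be sketched as below. This is an alternative for experimentation, not what produced the reported results:

```python
def linear_schedule(initial_lr: float):
    """Return an SB3-compatible schedule.

    SB3 calls the schedule with progress_remaining, which goes
    from 1.0 (start of training) down to 0.0 (end of training).
    """
    def schedule(progress_remaining: float) -> float:
        return progress_remaining * initial_lr
    return schedule

# Would be passed as:
#   PPO("MlpPolicy", env, learning_rate=linear_schedule(3e-4), ...)
print(linear_schedule(3e-4)(1.0))   # start of training: full 3e-4
print(linear_schedule(3e-4)(0.5))   # halfway: 1.5e-4
```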

Dependencies

gymnasium[box2d]>=1.0
stable-baselines3>=2.0
huggingface_sb3
torch
