PPO Agent β€” LunarLander-v3 πŸŒ•

A Proximal Policy Optimization (PPO) agent trained to land a spacecraft on the Moon using Stable-Baselines3 and Gymnasium.

Mean Reward: 269.06 Β± 19.17 over 10 evaluation episodes β€” exceeds the 200-point solve threshold.


Environment

Property            Value
------------------  -------------------------------------------------------------------------------
Environment         LunarLander-v3
Observation space   Box(8,) — position, velocity, angle, angular velocity, leg contacts
Action space        Discrete(4) — do nothing, fire left engine, fire main engine, fire right engine
Solved threshold    ≥ 200 mean reward
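For reference, the eight observation components are ordered as below (per the Gymnasium LunarLander documentation); a minimal sketch that pairs each index with a name:

```python
# Index meaning of the 8-dimensional LunarLander observation vector,
# following the Gymnasium documentation's ordering.
OBS_FIELDS = [
    "x_position",        # horizontal coordinate (landing pad at x = 0)
    "y_position",        # vertical coordinate
    "x_velocity",
    "y_velocity",
    "angle",             # lander tilt, in radians
    "angular_velocity",
    "left_leg_contact",  # 1.0 if the leg touches the ground, else 0.0
    "right_leg_contact",
]

def describe(obs):
    """Pair each observation component with its field name."""
    return dict(zip(OBS_FIELDS, obs))

# Example with a dummy observation vector (illustrative, not a real rollout):
print(describe([0.0, 1.4, 0.0, -0.5, 0.0, 0.0, 0.0, 0.0]))
```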

Reward breakdown

  • Closer to landing pad β†’ higher reward
  • Slower movement β†’ higher reward
  • Tilted angle β†’ penalty
  • Each leg touching ground β†’ +10
  • Side engine firing β†’ βˆ’0.03/frame
  • Main engine firing β†’ βˆ’0.3/frame
  • Crash β†’ βˆ’100 | Safe landing β†’ +100

Training

Hyperparameter    Value
----------------  ------------------------------
Algorithm         PPO
Policy            MlpPolicy (2 Γ— 64 Tanh layers)
Total timesteps   1,000,000
Parallel envs     16 (vectorized)
n_steps           1024
batch_size        64
n_epochs          4
gamma             0.999
gae_lambda        0.98
ent_coef          0.01
learning_rate     3e-4 (default)
Device            CPU

Training time: ~8 minutes on Apple M-series CPU.
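With these settings, each PPO update consumes n_envs Γ— n_steps transitions. A quick sanity check of the rollout math (my arithmetic from the table above, not taken from SB3 logs):

```python
import math

n_envs, n_steps = 16, 1024
batch_size, n_epochs = 64, 4
total_timesteps = 1_000_000

transitions_per_update = n_envs * n_steps                      # 16,384
minibatches_per_epoch = transitions_per_update // batch_size   # 256
gradient_steps_per_update = minibatches_per_epoch * n_epochs   # 1,024

# SB3 keeps collecting full rollouts until total_timesteps is reached,
# so the run performs ceil(1e6 / 16384) updates.
n_updates = math.ceil(total_timesteps / transitions_per_update)

print(transitions_per_update, gradient_steps_per_update, n_updates)   # β†’ 16384 1024 62
```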


Results

Metric                        Value
----------------------------  ----------
Mean reward (10 episodes)     269.06
Std reward                    Β±19.17
Training timesteps            1,000,000
Final ep_rew_mean (training)  ~268
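A quick check that the result clears the solve threshold even pessimistically (simple arithmetic on the reported numbers):

```python
mean_reward, std_reward = 269.06, 19.17
threshold = 200

# Even one standard deviation below the mean stays above the threshold.
pessimistic = mean_reward - std_reward   # 249.89

print(pessimistic > threshold)   # β†’ True
```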

Usage

Load and run

from stable_baselines3 import PPO
from huggingface_sb3 import load_from_hub
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
import gymnasium as gym

# Load from Hub
checkpoint = load_from_hub(
    repo_id="shivam3002/ppo-LunarLander-v3",
    filename="ppo-LunarLander-v3.zip",
)
model = PPO.load(checkpoint)

# Evaluate
eval_env = Monitor(gym.make("LunarLander-v3", render_mode="human"))
mean_reward, std_reward = evaluate_policy(
    model, eval_env, n_eval_episodes=10, deterministic=True
)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
eval_env.close()

Render a single episode

import gymnasium as gym
from stable_baselines3 import PPO
from huggingface_sb3 import load_from_hub

checkpoint = load_from_hub("shivam3002/ppo-LunarLander-v3", "ppo-LunarLander-v3.zip")
model = PPO.load(checkpoint)

env = gym.make("LunarLander-v3", render_mode="human")
obs, _ = env.reset()
done = False
total_reward = 0

while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Episode reward: {total_reward:.2f}")
env.close()

Training code

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
import gymnasium as gym

# Vectorized training env
env = make_vec_env("LunarLander-v3", n_envs=16)

model = PPO(
    policy="MlpPolicy",
    env=env,
    n_steps=1024,
    batch_size=64,
    n_epochs=4,
    gamma=0.999,
    gae_lambda=0.98,
    ent_coef=0.01,
    verbose=1,
)

model.learn(total_timesteps=1_000_000)
model.save("ppo-LunarLander-v3")

# Evaluate
eval_env = Monitor(gym.make("LunarLander-v3", render_mode="rgb_array"))
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
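This run used SB3's default constant learning rate (3e-4). Stable-Baselines3 also accepts a callable mapping remaining progress (1.0 at the start of training, 0.0 at the end) to a learning rate, so a linearly decayed variant could be sketched as below. This is an alternative for experimentation, not what produced the reported results:

```python
def linear_schedule(initial_lr: float):
    """Return an SB3-compatible schedule.

    SB3 calls the schedule with progress_remaining, which goes
    from 1.0 (start of training) down to 0.0 (end of training).
    """
    def schedule(progress_remaining: float) -> float:
        return progress_remaining * initial_lr
    return schedule

# Would be passed as:
#   PPO("MlpPolicy", env, learning_rate=linear_schedule(3e-4), ...)
print(linear_schedule(3e-4)(1.0))   # start of training: full 3e-4
print(linear_schedule(3e-4)(0.5))   # halfway: 1.5e-4
```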

Dependencies

gymnasium[box2d]>=1.0
stable-baselines3>=2.0
huggingface_sb3
torch
