Spaces:
Running
iter4: fix the 'constant belief = free reward' bug + 6 other deep issues
Browse filesInvestigation prompted by external code review. Validated 5/5 critical
findings against actual code; applied fixes for the 7 most impactful.
[1] Anomaly signals NOW REACH THE LLM (Issue 1, was CRITICAL)
Env was computing per-meter anomalies (actual_delta - expected_under_neutral)
and stashing them in reward_breakdown — but format_observation_prompt only
read meters and step_history. StepRecord didn't include anomalies. So the
cleanest profile-inference signal in the env was COMPUTED BUT NEVER SHOWN
to the agent. The agent has been doing meta-learning blindfolded.
Fix: extend StepRecord with 5 anomaly fields, populate in env.step,
surface in prompt and inference.py. The agent now sees how each person's
response deviates from the average baseline — direct profile fingerprint.
[2] Belief baseline subtraction (Issue 4, was CRITICAL — biggest impact)
OLD: reward = (1 - mae) - 0.5
Constant '5 5 5' emission scored +0.336 raw per step (× 3.0 weight =
+1.008 per step = +28 per episode for ZERO learning). This is the
iter-1 mode collapse mechanism in disguise. The agent learned to just
emit constant belief and harvest free reward.
NEW: reward = similarity - constant_baseline_similarity
Constant emission now gets -0.03 per step (negative!). Perfect emission
gets +0.37 per step. The learning gradient is now real (+0.4 gap).
[3] Profile weight cap 0.80 -> 0.45 (Issue 5)
Previously 38% of sampled profiles had one meter > 0.50 weight.
Some weighted vitality/cognition heavily, making SLEEP-spam OPTIMAL
(the env was correctly rewarding it; agent wasn't reward-hacking).
With cap at 0.45, every profile must weight 3+ meters meaningfully.
Forces varied-action strategies to be optimal across all profiles.
[4] Scaled-down shaping (Issue 3)
Iter 3 had -0.3 (3-in-row) / -0.4 (cycle) / +0.2 (new-action) shaping
that swung total reward by ~0.9 — overwhelming the ±0.5 env signal.
GRPO advantage was dominated by 'did this action vary' not 'did this fit
the profile'. Reduced to -0.10 / -0.15 / +0.07 — nudges, not overrides.
[5] Step-0 belief reward skipped (Issue 9)
At step 0, agent has no observation history. Optimal play is constant
prior, which pulled the rest of training toward constant emission.
Skip belief reward when step_index == 0.
[6] Belief-action coupling reward (Issue 10)
Iter 3 made belief generation come BEFORE action so causal attention
would link them, but no reward gradient enforced consistency. Added
explicit ±0.15 bonus/penalty for action choice matching emitted belief
(high social belief + SOCIALIZE = +0.15, high social + MEDITATE = -0.10,
morning belief + DEEP_WORK in morning slot = +0.15, etc.).
Now there's a direct training signal that the belief should INFORM the
action — completing the meta-learning loop.
[7] grader_bias moved out of _compute_reward into env_reward (Issue 11)
Previously the +0.5*Δprogress + 0.4*Δconnection bias lived in the env's
per-step reward, which fed _step_rewards, which fed adaptation_score in
the grader. The 'alignment' was partially self-cancelling.
Now: env per-step reward is pure profile-weighted (uncontaminated
inference signal); grader_bias only shapes the GRPO-visible training
reward. Grader's adaptation_score computed on raw rewards.
All 31 tests pass. Verified math:
- Constant '5 5 5' weighted reward: +1.0 per step -> -0.03 per step
- Perfect belief weighted reward: similar -> +0.37 per step
- Step 0 belief reward: rewarded -> 0 (no info)
Iter 3 still running as control to confirm these are the right fixes.
Iter 4 ready to submit when iter 3 completes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- inference.py +8 -1
- models.py +12 -2
- server/rhythm_environment.py +26 -14
- training/dataset.py +14 -1
- training/reward_functions.py +73 -22
|
@@ -199,14 +199,21 @@ def choose_action_llm(obs, llm_client: OpenAI) -> RhythmAction:
|
|
| 199 |
|
| 200 |
history_lines = []
|
| 201 |
for h in (getattr(obs, "step_history", None) or [])[-5:]:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 202 |
history_lines.append(
|
| 203 |
f" step {h.step}: {h.action} -> reward {h.reward:+.2f} "
|
| 204 |
f"(V{h.vitality_delta:+.2f} C{h.cognition_delta:+.2f} "
|
| 205 |
f"P{h.progress_delta:+.2f} S{h.serenity_delta:+.2f} Cn{h.connection_delta:+.2f})"
|
|
|
|
| 206 |
)
|
| 207 |
history_str = ""
|
| 208 |
if history_lines:
|
| 209 |
-
history_str = "\n\nRecent history:\n" + "\n".join(history_lines)
|
| 210 |
|
| 211 |
user_prompt = textwrap.dedent(f"""\
|
| 212 |
Step: {obs.timestep}/{MAX_STEPS} ({day_name} {slot_name})
|
|
|
|
| 199 |
|
| 200 |
history_lines = []
|
| 201 |
for h in (getattr(obs, "step_history", None) or [])[-5:]:
|
| 202 |
+
# Iter 4 fix: include anomalies for profile-inference signal
|
| 203 |
+
va = getattr(h, "vitality_anomaly", 0.0)
|
| 204 |
+
ca = getattr(h, "cognition_anomaly", 0.0)
|
| 205 |
+
pa = getattr(h, "progress_anomaly", 0.0)
|
| 206 |
+
sa = getattr(h, "serenity_anomaly", 0.0)
|
| 207 |
+
cna = getattr(h, "connection_anomaly", 0.0)
|
| 208 |
history_lines.append(
|
| 209 |
f" step {h.step}: {h.action} -> reward {h.reward:+.2f} "
|
| 210 |
f"(V{h.vitality_delta:+.2f} C{h.cognition_delta:+.2f} "
|
| 211 |
f"P{h.progress_delta:+.2f} S{h.serenity_delta:+.2f} Cn{h.connection_delta:+.2f})"
|
| 212 |
+
f" [anom V{va:+.2f} C{ca:+.2f} P{pa:+.2f} S{sa:+.2f} Cn{cna:+.2f}]"
|
| 213 |
)
|
| 214 |
history_str = ""
|
| 215 |
if history_lines:
|
| 216 |
+
history_str = "\n\nRecent history (anom = profile-inference signal):\n" + "\n".join(history_lines)
|
| 217 |
|
| 218 |
user_prompt = textwrap.dedent(f"""\
|
| 219 |
Step: {obs.timestep}/{MAX_STEPS} ({day_name} {slot_name})
|
|
@@ -50,8 +50,12 @@ class StepRecord(BaseModel):
|
|
| 50 |
"""
|
| 51 |
Record of one completed step included in step_history.
|
| 52 |
|
| 53 |
-
Contains the action taken, the reward received,
|
| 54 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 55 |
"""
|
| 56 |
|
| 57 |
step: int
|
|
@@ -62,6 +66,12 @@ class StepRecord(BaseModel):
|
|
| 62 |
progress_delta: float = 0.0
|
| 63 |
serenity_delta: float = 0.0
|
| 64 |
connection_delta: float = 0.0
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 65 |
|
| 66 |
|
| 67 |
class RhythmObservation(Observation):
|
|
|
|
| 50 |
"""
|
| 51 |
Record of one completed step included in step_history.
|
| 52 |
|
| 53 |
+
Contains the action taken, the reward received, per-meter deltas, AND
|
| 54 |
+
per-meter ANOMALIES (actual_delta - expected_delta_under_neutral_profile).
|
| 55 |
+
The anomalies are the cleanest profile-inference signal — they tell the
|
| 56 |
+
agent how much THIS person's response deviates from the average person.
|
| 57 |
+
Without them, the agent has to back out the profile from raw deltas
|
| 58 |
+
without a baseline to compare against (much harder).
|
| 59 |
"""
|
| 60 |
|
| 61 |
step: int
|
|
|
|
| 66 |
progress_delta: float = 0.0
|
| 67 |
serenity_delta: float = 0.0
|
| 68 |
connection_delta: float = 0.0
|
| 69 |
+
# Iter 4 fix: anomalies (was computed in env but not exposed to agent)
|
| 70 |
+
vitality_anomaly: float = 0.0
|
| 71 |
+
cognition_anomaly: float = 0.0
|
| 72 |
+
progress_anomaly: float = 0.0
|
| 73 |
+
serenity_anomaly: float = 0.0
|
| 74 |
+
connection_anomaly: float = 0.0
|
| 75 |
|
| 76 |
|
| 77 |
class RhythmObservation(Observation):
|
|
@@ -215,8 +215,10 @@ def sample_profile(seed: int) -> Dict[str, Any]:
|
|
| 215 |
raw = [rng.gammavariate(a, 1.0) for a in alphas]
|
| 216 |
total = sum(raw)
|
| 217 |
weights = [w / total for w in raw]
|
| 218 |
-
#
|
| 219 |
-
|
|
|
|
|
|
|
| 220 |
total = sum(weights)
|
| 221 |
weights = [w / total for w in weights]
|
| 222 |
|
|
@@ -528,6 +530,8 @@ class RhythmEnvironment(Environment):
|
|
| 528 |
self._state.active_event = active_event
|
| 529 |
|
| 530 |
# --- 15. Append completed step to rolling history ---
|
|
|
|
|
|
|
| 531 |
self._step_history.append({
|
| 532 |
"step": current_step,
|
| 533 |
"action": action_name,
|
|
@@ -537,6 +541,11 @@ class RhythmEnvironment(Environment):
|
|
| 537 |
"progress_delta": round(deltas["progress"], 4),
|
| 538 |
"serenity_delta": round(deltas["serenity"], 4),
|
| 539 |
"connection_delta": round(deltas["connection"], 4),
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 540 |
})
|
| 541 |
if len(self._step_history) > HISTORY_LENGTH:
|
| 542 |
self._step_history.pop(0)
|
|
@@ -702,20 +711,18 @@ class RhythmEnvironment(Environment):
|
|
| 702 |
self._vitality = max(0.0, self._vitality - vd)
|
| 703 |
|
| 704 |
def _compute_reward(self, deltas: Dict[str, float]) -> float:
|
| 705 |
-
"""Compute
|
| 706 |
-
|
| 707 |
-
Iter
|
| 708 |
-
|
| 709 |
-
|
| 710 |
-
|
| 711 |
-
|
| 712 |
-
|
|
|
|
| 713 |
"""
|
| 714 |
weights = self._profile["reward_weights"]
|
| 715 |
-
|
| 716 |
-
# Grader-aligned bias: scaled so max bonus is ~0.1/step (manageable vs profile_reward)
|
| 717 |
-
grader_bias = 0.5 * deltas["progress"] + 0.4 * deltas["connection"]
|
| 718 |
-
return profile_reward + grader_bias
|
| 719 |
|
| 720 |
def _grade_episode(self) -> float:
|
| 721 |
"""
|
|
@@ -800,6 +807,11 @@ class RhythmEnvironment(Environment):
|
|
| 800 |
progress_delta=h["progress_delta"],
|
| 801 |
serenity_delta=h["serenity_delta"],
|
| 802 |
connection_delta=h["connection_delta"],
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 803 |
)
|
| 804 |
for h in self._step_history
|
| 805 |
]
|
|
|
|
| 215 |
raw = [rng.gammavariate(a, 1.0) for a in alphas]
|
| 216 |
total = sum(raw)
|
| 217 |
weights = [w / total for w in raw]
|
| 218 |
+
# Iter 4 fix: tighter clamp (0.45 max) forces every profile to weight 3+
|
| 219 |
+
# meters meaningfully. Old 0.80 cap allowed single-meter dominant profiles
|
| 220 |
+
# where SLEEP-spam was correctly the optimal play (env wasn't lying).
|
| 221 |
+
weights = [max(0.05, min(0.45, w)) for w in weights]
|
| 222 |
total = sum(weights)
|
| 223 |
weights = [w / total for w in weights]
|
| 224 |
|
|
|
|
| 530 |
self._state.active_event = active_event
|
| 531 |
|
| 532 |
# --- 15. Append completed step to rolling history ---
|
| 533 |
+
# Iter 4 fix: include anomalies (was computed but only stashed in
|
| 534 |
+
# reward_breakdown which the prompt builder never read)
|
| 535 |
self._step_history.append({
|
| 536 |
"step": current_step,
|
| 537 |
"action": action_name,
|
|
|
|
| 541 |
"progress_delta": round(deltas["progress"], 4),
|
| 542 |
"serenity_delta": round(deltas["serenity"], 4),
|
| 543 |
"connection_delta": round(deltas["connection"], 4),
|
| 544 |
+
"vitality_anomaly": round(deltas["vitality"] - expected_no_profile["vitality"], 4),
|
| 545 |
+
"cognition_anomaly": round(deltas["cognition"] - expected_no_profile["cognition"], 4),
|
| 546 |
+
"progress_anomaly": round(deltas["progress"] - expected_no_profile["progress"], 4),
|
| 547 |
+
"serenity_anomaly": round(deltas["serenity"] - expected_no_profile["serenity"], 4),
|
| 548 |
+
"connection_anomaly": round(deltas["connection"] - expected_no_profile["connection"], 4),
|
| 549 |
})
|
| 550 |
if len(self._step_history) > HISTORY_LENGTH:
|
| 551 |
self._step_history.pop(0)
|
|
|
|
| 711 |
self._vitality = max(0.0, self._vitality - vd)
|
| 712 |
|
| 713 |
def _compute_reward(self, deltas: Dict[str, float]) -> float:
|
| 714 |
+
"""Compute pure profile-weighted per-step reward.
|
| 715 |
+
|
| 716 |
+
Iter 4 fix: REMOVED the grader_bias term from here (moved to the
|
| 717 |
+
TRAINING reward function in reward_functions.py). Keeping the env's
|
| 718 |
+
per-step reward pure means:
|
| 719 |
+
- Inference signal (which depends on profile_weights) is uncontaminated
|
| 720 |
+
- Grader's adaptation_score isn't computed on biased rewards (no
|
| 721 |
+
self-cancelling alignment)
|
| 722 |
+
- The env's reward semantics match what an honest deployment would see
|
| 723 |
"""
|
| 724 |
weights = self._profile["reward_weights"]
|
| 725 |
+
return sum(deltas[m] * weights[m] for m in METERS) * REWARD_SCALE
|
|
|
|
|
|
|
|
|
|
| 726 |
|
| 727 |
def _grade_episode(self) -> float:
|
| 728 |
"""
|
|
|
|
| 807 |
progress_delta=h["progress_delta"],
|
| 808 |
serenity_delta=h["serenity_delta"],
|
| 809 |
connection_delta=h["connection_delta"],
|
| 810 |
+
vitality_anomaly=h.get("vitality_anomaly", 0.0),
|
| 811 |
+
cognition_anomaly=h.get("cognition_anomaly", 0.0),
|
| 812 |
+
progress_anomaly=h.get("progress_anomaly", 0.0),
|
| 813 |
+
serenity_anomaly=h.get("serenity_anomaly", 0.0),
|
| 814 |
+
connection_anomaly=h.get("connection_anomaly", 0.0),
|
| 815 |
)
|
| 816 |
for h in self._step_history
|
| 817 |
]
|
|
@@ -72,14 +72,27 @@ def format_observation_prompt(obs, profile_hint: dict | None = None) -> str:
|
|
| 72 |
|
| 73 |
history_lines = []
|
| 74 |
for h in (obs.step_history or [])[-5:]: # last 5 only to fit prompt budget
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 75 |
history_lines.append(
|
| 76 |
f" step {h.step}: {h.action} -> reward {h.reward:+.2f} "
|
| 77 |
f"(V{h.vitality_delta:+.2f} C{h.cognition_delta:+.2f} "
|
| 78 |
f"P{h.progress_delta:+.2f} S{h.serenity_delta:+.2f} Cn{h.connection_delta:+.2f})"
|
|
|
|
| 79 |
)
|
| 80 |
history_str = ""
|
| 81 |
if history_lines:
|
| 82 |
-
history_str =
|
|
|
|
|
|
|
|
|
|
| 83 |
|
| 84 |
hint_str = ""
|
| 85 |
if profile_hint is not None:
|
|
|
|
| 72 |
|
| 73 |
history_lines = []
|
| 74 |
for h in (obs.step_history or [])[-5:]: # last 5 only to fit prompt budget
|
| 75 |
+
# Iter 4 fix: include ANOMALIES (actual_delta - expected_under_neutral_profile).
|
| 76 |
+
# Anomalies are the cleanest profile-inference signal: they show how
|
| 77 |
+
# this person's response DEVIATES from average. Previously the env
|
| 78 |
+
# computed these but never exposed them to the agent.
|
| 79 |
+
anom_str = (
|
| 80 |
+
f" [anom V{h.vitality_anomaly:+.2f} C{h.cognition_anomaly:+.2f} "
|
| 81 |
+
f"P{h.progress_anomaly:+.2f} S{h.serenity_anomaly:+.2f} "
|
| 82 |
+
f"Cn{h.connection_anomaly:+.2f}]"
|
| 83 |
+
)
|
| 84 |
history_lines.append(
|
| 85 |
f" step {h.step}: {h.action} -> reward {h.reward:+.2f} "
|
| 86 |
f"(V{h.vitality_delta:+.2f} C{h.cognition_delta:+.2f} "
|
| 87 |
f"P{h.progress_delta:+.2f} S{h.serenity_delta:+.2f} Cn{h.connection_delta:+.2f})"
|
| 88 |
+
f"{anom_str}"
|
| 89 |
)
|
| 90 |
history_str = ""
|
| 91 |
if history_lines:
|
| 92 |
+
history_str = (
|
| 93 |
+
"\n\nRecent history (anom = how this person deviated from neutral baseline):\n"
|
| 94 |
+
+ "\n".join(history_lines)
|
| 95 |
+
)
|
| 96 |
|
| 97 |
hint_str = ""
|
| 98 |
if profile_hint is not None:
|
|
@@ -238,31 +238,66 @@ def env_reward(
|
|
| 238 |
|
| 239 |
try:
|
| 240 |
env = _replay_env(ep_seed, ep_history, ep_mode)
|
|
|
|
|
|
|
|
|
|
| 241 |
obs = env.step(RhythmAction(action_type=action_type))
|
| 242 |
reward = obs.reward
|
| 243 |
chosen = action_type.value
|
| 244 |
|
| 245 |
-
# Iter
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 246 |
if ep_history and len(ep_history) >= 2:
|
| 247 |
recent3 = ep_history[-3:]
|
| 248 |
-
if recent3.count(chosen) >= 2:
|
| 249 |
-
reward -= 0.3
|
| 250 |
|
| 251 |
-
# Iter 3 fix: N-CYCLE penalty (catches the M-E-M-E-... loop iter 2 fell into)
|
| 252 |
-
# If last 6 actions (including this one) have <=2 unique values, apply penalty
|
| 253 |
if ep_history and len(ep_history) >= 5:
|
| 254 |
last6 = ep_history[-5:] + [chosen]
|
| 255 |
if len(set(last6)) <= 2:
|
| 256 |
-
reward -= 0.4
|
| 257 |
|
| 258 |
-
# Iter 3 fix: NEW-ACTION exploration bonus
|
| 259 |
-
# If this action hasn't appeared yet in the current episode, +0.2.
|
| 260 |
-
# Strong incentive in early steps to TRY varied actions, fading as
|
| 261 |
-
# the action set grows. Stops once 6+ different actions tried.
|
| 262 |
if ep_history is not None:
|
| 263 |
seen = set(ep_history)
|
| 264 |
if chosen not in seen and len(seen) < 6:
|
| 265 |
-
reward += 0.2
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 266 |
|
| 267 |
scores.append(reward)
|
| 268 |
except Exception:
|
|
@@ -283,14 +318,19 @@ def belief_accuracy(
|
|
| 283 |
"""
|
| 284 |
Layer 4: Belief-vector accuracy reward (META-LEARNING signal).
|
| 285 |
|
| 286 |
-
|
| 287 |
-
|
| 288 |
-
|
| 289 |
-
|
| 290 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 291 |
|
| 292 |
-
|
| 293 |
-
|
| 294 |
"""
|
| 295 |
scores = []
|
| 296 |
for i, completion in enumerate(completions):
|
|
@@ -298,24 +338,35 @@ def belief_accuracy(
|
|
| 298 |
_, belief, belief_provided = extract_action_and_belief(response)
|
| 299 |
|
| 300 |
if not belief_provided:
|
| 301 |
-
scores.append(-0.
|
| 302 |
continue
|
| 303 |
|
| 304 |
-
# Resolve seed/mode for replay
|
| 305 |
if seed is not None and i < len(seed):
|
| 306 |
ep_seed = seed[i]
|
| 307 |
ep_history = action_history[i] if action_history is not None else []
|
| 308 |
ep_mode = profile_mode[i] if (profile_mode is not None and i < len(profile_mode)) else "continuous"
|
|
|
|
| 309 |
else:
|
| 310 |
scores.append(0.0)
|
| 311 |
continue
|
| 312 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 313 |
try:
|
| 314 |
env = _replay_env(ep_seed, ep_history, ep_mode)
|
| 315 |
true_belief = env.get_belief_target()
|
|
|
|
| 316 |
mae = sum(abs(b - t) for b, t in zip(belief, true_belief)) / 3.0
|
| 317 |
-
similarity = 1.0 - mae
|
| 318 |
-
|
|
|
|
|
|
|
|
|
|
| 319 |
except Exception:
|
| 320 |
scores.append(0.0)
|
| 321 |
|
|
|
|
| 238 |
|
| 239 |
try:
|
| 240 |
env = _replay_env(ep_seed, ep_history, ep_mode)
|
| 241 |
+
# Capture pre-step meters so we can compute deltas for the bias
|
| 242 |
+
pre_progress = env._progress
|
| 243 |
+
pre_connection = env._connection
|
| 244 |
obs = env.step(RhythmAction(action_type=action_type))
|
| 245 |
reward = obs.reward
|
| 246 |
chosen = action_type.value
|
| 247 |
|
| 248 |
+
# Iter 4 fix (Issue 11): grader-aligned bias moved here from env.
|
| 249 |
+
# Per-step env reward stays pure (drives belief inference); the
|
| 250 |
+
# bias only shapes the GRPO-visible training reward.
|
| 251 |
+
progress_delta = env._progress - pre_progress
|
| 252 |
+
connection_delta = env._connection - pre_connection
|
| 253 |
+
reward += 0.5 * progress_delta + 0.4 * connection_delta
|
| 254 |
+
|
| 255 |
+
# Iter 4 fix (Issue 3): SCALED-DOWN shaping. Iter 3 had
|
| 256 |
+
# -0.3/-0.4/+0.2 which dominated the ±0.5 env signal. Now
|
| 257 |
+
# roughly 1/3 of original magnitudes — nudges, not overrides.
|
| 258 |
if ep_history and len(ep_history) >= 2:
|
| 259 |
recent3 = ep_history[-3:]
|
| 260 |
+
if recent3.count(chosen) >= 2:
|
| 261 |
+
reward -= 0.10 # was -0.3
|
| 262 |
|
|
|
|
|
|
|
| 263 |
if ep_history and len(ep_history) >= 5:
|
| 264 |
last6 = ep_history[-5:] + [chosen]
|
| 265 |
if len(set(last6)) <= 2:
|
| 266 |
+
reward -= 0.15 # was -0.4
|
| 267 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 268 |
if ep_history is not None:
|
| 269 |
seen = set(ep_history)
|
| 270 |
if chosen not in seen and len(seen) < 6:
|
| 271 |
+
reward += 0.07 # was +0.2
|
| 272 |
+
|
| 273 |
+
# Iter 4 fix (Issue 10): BELIEF-ACTION COUPLING reward.
|
| 274 |
+
# Parse the agent's emitted belief and reward consistency between
|
| 275 |
+
# belief and action choice. Without this, the belief-first format
|
| 276 |
+
# only enforces consistency via causal attention (weak); now there's
|
| 277 |
+
# an explicit gradient signal.
|
| 278 |
+
_, b, b_provided = extract_action_and_belief(response)
|
| 279 |
+
if b_provided:
|
| 280 |
+
s_pref, m_pref, w_pref = b
|
| 281 |
+
# High social → social actions; low social → solo actions
|
| 282 |
+
if s_pref > 0.65 and chosen in {"socialize", "family_time"}:
|
| 283 |
+
reward += 0.15
|
| 284 |
+
elif s_pref < 0.35 and chosen in {"meditate", "me_time"}:
|
| 285 |
+
reward += 0.15
|
| 286 |
+
elif s_pref > 0.65 and chosen in {"meditate", "me_time"}:
|
| 287 |
+
reward -= 0.10 # belief says extrovert, action says solo
|
| 288 |
+
elif s_pref < 0.35 and chosen in {"socialize", "family_time"}:
|
| 289 |
+
reward -= 0.10 # belief says introvert, action says social
|
| 290 |
+
|
| 291 |
+
# High morning + morning slot + work → bonus
|
| 292 |
+
slot = obs.slot if hasattr(obs, "slot") else 0
|
| 293 |
+
if m_pref > 0.65 and slot == 0 and chosen in {"deep_work", "learn"}:
|
| 294 |
+
reward += 0.15
|
| 295 |
+
elif m_pref < 0.35 and slot in (2, 3) and chosen in {"deep_work", "learn"}:
|
| 296 |
+
reward += 0.15
|
| 297 |
+
|
| 298 |
+
# High work → work actions
|
| 299 |
+
if w_pref > 0.65 and chosen in {"deep_work", "learn", "admin_work"}:
|
| 300 |
+
reward += 0.15
|
| 301 |
|
| 302 |
scores.append(reward)
|
| 303 |
except Exception:
|
|
|
|
| 318 |
"""
|
| 319 |
Layer 4: Belief-vector accuracy reward (META-LEARNING signal).
|
| 320 |
|
| 321 |
+
ITER 4 FIX (Issue 4 from external review): The previous formula
|
| 322 |
+
`(1 - mae) - 0.5` gave constant emission "5 5 5" a free +0.336 reward
|
| 323 |
+
per step (× 3.0 weight = +1.0 per step = +28 per episode for ZERO learning).
|
| 324 |
+
This recreated the iter-1 collapse mechanism in disguise.
|
| 325 |
+
|
| 326 |
+
New formula: subtract the per-profile baseline. The baseline is what a
|
| 327 |
+
constant 0.5 emission WOULD score for THIS profile. Now:
|
| 328 |
+
- Constant emission → reward ≈ 0 (no free reward)
|
| 329 |
+
- Better-than-baseline belief → positive
|
| 330 |
+
- Worse-than-baseline belief → negative
|
| 331 |
|
| 332 |
+
Plus iter 4 (Issue 9): no belief reward at step 0 (no information available
|
| 333 |
+
to commit a belief) — prevents pulling the policy toward a constant prior.
|
| 334 |
"""
|
| 335 |
scores = []
|
| 336 |
for i, completion in enumerate(completions):
|
|
|
|
| 338 |
_, belief, belief_provided = extract_action_and_belief(response)
|
| 339 |
|
| 340 |
if not belief_provided:
|
| 341 |
+
scores.append(-0.1) # weak push toward emitting belief
|
| 342 |
continue
|
| 343 |
|
| 344 |
+
# Resolve seed/mode/step for replay
|
| 345 |
if seed is not None and i < len(seed):
|
| 346 |
ep_seed = seed[i]
|
| 347 |
ep_history = action_history[i] if action_history is not None else []
|
| 348 |
ep_mode = profile_mode[i] if (profile_mode is not None and i < len(profile_mode)) else "continuous"
|
| 349 |
+
ep_step = step_index[i] if (step_index is not None and i < len(step_index)) else 0
|
| 350 |
else:
|
| 351 |
scores.append(0.0)
|
| 352 |
continue
|
| 353 |
|
| 354 |
+
# Iter 4 fix (Issue 9): step-0 commitment with no info biases toward
|
| 355 |
+
# constant prior. Skip belief reward at step 0.
|
| 356 |
+
if ep_step == 0:
|
| 357 |
+
scores.append(0.0)
|
| 358 |
+
continue
|
| 359 |
+
|
| 360 |
try:
|
| 361 |
env = _replay_env(ep_seed, ep_history, ep_mode)
|
| 362 |
true_belief = env.get_belief_target()
|
| 363 |
+
# Iter 4 fix (Issue 4): subtract the constant-baseline reward
|
| 364 |
mae = sum(abs(b - t) for b, t in zip(belief, true_belief)) / 3.0
|
| 365 |
+
similarity = 1.0 - mae
|
| 366 |
+
baseline_mae = sum(abs(0.5 - t) for t in true_belief) / 3.0
|
| 367 |
+
baseline_similarity = 1.0 - baseline_mae
|
| 368 |
+
# Reward = how much better than the constant-emit baseline
|
| 369 |
+
scores.append(similarity - baseline_similarity)
|
| 370 |
except Exception:
|
| 371 |
scores.append(0.0)
|
| 372 |
|