InosLihka Claude Opus 4.7 (1M context) commited on
Commit
bb2a9c7
·
1 Parent(s): d6d9e31

iter4: fix the 'constant belief = free reward' bug + 6 other deep issues

Browse files

Investigation prompted by external code review. Validated 5/5 critical
findings against actual code; applied fixes for the 7 most impactful.

[1] Anomaly signals NOW REACH THE LLM (Issue 1, was CRITICAL)
Env was computing per-meter anomalies (actual_delta - expected_under_neutral)
and stashing them in reward_breakdown — but format_observation_prompt only
read meters and step_history. StepRecord didn't include anomalies. So the
cleanest profile-inference signal in the env was COMPUTED BUT NEVER SHOWN
to the agent. The agent has been doing meta-learning blindfolded.
Fix: extend StepRecord with 5 anomaly fields, populate in env.step,
surface in prompt and inference.py. The agent now sees how each person's
response deviates from the average baseline — direct profile fingerprint.

[2] Belief baseline subtraction (Issue 4, was CRITICAL — biggest impact)
OLD: reward = (1 - mae) - 0.5
Constant '5 5 5' emission scored +0.336 raw per step (× 3.0 weight =
+1.008 per step = +28 per episode for ZERO learning). This is the
iter-1 mode collapse mechanism in disguise. The agent learned to just
emit constant belief and harvest free reward.
NEW: reward = similarity - constant_baseline_similarity
Constant emission now gets -0.03 per step (negative!). Perfect emission
gets +0.37 per step. The learning gradient is now real (+0.4 gap).

[3] Profile weight cap 0.80 -> 0.45 (Issue 5)
Previously 38% of sampled profiles had one meter > 0.50 weight.
Some weighted vitality/cognition heavily, making SLEEP-spam OPTIMAL
(the env was correctly rewarding it; agent wasn't reward-hacking).
With cap at 0.45, every profile must weight 3+ meters meaningfully.
Forces varied-action strategies to be optimal across all profiles.

[4] Scaled-down shaping (Issue 3)
Iter 3 had -0.3 (3-in-row) / -0.4 (cycle) / +0.2 (new-action) shaping
that swung total reward by ~0.9 — overwhelming the ±0.5 env signal.
GRPO advantage was dominated by 'did this action vary' not 'did this fit
the profile'. Reduced to -0.10 / -0.15 / +0.07 — nudges, not overrides.

[5] Step-0 belief reward skipped (Issue 9)
At step 0, agent has no observation history. Optimal play is constant
prior, which pulled the rest of training toward constant emission.
Skip belief reward when step_index == 0.

[6] Belief-action coupling reward (Issue 10)
Iter 3 made belief generation come BEFORE action so causal attention
would link them, but no reward gradient enforced consistency. Added
explicit ±0.15 bonus/penalty for action choice matching emitted belief
(high social belief + SOCIALIZE = +0.15, high social + MEDITATE = -0.10,
morning belief + DEEP_WORK in morning slot = +0.15, etc.).
Now there's a direct training signal that the belief should INFORM the
action — completing the meta-learning loop.

[7] grader_bias moved out of _compute_reward into env_reward (Issue 11)
Previously the +0.5*Δprogress + 0.4*Δconnection bias lived in the env's
per-step reward, which fed _step_rewards, which fed adaptation_score in
the grader. The 'alignment' was partially self-cancelling.
Now: env per-step reward is pure profile-weighted (uncontaminated
inference signal); grader_bias only shapes the GRPO-visible training
reward. Grader's adaptation_score computed on raw rewards.

All 31 tests pass. Verified math:
- Constant '5 5 5' weighted reward: +1.0 per step -> -0.03 per step
- Perfect belief weighted reward: similar -> +0.37 per step
- Step 0 belief reward: rewarded -> 0 (no info)

Iter 3 still running as control to confirm these are the right fixes.
Iter 4 ready to submit when iter 3 completes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

inference.py CHANGED
@@ -199,14 +199,21 @@ def choose_action_llm(obs, llm_client: OpenAI) -> RhythmAction:
199
 
200
  history_lines = []
201
  for h in (getattr(obs, "step_history", None) or [])[-5:]:
 
 
 
 
 
 
202
  history_lines.append(
203
  f" step {h.step}: {h.action} -> reward {h.reward:+.2f} "
204
  f"(V{h.vitality_delta:+.2f} C{h.cognition_delta:+.2f} "
205
  f"P{h.progress_delta:+.2f} S{h.serenity_delta:+.2f} Cn{h.connection_delta:+.2f})"
 
206
  )
207
  history_str = ""
208
  if history_lines:
209
- history_str = "\n\nRecent history:\n" + "\n".join(history_lines)
210
 
211
  user_prompt = textwrap.dedent(f"""\
212
  Step: {obs.timestep}/{MAX_STEPS} ({day_name} {slot_name})
 
199
 
200
  history_lines = []
201
  for h in (getattr(obs, "step_history", None) or [])[-5:]:
202
+ # Iter 4 fix: include anomalies for profile-inference signal
203
+ va = getattr(h, "vitality_anomaly", 0.0)
204
+ ca = getattr(h, "cognition_anomaly", 0.0)
205
+ pa = getattr(h, "progress_anomaly", 0.0)
206
+ sa = getattr(h, "serenity_anomaly", 0.0)
207
+ cna = getattr(h, "connection_anomaly", 0.0)
208
  history_lines.append(
209
  f" step {h.step}: {h.action} -> reward {h.reward:+.2f} "
210
  f"(V{h.vitality_delta:+.2f} C{h.cognition_delta:+.2f} "
211
  f"P{h.progress_delta:+.2f} S{h.serenity_delta:+.2f} Cn{h.connection_delta:+.2f})"
212
+ f" [anom V{va:+.2f} C{ca:+.2f} P{pa:+.2f} S{sa:+.2f} Cn{cna:+.2f}]"
213
  )
214
  history_str = ""
215
  if history_lines:
216
+ history_str = "\n\nRecent history (anom = profile-inference signal):\n" + "\n".join(history_lines)
217
 
218
  user_prompt = textwrap.dedent(f"""\
219
  Step: {obs.timestep}/{MAX_STEPS} ({day_name} {slot_name})
models.py CHANGED
@@ -50,8 +50,12 @@ class StepRecord(BaseModel):
50
  """
51
  Record of one completed step included in step_history.
52
 
53
- Contains the action taken, the reward received, and per-meter deltas.
54
- The agent uses this history to detect personality anomalies over time.
 
 
 
 
55
  """
56
 
57
  step: int
@@ -62,6 +66,12 @@ class StepRecord(BaseModel):
62
  progress_delta: float = 0.0
63
  serenity_delta: float = 0.0
64
  connection_delta: float = 0.0
 
 
 
 
 
 
65
 
66
 
67
  class RhythmObservation(Observation):
 
50
  """
51
  Record of one completed step included in step_history.
52
 
53
+ Contains the action taken, the reward received, per-meter deltas, AND
54
+ per-meter ANOMALIES (actual_delta - expected_delta_under_neutral_profile).
55
+ The anomalies are the cleanest profile-inference signal — they tell the
56
+ agent how much THIS person's response deviates from the average person.
57
+ Without them, the agent has to back out the profile from raw deltas
58
+ without a baseline to compare against (much harder).
59
  """
60
 
61
  step: int
 
66
  progress_delta: float = 0.0
67
  serenity_delta: float = 0.0
68
  connection_delta: float = 0.0
69
+ # Iter 4 fix: anomalies (was computed in env but not exposed to agent)
70
+ vitality_anomaly: float = 0.0
71
+ cognition_anomaly: float = 0.0
72
+ progress_anomaly: float = 0.0
73
+ serenity_anomaly: float = 0.0
74
+ connection_anomaly: float = 0.0
75
 
76
 
77
  class RhythmObservation(Observation):
server/rhythm_environment.py CHANGED
@@ -215,8 +215,10 @@ def sample_profile(seed: int) -> Dict[str, Any]:
215
  raw = [rng.gammavariate(a, 1.0) for a in alphas]
216
  total = sum(raw)
217
  weights = [w / total for w in raw]
218
- # Clamp and re-normalize to avoid degenerate weights
219
- weights = [max(0.02, min(0.80, w)) for w in weights]
 
 
220
  total = sum(weights)
221
  weights = [w / total for w in weights]
222
 
@@ -528,6 +530,8 @@ class RhythmEnvironment(Environment):
528
  self._state.active_event = active_event
529
 
530
  # --- 15. Append completed step to rolling history ---
 
 
531
  self._step_history.append({
532
  "step": current_step,
533
  "action": action_name,
@@ -537,6 +541,11 @@ class RhythmEnvironment(Environment):
537
  "progress_delta": round(deltas["progress"], 4),
538
  "serenity_delta": round(deltas["serenity"], 4),
539
  "connection_delta": round(deltas["connection"], 4),
 
 
 
 
 
540
  })
541
  if len(self._step_history) > HISTORY_LENGTH:
542
  self._step_history.pop(0)
@@ -702,20 +711,18 @@ class RhythmEnvironment(Environment):
702
  self._vitality = max(0.0, self._vitality - vd)
703
 
704
  def _compute_reward(self, deltas: Dict[str, float]) -> float:
705
- """Compute reward as hidden-weighted sum + grader-aligned bias.
706
-
707
- Iter 3 fix: Add a profile-INDEPENDENT bias term for progress and
708
- connection. The original profile-weighted reward drives belief inference
709
- (varies by profile), but allowed agents to game it by spamming recovery
710
- actions if the sampled profile didn't weight progress/connection. The
711
- bias makes the per-step reward correlate with the FINAL grader (which
712
- weights progress 0.25 and connection 0.15).
 
713
  """
714
  weights = self._profile["reward_weights"]
715
- profile_reward = sum(deltas[m] * weights[m] for m in METERS) * REWARD_SCALE
716
- # Grader-aligned bias: scaled so max bonus is ~0.1/step (manageable vs profile_reward)
717
- grader_bias = 0.5 * deltas["progress"] + 0.4 * deltas["connection"]
718
- return profile_reward + grader_bias
719
 
720
  def _grade_episode(self) -> float:
721
  """
@@ -800,6 +807,11 @@ class RhythmEnvironment(Environment):
800
  progress_delta=h["progress_delta"],
801
  serenity_delta=h["serenity_delta"],
802
  connection_delta=h["connection_delta"],
 
 
 
 
 
803
  )
804
  for h in self._step_history
805
  ]
 
215
  raw = [rng.gammavariate(a, 1.0) for a in alphas]
216
  total = sum(raw)
217
  weights = [w / total for w in raw]
218
+ # Iter 4 fix: tighter clamp (0.45 max) forces every profile to weight 3+
219
+ # meters meaningfully. Old 0.80 cap allowed single-meter dominant profiles
220
+ # where SLEEP-spam was correctly the optimal play (env wasn't lying).
221
+ weights = [max(0.05, min(0.45, w)) for w in weights]
222
  total = sum(weights)
223
  weights = [w / total for w in weights]
224
 
 
530
  self._state.active_event = active_event
531
 
532
  # --- 15. Append completed step to rolling history ---
533
+ # Iter 4 fix: include anomalies (was computed but only stashed in
534
+ # reward_breakdown which the prompt builder never read)
535
  self._step_history.append({
536
  "step": current_step,
537
  "action": action_name,
 
541
  "progress_delta": round(deltas["progress"], 4),
542
  "serenity_delta": round(deltas["serenity"], 4),
543
  "connection_delta": round(deltas["connection"], 4),
544
+ "vitality_anomaly": round(deltas["vitality"] - expected_no_profile["vitality"], 4),
545
+ "cognition_anomaly": round(deltas["cognition"] - expected_no_profile["cognition"], 4),
546
+ "progress_anomaly": round(deltas["progress"] - expected_no_profile["progress"], 4),
547
+ "serenity_anomaly": round(deltas["serenity"] - expected_no_profile["serenity"], 4),
548
+ "connection_anomaly": round(deltas["connection"] - expected_no_profile["connection"], 4),
549
  })
550
  if len(self._step_history) > HISTORY_LENGTH:
551
  self._step_history.pop(0)
 
711
  self._vitality = max(0.0, self._vitality - vd)
712
 
713
  def _compute_reward(self, deltas: Dict[str, float]) -> float:
714
+ """Compute pure profile-weighted per-step reward.
715
+
716
+ Iter 4 fix: REMOVED the grader_bias term from here (moved to the
717
+ TRAINING reward function in reward_functions.py). Keeping the env's
718
+ per-step reward pure means:
719
+ - Inference signal (which depends on profile_weights) is uncontaminated
720
+ - Grader's adaptation_score isn't computed on biased rewards (no
721
+ self-cancelling alignment)
722
+ - The env's reward semantics match what an honest deployment would see
723
  """
724
  weights = self._profile["reward_weights"]
725
+ return sum(deltas[m] * weights[m] for m in METERS) * REWARD_SCALE
 
 
 
726
 
727
  def _grade_episode(self) -> float:
728
  """
 
807
  progress_delta=h["progress_delta"],
808
  serenity_delta=h["serenity_delta"],
809
  connection_delta=h["connection_delta"],
810
+ vitality_anomaly=h.get("vitality_anomaly", 0.0),
811
+ cognition_anomaly=h.get("cognition_anomaly", 0.0),
812
+ progress_anomaly=h.get("progress_anomaly", 0.0),
813
+ serenity_anomaly=h.get("serenity_anomaly", 0.0),
814
+ connection_anomaly=h.get("connection_anomaly", 0.0),
815
  )
816
  for h in self._step_history
817
  ]
training/dataset.py CHANGED
@@ -72,14 +72,27 @@ def format_observation_prompt(obs, profile_hint: dict | None = None) -> str:
72
 
73
  history_lines = []
74
  for h in (obs.step_history or [])[-5:]: # last 5 only to fit prompt budget
 
 
 
 
 
 
 
 
 
75
  history_lines.append(
76
  f" step {h.step}: {h.action} -> reward {h.reward:+.2f} "
77
  f"(V{h.vitality_delta:+.2f} C{h.cognition_delta:+.2f} "
78
  f"P{h.progress_delta:+.2f} S{h.serenity_delta:+.2f} Cn{h.connection_delta:+.2f})"
 
79
  )
80
  history_str = ""
81
  if history_lines:
82
- history_str = "\n\nRecent history:\n" + "\n".join(history_lines)
 
 
 
83
 
84
  hint_str = ""
85
  if profile_hint is not None:
 
72
 
73
  history_lines = []
74
  for h in (obs.step_history or [])[-5:]: # last 5 only to fit prompt budget
75
+ # Iter 4 fix: include ANOMALIES (actual_delta - expected_under_neutral_profile).
76
+ # Anomalies are the cleanest profile-inference signal: they show how
77
+ # this person's response DEVIATES from average. Previously the env
78
+ # computed these but never exposed them to the agent.
79
+ anom_str = (
80
+ f" [anom V{h.vitality_anomaly:+.2f} C{h.cognition_anomaly:+.2f} "
81
+ f"P{h.progress_anomaly:+.2f} S{h.serenity_anomaly:+.2f} "
82
+ f"Cn{h.connection_anomaly:+.2f}]"
83
+ )
84
  history_lines.append(
85
  f" step {h.step}: {h.action} -> reward {h.reward:+.2f} "
86
  f"(V{h.vitality_delta:+.2f} C{h.cognition_delta:+.2f} "
87
  f"P{h.progress_delta:+.2f} S{h.serenity_delta:+.2f} Cn{h.connection_delta:+.2f})"
88
+ f"{anom_str}"
89
  )
90
  history_str = ""
91
  if history_lines:
92
+ history_str = (
93
+ "\n\nRecent history (anom = how this person deviated from neutral baseline):\n"
94
+ + "\n".join(history_lines)
95
+ )
96
 
97
  hint_str = ""
98
  if profile_hint is not None:
training/reward_functions.py CHANGED
@@ -238,31 +238,66 @@ def env_reward(
238
 
239
  try:
240
  env = _replay_env(ep_seed, ep_history, ep_mode)
 
 
 
241
  obs = env.step(RhythmAction(action_type=action_type))
242
  reward = obs.reward
243
  chosen = action_type.value
244
 
245
- # Iter 2 fix: explicit 3-in-a-row repetition penalty
 
 
 
 
 
 
 
 
 
246
  if ep_history and len(ep_history) >= 2:
247
  recent3 = ep_history[-3:]
248
- if recent3.count(chosen) >= 2: # this action would make 3+ in a row
249
- reward -= 0.3
250
 
251
- # Iter 3 fix: N-CYCLE penalty (catches the M-E-M-E-... loop iter 2 fell into)
252
- # If last 6 actions (including this one) have <=2 unique values, apply penalty
253
  if ep_history and len(ep_history) >= 5:
254
  last6 = ep_history[-5:] + [chosen]
255
  if len(set(last6)) <= 2:
256
- reward -= 0.4
257
 
258
- # Iter 3 fix: NEW-ACTION exploration bonus
259
- # If this action hasn't appeared yet in the current episode, +0.2.
260
- # Strong incentive in early steps to TRY varied actions, fading as
261
- # the action set grows. Stops once 6+ different actions tried.
262
  if ep_history is not None:
263
  seen = set(ep_history)
264
  if chosen not in seen and len(seen) < 6:
265
- reward += 0.2
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
266
 
267
  scores.append(reward)
268
  except Exception:
@@ -283,14 +318,19 @@ def belief_accuracy(
283
  """
284
  Layer 4: Belief-vector accuracy reward (META-LEARNING signal).
285
 
286
- Compares the agent's emitted [social, morning, work] belief vector to the
287
- hidden profile's true belief vector. Reward in [-0.5, +0.5]:
288
- perfect match → +0.5
289
- neutral [0.5,0.5,0.5] 0.0 (zero-effort baseline)
290
- max wrong → -0.5
 
 
 
 
 
291
 
292
- Mean-absolute-error based (cleaner than cosine for [0,1] vectors).
293
- Skipped (returns 0) if no seed available keeps reward conservative.
294
  """
295
  scores = []
296
  for i, completion in enumerate(completions):
@@ -298,24 +338,35 @@ def belief_accuracy(
298
  _, belief, belief_provided = extract_action_and_belief(response)
299
 
300
  if not belief_provided:
301
- scores.append(-0.2) # weak push toward emitting belief
302
  continue
303
 
304
- # Resolve seed/mode for replay
305
  if seed is not None and i < len(seed):
306
  ep_seed = seed[i]
307
  ep_history = action_history[i] if action_history is not None else []
308
  ep_mode = profile_mode[i] if (profile_mode is not None and i < len(profile_mode)) else "continuous"
 
309
  else:
310
  scores.append(0.0)
311
  continue
312
 
 
 
 
 
 
 
313
  try:
314
  env = _replay_env(ep_seed, ep_history, ep_mode)
315
  true_belief = env.get_belief_target()
 
316
  mae = sum(abs(b - t) for b, t in zip(belief, true_belief)) / 3.0
317
- similarity = 1.0 - mae # in [0, 1]
318
- scores.append(similarity - 0.5) # in [-0.5, +0.5]
 
 
 
319
  except Exception:
320
  scores.append(0.0)
321
 
 
238
 
239
  try:
240
  env = _replay_env(ep_seed, ep_history, ep_mode)
241
+ # Capture pre-step meters so we can compute deltas for the bias
242
+ pre_progress = env._progress
243
+ pre_connection = env._connection
244
  obs = env.step(RhythmAction(action_type=action_type))
245
  reward = obs.reward
246
  chosen = action_type.value
247
 
248
+ # Iter 4 fix (Issue 11): grader-aligned bias moved here from env.
249
+ # Per-step env reward stays pure (drives belief inference); the
250
+ # bias only shapes the GRPO-visible training reward.
251
+ progress_delta = env._progress - pre_progress
252
+ connection_delta = env._connection - pre_connection
253
+ reward += 0.5 * progress_delta + 0.4 * connection_delta
254
+
255
+ # Iter 4 fix (Issue 3): SCALED-DOWN shaping. Iter 3 had
256
+ # -0.3/-0.4/+0.2 which dominated the ±0.5 env signal. Now
257
+ # roughly 1/3 of original magnitudes — nudges, not overrides.
258
  if ep_history and len(ep_history) >= 2:
259
  recent3 = ep_history[-3:]
260
+ if recent3.count(chosen) >= 2:
261
+ reward -= 0.10 # was -0.3
262
 
 
 
263
  if ep_history and len(ep_history) >= 5:
264
  last6 = ep_history[-5:] + [chosen]
265
  if len(set(last6)) <= 2:
266
+ reward -= 0.15 # was -0.4
267
 
 
 
 
 
268
  if ep_history is not None:
269
  seen = set(ep_history)
270
  if chosen not in seen and len(seen) < 6:
271
+ reward += 0.07 # was +0.2
272
+
273
+ # Iter 4 fix (Issue 10): BELIEF-ACTION COUPLING reward.
274
+ # Parse the agent's emitted belief and reward consistency between
275
+ # belief and action choice. Without this, the belief-first format
276
+ # only enforces consistency via causal attention (weak); now there's
277
+ # an explicit gradient signal.
278
+ _, b, b_provided = extract_action_and_belief(response)
279
+ if b_provided:
280
+ s_pref, m_pref, w_pref = b
281
+ # High social → social actions; low social → solo actions
282
+ if s_pref > 0.65 and chosen in {"socialize", "family_time"}:
283
+ reward += 0.15
284
+ elif s_pref < 0.35 and chosen in {"meditate", "me_time"}:
285
+ reward += 0.15
286
+ elif s_pref > 0.65 and chosen in {"meditate", "me_time"}:
287
+ reward -= 0.10 # belief says extrovert, action says solo
288
+ elif s_pref < 0.35 and chosen in {"socialize", "family_time"}:
289
+ reward -= 0.10 # belief says introvert, action says social
290
+
291
+ # High morning + morning slot + work → bonus
292
+ slot = obs.slot if hasattr(obs, "slot") else 0
293
+ if m_pref > 0.65 and slot == 0 and chosen in {"deep_work", "learn"}:
294
+ reward += 0.15
295
+ elif m_pref < 0.35 and slot in (2, 3) and chosen in {"deep_work", "learn"}:
296
+ reward += 0.15
297
+
298
+ # High work → work actions
299
+ if w_pref > 0.65 and chosen in {"deep_work", "learn", "admin_work"}:
300
+ reward += 0.15
301
 
302
  scores.append(reward)
303
  except Exception:
 
318
  """
319
  Layer 4: Belief-vector accuracy reward (META-LEARNING signal).
320
 
321
+ ITER 4 FIX (Issue 4 from external review): The previous formula
322
+ `(1 - mae) - 0.5` gave constant emission "5 5 5" a free +0.336 reward
323
+ per step (× 3.0 weight = +1.0 per step = +28 per episode for ZERO learning).
324
+ This recreated the iter-1 collapse mechanism in disguise.
325
+
326
+ New formula: subtract the per-profile baseline. The baseline is what a
327
+ constant 0.5 emission WOULD score for THIS profile. Now:
328
+ - Constant emission → reward ≈ 0 (no free reward)
329
+ - Better-than-baseline belief → positive
330
+ - Worse-than-baseline belief → negative
331
 
332
+ Plus iter 4 (Issue 9): no belief reward at step 0 (no information available
333
+ to commit a belief) prevents pulling the policy toward a constant prior.
334
  """
335
  scores = []
336
  for i, completion in enumerate(completions):
 
338
  _, belief, belief_provided = extract_action_and_belief(response)
339
 
340
  if not belief_provided:
341
+ scores.append(-0.1) # weak push toward emitting belief
342
  continue
343
 
344
+ # Resolve seed/mode/step for replay
345
  if seed is not None and i < len(seed):
346
  ep_seed = seed[i]
347
  ep_history = action_history[i] if action_history is not None else []
348
  ep_mode = profile_mode[i] if (profile_mode is not None and i < len(profile_mode)) else "continuous"
349
+ ep_step = step_index[i] if (step_index is not None and i < len(step_index)) else 0
350
  else:
351
  scores.append(0.0)
352
  continue
353
 
354
+ # Iter 4 fix (Issue 9): step-0 commitment with no info biases toward
355
+ # constant prior. Skip belief reward at step 0.
356
+ if ep_step == 0:
357
+ scores.append(0.0)
358
+ continue
359
+
360
  try:
361
  env = _replay_env(ep_seed, ep_history, ep_mode)
362
  true_belief = env.get_belief_target()
363
+ # Iter 4 fix (Issue 4): subtract the constant-baseline reward
364
  mae = sum(abs(b - t) for b, t in zip(belief, true_belief)) / 3.0
365
+ similarity = 1.0 - mae
366
+ baseline_mae = sum(abs(0.5 - t) for t in true_belief) / 3.0
367
+ baseline_similarity = 1.0 - baseline_mae
368
+ # Reward = how much better than the constant-emit baseline
369
+ scores.append(similarity - baseline_similarity)
370
  except Exception:
371
  scores.append(0.0)
372