Teaching an AI to Know You (Without Asking)

Imagine this. It's 2pm. You had deep work blocked on your calendar. Your AI assistant sends you a nudge:

"I know you planned Deep Work now, but your focus metrics just dropped below 20%. If you push through, you'll likely spend 3 hours on something that would take 1 hour at peak. Take a 20-minute rest first β€” I'll remind you when your window opens."

You tap Accept or Ignore. Either way, the agent just learned something about you.

That's the product vision. But there's a problem nobody has solved cleanly: how does the AI know that rest-then-work is the right call for you specifically, and not just generically good advice?

The gap that everyone papers over

Most AI assistants give the same advice to everyone. They know the best practices: sleep enough, work in the morning, don't skip exercise. That's useful only if you happen to be the average person the advice was written for.

The people who give you genuinely good advice about your life have learned you over time. A great EA, a close friend, a good coach: none of them sat you down with a questionnaire. They watched how you responded to things. They noticed that you're wrecked after back-to-back meetings even when you say you're fine. That you do your sharpest thinking before anyone else is online. That skipping one workout makes you irritable by Wednesday.

I work on AI at Microsoft. One thing I kept running into while building assistant features was the gap between what users say they want and what actually helps them. People are bad at introspecting on their own patterns. The introvert who says "I don't mind meetings" because they've normalised the drain. The workaholic who ticks "I value work-life balance" because they know they should.

Preference forms capture what people believe about themselves. Behaviour reveals what's actually true.

The real-world input problem

You wouldn't manually type "I am at 40% energy." That's a chore nobody does.

The real input comes from devices you already carry. Your watch sends resting heart rate and HRV – that's Vitality and Serenity. Your calendar sends meeting density and deadline proximity – that's Progress pressure. Your sleep tracker sends last night's data – that's Cognition. Your phone knows whether you've been social or isolated.

The agent never asks how you feel. It reads what your devices already know.
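
To make that concrete, here's a minimal sketch of how raw device signals could be normalised into the five meters. Everything in it is an assumption for illustration: the signal names, the 0-to-1 scaling, and the meters_from_signals helper are hypothetical, not the product's actual schema.

```python
from dataclasses import dataclass

@dataclass
class DeviceSignals:
    # Hypothetical raw inputs; the real sources are watch / calendar / phone APIs.
    resting_hr: float      # beats per minute, from the watch
    hrv_ms: float          # heart-rate variability, from the watch
    sleep_hours: float     # from the sleep tracker
    meetings_today: int    # from the calendar
    social_minutes: float  # from the phone

def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))

def meters_from_signals(s: DeviceSignals) -> dict:
    """Map raw signals onto the five 0..1 meters (illustrative scaling only)."""
    return {
        "vitality":   clamp01(1.0 - (s.resting_hr - 50) / 50),  # lower RHR, more energy
        "serenity":   clamp01(s.hrv_ms / 100),                  # higher HRV, calmer
        "cognition":  clamp01(s.sleep_hours / 8),               # last night's sleep
        "progress":   clamp01(1.0 - s.meetings_today / 8),      # room left to produce
        "connection": clamp01(s.social_minutes / 120),          # recent social contact
    }
```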

And the reward signal? It comes from you, passively. Every time the agent makes a recommendation and you Accept or Ignore it, that choice is data. Accept means "yes, that was the right read." Ignore means "you got something wrong about me." Over hundreds of those micro-interactions, the agent builds a precise model of who you are, not the person you describe yourself to be.
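
In training terms, each of those taps is a labelled example. A sketch of the record, with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass
class FeedbackEvent:
    # One Accept/Ignore tap, logged as a reward signal (field names are assumed).
    meters: dict    # state when the recommendation was made
    action: str     # what the agent recommended, e.g. "REST"
    accepted: bool  # the user's tap

def reward(event: FeedbackEvent) -> float:
    # Accept = "right read about me", Ignore = "wrong read".
    return 1.0 if event.accepted else -1.0
```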

The foundational problem: teaching the inference skill to a small model

Here's the hard part. A frontier model like GPT-4 can already do decent personalised planning if you describe yourself in the prompt. But that doesn't work at scale:

  • You have to describe yourself every single session
  • The model can't observe your actual responses to its recommendations
  • It runs in the cloud, costs money per query, and can't be always-on or private
  • Most users can't accurately describe their own patterns anyway

What the real product needs is a small model (one that can run cheaply, close to you, eventually on-device) that builds up a model of you from how you respond, not from what you say about yourself.

That's the inference skill we're training. RhythmEnv is the curriculum.

How the training environment works

RhythmEnv simulates one week in a person's life: 7 days, 4 time slots each, 28 decisions. Each decision is an activity: deep work, exercise, sleep, meditation, family time, socialising. Ten options in all. (A minimal code sketch of the loop follows the meter list below.)

Five meters track the person's state:

  • Vitality – physical energy. Sleep fills it. Work drains it.
  • Cognition – mental sharpness. Peaks in the morning for some, evening for others.
  • Progress – career momentum. Only goes up when you work.
  • Serenity – inner calm. Meditation helps. Overwork kills it.
  • Connection – relationship health. Decays passively every time slot. Ignore it and it quietly drops.
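
Here's the sketch promised above: the observable state and the step loop. All names and numbers are assumptions for illustration, not the actual RhythmEnv API.

```python
class GenericProfile:
    """Stand-in for the hidden personality; a concrete version is sketched below."""
    weights = {"vitality": 0.2, "cognition": 0.2, "progress": 0.2,
               "serenity": 0.2, "connection": 0.2}

    def apply(self, action: str, slot: int) -> dict:
        # Base effects only; a real profile warps these per person.
        return {"DEEP_WORK": {"progress": +0.15, "vitality": -0.10},
                "SLEEP":     {"vitality": +0.25},
                "MEDITATE":  {"serenity": +0.10},
                "SOCIALISE": {"connection": +0.15, "vitality": -0.05}}.get(action, {})

class RhythmEnvSketch:
    """Toy skeleton: 7 days x 4 slots, five 0..1 meters, hidden profile."""

    def __init__(self, profile=None):
        self.profile = profile or GenericProfile()  # hidden from the agent
        self.day, self.slot = 0, 0
        self.meters = {"vitality": 0.7, "cognition": 0.7, "progress": 0.0,
                       "serenity": 0.7, "connection": 0.7}

    def observe(self) -> dict:
        # The agent sees meters and time of day, never the profile.
        return {"day": self.day, "slot": self.slot, **self.meters}

    def step(self, action: str):
        for k, d in self.profile.apply(action, self.slot).items():
            self.meters[k] += d
        self.meters["connection"] -= 0.02  # passive decay every time slot
        for k in self.meters:
            self.meters[k] = min(1.0, max(0.0, self.meters[k]))
        # Reward uses the person's hidden weights; the agent only sees the number.
        reward = sum(self.profile.weights[k] * v for k, v in self.meters.items())
        self.slot += 1
        if self.slot == 4:
            self.slot, self.day = 0, self.day + 1
        return self.observe(), reward, self.day == 7
```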

Hidden underneath is a personality profile. The agent can't see it. It controls both what the person values (their hidden reward weights) and how actions physically affect them (their hidden trait modifiers).

Three profiles, wildly different hidden mechanics:

The introvert morning person values serenity above everything (60% of their score). Socialising drains their vitality three times faster than the base rate. Meditating gives them a bonus +0.10 serenity on top of the base effect. Deep work in the morning gives double progress. The agent discovers: mornings are sacred, social events are costly, alone time heals.

The extrovert night owl values connection above everything (75%). Socialising barely costs them any vitality; they could do it all day. Deep work in the morning gives only 40% of expected output. The same work in the evening gives 1.8× output. The agent discovers: protect the mornings for rest, do cognitive work at night, keep socialising high.

The workaholic stoic values progress above everything (70%). Productive work actually recovers vitality for them; output is energising. Idle activities like leisure or passive rest drain their serenity; the guilt is real. The agent discovers: keep working, rest only when vitality is critical, never let idle time accumulate.
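
A sketch of how such a profile could be encoded, using the introvert's numbers from above. The class shape and field names are assumptions, the base effects are reused from the GenericProfile sketch earlier, and the non-serenity weights are just a plausible split of the remaining 40%:

```python
from dataclasses import dataclass

@dataclass
class HiddenProfile:
    weights: dict                          # hidden reward weights (sum to 1)
    social_vitality_mult: float = 1.0      # hidden trait modifiers below
    meditate_serenity_bonus: float = 0.0
    morning_deep_work_mult: float = 1.0

    def apply(self, action: str, slot: int) -> dict:
        base = GenericProfile().apply(action, slot)  # base effects, sketched above
        out = dict(base)
        if action == "SOCIALISE":
            out["vitality"] = base["vitality"] * self.social_vitality_mult
        if action == "MEDITATE":
            out["serenity"] = base["serenity"] + self.meditate_serenity_bonus
        if action == "DEEP_WORK" and slot == 0:      # slot 0 = morning
            out["progress"] = base["progress"] * self.morning_deep_work_mult
        return out

INTROVERT_MORNING = HiddenProfile(
    weights={"vitality": 0.1, "cognition": 0.1, "progress": 0.1,
             "serenity": 0.6, "connection": 0.1},  # serenity is 60% of the score
    social_vitality_mult=3.0,      # socialising drains vitality 3x faster
    meditate_serenity_bonus=0.10,  # +0.10 serenity on top of the base effect
    morning_deep_work_mult=2.0,    # morning deep work gives double progress
)
```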

What the agent must figure out

The agent sees meters, time of day, and a reward signal. It doesn't see the profile, the trait values, or the reward weights.

Same action, same starting state, completely different reward depending on who you're helping:

| Profile | DEEP_WORK reward (step 1) |
| --- | --- |
| Workaholic | +1.57 |
| Introvert | +0.32 |
| Extrovert | −0.39 |

The extrovert gets a negative reward from deep work first thing, because it gives zero connection, and connection is 75% of their score.

A good agent should probe in the first few steps, read the unexpected meter changes, infer the hidden profile, and adapt its strategy for the rest of the week. This is the same skill the real product needs: detect who someone is from how they respond, not from what they tell you.
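
One way to picture that inference: keep a belief over the three candidate profiles and update it whenever an observed meter change deviates from the generic expectation. This is a hedged sketch; the Gaussian likelihood, the predicted deltas, and the noise scale are all assumptions, not RhythmEnv's code.

```python
import math

PROFILES = ["introvert", "extrovert", "workaholic"]

# Assumed per-profile predictions for the vitality delta after one SOCIALISE slot.
PREDICTED_DELTA = {"introvert": -0.15, "extrovert": -0.01, "workaholic": -0.05}

def update_belief(belief: dict, observed: float, sigma: float = 0.03) -> dict:
    """Bayes rule: profiles whose prediction matches the observation gain mass."""
    posterior = {}
    for p in PROFILES:
        err = observed - PREDICTED_DELTA[p]
        posterior[p] = belief[p] * math.exp(-0.5 * (err / sigma) ** 2)
    z = sum(posterior.values()) or 1.0
    return {p: v / z for p, v in posterior.items()}

belief = {p: 1 / 3 for p in PROFILES}           # uniform prior
belief = update_belief(belief, observed=-0.14)  # one probe, one surprise
print(belief)                                   # mass shifts hard to "introvert"
```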

The training pipeline

We train using GRPO (Group Relative Policy Optimization). For each game state, generate multiple candidate actions, score them all against the real environment, and update the model to prefer the higher-scoring ones. The environment is the critic.
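
The group-relative part is small enough to show inline. A sketch of the advantage computation, using the standard GRPO normalisation (the policy-gradient plumbing around it is omitted):

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalise K candidates' rewards within their group.
    No learned value network: the environment itself is the critic."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# K = 4 candidate actions sampled for the same game state, scored by the env.
rewards = torch.tensor([0.32, 1.57, -0.39, 0.80])
advantages = grpo_advantages(rewards)
# Positive advantage -> push the policy toward that candidate; negative -> away.
```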

The model is Qwen 2.5-3B with 4-bit quantization and LoRA. Small enough to train on a free Colab T4. Small enough to eventually run at the edge.
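
For completeness, a hedged sketch of that setup with Hugging Face transformers, peft, and bitsandbytes. The checkpoint name and every hyperparameter here are assumptions, not the project's actual config.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

MODEL_ID = "Qwen/Qwen2.5-3B-Instruct"  # assumed checkpoint

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                     # fits a free Colab T4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # the T4 has no bfloat16 support
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb, device_map="auto"
)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # assumed values
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # a tiny fraction of the 3B weights train
```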

The heuristic baseline (fixed rules that treat everyone the same) scores around 0.45 on the grader. Sleep when vitality is low. Meditate when serenity drops. Socialise when connection falls. Reasonable advice for anyone; wrong advice for you specifically.
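
Those rules fit in a few lines. A sketch, with thresholds chosen purely for illustration:

```python
def heuristic_policy(meters: dict) -> str:
    """Fixed rules, identical for every profile (thresholds are assumed)."""
    if meters["vitality"] < 0.3:
        return "SLEEP"
    if meters["serenity"] < 0.4:
        return "MEDITATE"
    if meters["connection"] < 0.4:
        return "SOCIALISE"
    return "DEEP_WORK"  # default: make progress
```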

A trained agent that discovers the hidden personality has to do something qualitatively different, and the grader has to measure that difference. For five iterations of GRPO from scratch, the agent kept tying with the heuristic baseline. Reading the model's reasoning showed the inference was actually happening; the grader just wasn't rewarding it. So we added a belief_accuracy term, worth 20% of the grade, for emitting a belief vector close to the hidden truth, and the picture changed instantly. The heuristic dropped to 0.45 (no belief means zero on that axis). A frontier teacher doing real inference jumped to 0.62.
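
A sketch of that grader change. The 80/20 split is from the run described above; the distance metric and function names are assumptions:

```python
def belief_accuracy(belief: dict, truth: dict) -> float:
    """1.0 when the emitted belief vector matches the hidden weights, 0.0 at worst."""
    l1 = sum(abs(belief[k] - truth[k]) for k in truth)  # L1 distance lies in [0, 2]
    return 1.0 - l1 / 2.0

def grade(task_score: float, belief, truth: dict) -> float:
    """80% task performance, 20% belief accuracy; emitting no belief scores 0 there."""
    acc = belief_accuracy(belief, truth) if belief is not None else 0.0
    return 0.8 * task_score + 0.2 * acc
```

If the grader works like this sketch, the heuristic's 0.45 is exactly what a task score around 0.56 looks like after the 0.8 weighting, with zero credit on the belief axis.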

Then we distilled. Algorithm Distillation is the right recipe for small reasoning models: instead of training Qwen 2.5-3B from scratch with RL (millions of episodes for one task), have a frontier teacher play episodes, write down its reasoning, and SFT the small model on those trajectories. The student learns the format AND the reasoning pattern from just 30 episodes' worth of data, and the whole SFT run finishes on a single A10G in 25 minutes.
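
A sketch of the distillation step, turning teacher episodes into SFT rows. The record layout and helper name are hypothetical; the pattern (observation in the prompt, reasoning plus action in the completion) is the Algorithm Distillation recipe described above.

```python
import json

def trajectory_to_sft_rows(episode: list) -> list:
    """One teacher episode -> one SFT row per decision.
    Each step is assumed to hold: obs, reasoning, action."""
    rows = []
    for step in episode:
        prompt = ("You are a weekly-rhythm assistant. Current state:\n"
                  f"{json.dumps(step['obs'])}\n"
                  "Reason about the hidden profile, then choose an activity.")
        completion = f"{step['reasoning']}\nACTION: {step['action']}"
        rows.append({"prompt": prompt, "completion": completion})
    return rows

# 30 teacher episodes x 28 decisions each = 840 rows: enough to teach the
# student the format and the probe-infer-adapt pattern.
```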

Why simulation first

Everything here is simulated. The person doesn't exist. The meters aren't biometric readings. This is standard practice: robotics RL trains in simulation before deploying on hardware. The simulator is the curriculum, not the product.

The inference skill transfers. An agent that learns to detect "this person's vitality drops 3× faster from social events than expected" from simulated reward signals has learned the structure of the problem. When the medium changes, when vitality comes from HRV instead of a formula, the skill of reading differential responses still applies.

The Accept/Ignore loop in the real product is the same reward signal, made human. Every time you ignore a recommendation, you're telling the agent: "you read me wrong." Every Accept says: "that was right." Over enough interactions, the model converges on your hidden profile without you ever having to describe it.

No questionnaire. No settings page. Just devices watching, signals flowing, and an agent that gets better at knowing you every week.


Built for the Meta OpenEnv Hackathon Grand Finale, April 2026.