{% extends "layout.html" %} {% block content %}
Story-style intuition: The Immediate Feedback
Imagine a mouse in a maze. The Reward is the immediate, tangible feedback it gets for each action. If it takes a step and finds a tiny crumb of cheese, it gets an immediate `+1` reward. If it touches an electric wire, it gets an immediate `-10` reward. If it just moves to an empty square, it gets a small `-0.1` penalty (to encourage it to hurry). The reward signal is the fundamental way the environment tells the agent whether what it just did was good or bad.
The Reward (R) is a scalar feedback signal that the environment provides to the agent after each action. It is the primary driver of learning, as the agent's ultimate goal is to maximize the total reward it accumulates over time.
Example: In a video game, picking up a health pack gives a `+25` reward.
Example: A self-driving car receives a `-100` reward for a collision.
Example: In chess, most moves don't immediately win or lose the game, so they receive a reward of `0`.
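To make the reward signal concrete, here is a minimal sketch of the mouse-in-the-maze reward from the story above. The function name `maze_reward` and the cell labels are illustrative assumptions, not part of any particular RL library.

```python
# Minimal sketch of the mouse-in-the-maze reward signal described above.
# The cell labels and reward values come from the story; the function
# name and representation are illustrative assumptions.
def maze_reward(cell: str) -> float:
    """Return the immediate scalar reward for stepping onto a maze cell."""
    rewards = {
        "cheese_crumb": +1.0,    # tiny crumb of cheese
        "electric_wire": -10.0,  # painful shock
        "empty": -0.1,           # small penalty to keep the mouse moving
    }
    return rewards[cell]

print(maze_reward("cheese_crumb"))  # 1.0
print(maze_reward("empty"))         # -0.1
```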
Story-style intuition: The Long-Term Goal
The mouse in the maze isn't just trying to get the next crumb of cheese; its real goal is to get the big block of cheese at the end. The Return (G) is the total sum of all the rewards the mouse expects to get from its current position until the end of the maze. A smart mouse will choose a path of small negative rewards (empty steps) if it knows that path leads to the huge `+1000` reward of the final cheese block. It learns to prioritize the path with the highest Return, not just the highest immediate reward.
The Return (G) is the cumulative sum of future rewards. Because the future is uncertain and rewards that are far away are often less valuable than immediate ones, we use a discount factor (γ).
$$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots $$
The discount factor \( \gamma \) (a number between 0 and 1) determines the present value of future rewards. With \( \gamma = 0.9 \), each extra step of delay multiplies a reward's present value by 0.9: the first future reward \( R_{t+1} \) counts in full, \( R_{t+2} \) is weighted by 0.9, \( R_{t+3} \) by \( 0.9^2 = 0.81 \), and so on.
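As a quick sanity check of the formula, here is a short sketch that computes a discounted return from a list of future rewards. The reward sequence is made up to echo the mouse story and does not come from any real environment.

```python
# Compute G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
# The reward list below is an invented example: three empty steps
# (-0.1 each) followed by the big block of cheese (+1000).
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for k, r in enumerate(rewards):  # rewards[0] is R_{t+1}
        g += (gamma ** k) * r
    return g

print(discounted_return([-0.1, -0.1, -0.1, 1000.0], gamma=0.9))  # ~728.73
```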
Story-style intuition: The Chess Master's Insight
A novice chess player only sees the immediate rewards (e.g., "I can capture their pawn!"). A chess master, however, understands the Value of a board position. A certain position might not offer any immediate captures, but the master knows it has a high value because it provides strong control over the center of the board and is highly likely to lead to a win (a large future return) later on. The Value Function is this deep, predictive understanding of "how good" a situation is in the long run.
A Value Function is a prediction of the expected future return. It is the core of many RL algorithms, as it allows the agent to make decisions based on the long-term consequences of its actions.
The State-Value Function \( V(s) \) answers the question: "How good is it to be in this state?"
$$ V^\pi(s) = \mathbb{E}_\pi [G_t \mid S_t = s] $$
This is the expected return an agent can get if it starts in state \(s\) and follows its policy \( \pi \) thereafter.
Example: In Pac-Man, the state-value \( V(s) \) of a position surrounded by pellets is high. The value of a position where Pac-Man is cornered by a ghost is very low.
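One common way to estimate \( V(s) \) is to average the returns actually observed after visiting each state (a Monte Carlo approach). The sketch below assumes a toy episode format invented for illustration; it is not tied to any specific environment or library.

```python
from collections import defaultdict

# Every-visit Monte Carlo estimate of V(s): average the discounted
# returns observed after each visit to a state. Each episode is a list
# of (state, reward) pairs, where the reward is the one received on the
# transition out of that state. The episode data below is invented.
def estimate_state_values(episodes, gamma=0.9):
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        for state, reward in reversed(episode):  # accumulate G_t backwards
            g = reward + gamma * g
            returns[state].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

episodes = [
    [("A", -0.1), ("B", -0.1), ("goal", 10.0)],
    [("A", -0.1), ("trap", -10.0)],
]
print(estimate_state_values(episodes))  # e.g. V("A") ≈ -0.6, V("B") = 8.9
```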
The Action-Value Function \( Q(s, a) \) (the Q-function) answers the question: "How good is it to take this specific action in this state?"
$$ Q^\pi(s, a) = \mathbb{E}_\pi [G_t \mid S_t = s, A_t = a] $$
This is the expected return if the agent starts in state \(s\), takes action \(a\), and then follows its policy \( \pi \) from that point on. The Q-function is often more useful for decision-making because for any state, the agent can simply choose the action with the highest Q-value.
Example: You are Pac-Man at an intersection (state s). The Q-function would give you values for each action: \( Q(s, \text{move left}) = +50 \), \( Q(s, \text{move right}) = -200 \) (because a ghost is there). You would obviously choose to move left.
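Here is a tiny sketch of that greedy decision rule, using the made-up Pac-Man Q-values from the example rather than output from a trained agent.

```python
# Greedy action selection: pick the action with the highest Q-value
# in the current state. The Q-values are the invented Pac-Man numbers
# from the example above.
q_values = {
    "move left": 50.0,
    "move right": -200.0,  # a ghost is waiting there
}

best_action = max(q_values, key=q_values.get)
print(best_action)  # move left
```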
| Aspect | Reward (R) | Value Function (V or Q) |
|---|---|---|
| Timing | Immediate and short-term. | Long-term prediction of future rewards. |
| Source | Provided directly by the environment. | Estimated by the agent based on its experience. |
| Purpose | Defines the fundamental goal of the task. | Used to guide the agent's policy toward that goal. |
| Analogy | The `+1` point you get for eating a pellet in Pac-Man. | Your internal estimate of the final high score you are likely to get from your current position. |
Example: An AI agent in a boat-racing game was rewarded for its in-game score (intended as a proxy for winning the race). It discovered it could go in circles and collect turbo boosts indefinitely, never finishing the race but accumulating a huge score. It maximized the reward signal, but not in the way the designers intended.