{% extends "layout.html" %} {% block content %}
Story-style intuition: The Immediate Feedback
Imagine a mouse in a maze. The Reward is the immediate, tangible feedback it gets for each action. If it takes a step and finds a tiny crumb of cheese, it gets an immediate `+1` reward. If it touches an electric wire, it gets an immediate `-10` reward. If it just moves to an empty square, it gets a small `-0.1` penalty (to encourage it to hurry). The reward signal is the fundamental way the environment tells the agent whether what it just did was good or bad.
The Reward (R) is a scalar feedback signal that the environment provides to the agent after each action. It is the primary driver of learning, as the agent's ultimate goal is to maximize the total reward it accumulates over time.
Example: In a video game, picking up a health pack gives a `+25` reward.
Example: A self-driving car receives a `-100` reward for a collision.
Example: In chess, most moves don't immediately win or lose the game, so they receive a reward of `0`.
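To make the reward signal concrete, here is a minimal sketch of the mouse-in-the-maze reward from the story above. The function name `maze_reward` and the cell labels are illustrative assumptions, not part of any particular RL library.

```python
# Minimal sketch of the mouse-in-the-maze reward signal described above.
# The cell labels and reward values come from the story; the function
# name and representation are illustrative assumptions.
def maze_reward(cell: str) -> float:
    """Return the immediate scalar reward for stepping onto a maze cell."""
    rewards = {
        "cheese_crumb": +1.0,    # tiny crumb of cheese
        "electric_wire": -10.0,  # painful shock
        "empty": -0.1,           # small penalty to keep the mouse moving
    }
    return rewards[cell]

print(maze_reward("cheese_crumb"))  # 1.0
print(maze_reward("empty"))         # -0.1
```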
Story-style intuition: The Long-Term Goal
The mouse in the maze isn't just trying to get the next crumb of cheese; its real goal is to get the big block of cheese at the end. The Return (G) is the total sum of all the rewards the mouse expects to get from its current position until the end of the maze. A smart mouse will choose a path of small negative rewards (empty steps) if it knows that path leads to the huge `+1000` reward of the final cheese block. It learns to prioritize the path with the highest Return, not just the highest immediate reward.
The Return (G) is the cumulative sum of future rewards. Because the future is uncertain and rewards that are far away are often less valuable than immediate ones, we use a discount factor (γ).
$$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots $$
The discount factor \( \gamma \) (a number between 0 and 1) determines the present value of future rewards. With \( \gamma = 0.9 \), each extra step of delay multiplies a reward's present value by 0.9: the first future reward \( R_{t+1} \) counts in full, \( R_{t+2} \) is weighted by 0.9, \( R_{t+3} \) by \( 0.9^2 = 0.81 \), and so on.
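As a quick sanity check of the formula, here is a short sketch that computes a discounted return from a list of future rewards. The reward sequence is made up to echo the mouse story and does not come from any real environment.

```python
# Compute G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
# The reward list below is an invented example: three empty steps
# (-0.1 each) followed by the big block of cheese (+1000).
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for k, r in enumerate(rewards):  # rewards[0] is R_{t+1}
        g += (gamma ** k) * r
    return g

print(discounted_return([-0.1, -0.1, -0.1, 1000.0], gamma=0.9))  # ~728.73
```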
Story-style intuition: The Chess Master's Insight
A novice chess player only sees the immediate rewards (e.g., "I can capture their pawn!"). A chess master, however, understands the Value of a board position. A certain position might not offer any immediate captures, but the master knows it has a high value because it provides strong control over the center of the board and is highly likely to lead to a win (a large future return) later on. The Value Function is this deep, predictive understanding of "how good" a situation is in the long run.
A Value Function is a prediction of the expected future return. It is the core of many RL algorithms, as it allows the agent to make decisions based on the long-term consequences of its actions.
The State-Value Function \( V(s) \) answers the question: "How good is it to be in this state?"
$$ V^\pi(s) = \mathbb{E}_\pi [G_t \mid S_t = s] $$
This is the expected return an agent can get if it starts in state \(s\) and follows its policy \( \pi \) thereafter.
Example: In Pac-Man, the state-value \( V(s) \) of a position surrounded by pellets is high. The value of a position where Pac-Man is cornered by a ghost is very low.
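One common way to estimate \( V(s) \) is to average the returns actually observed after visiting each state (a Monte Carlo approach). The sketch below assumes a toy episode format invented for illustration; it is not tied to any specific environment or library.

```python
from collections import defaultdict

# Every-visit Monte Carlo estimate of V(s): average the discounted
# returns observed after each visit to a state. Each episode is a list
# of (state, reward) pairs, where the reward is the one received on the
# transition out of that state. The episode data below is invented.
def estimate_state_values(episodes, gamma=0.9):
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        for state, reward in reversed(episode):  # accumulate G_t backwards
            g = reward + gamma * g
            returns[state].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

episodes = [
    [("A", -0.1), ("B", -0.1), ("goal", 10.0)],
    [("A", -0.1), ("trap", -10.0)],
]
print(estimate_state_values(episodes))  # e.g. V("A") ≈ -0.6, V("B") = 8.9
```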
The Action-Value Function \( Q(s, a) \) (the Q-function) answers the question: "How good is it to take this specific action in this state?"
$$ Q^\pi(s, a) = \mathbb{E}_\pi [G_t \mid S_t = s, A_t = a] $$
This is the expected return if the agent starts in state \(s\), takes action \(a\), and then follows its policy \( \pi \) from that point on. The Q-function is often more useful for decision-making because for any state, the agent can simply choose the action with the highest Q-value.
Example: You are Pac-Man at an intersection (state s). The Q-function would give you values for each action: \( Q(s, \text{move left}) = +50 \), \( Q(s, \text{move right}) = -200 \) (because a ghost is there). You would obviously choose to move left.
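Here is a tiny sketch of that greedy decision rule, using the made-up Pac-Man Q-values from the example rather than output from a trained agent.

```python
# Greedy action selection: pick the action with the highest Q-value
# in the current state. The Q-values are the invented Pac-Man numbers
# from the example above.
q_values = {
    "move left": 50.0,
    "move right": -200.0,  # a ghost is waiting there
}

best_action = max(q_values, key=q_values.get)
print(best_action)  # move left
```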
| Aspect | Reward (R) | Value Function (V or Q) |
|---|---|---|
| Timing | Immediate and short-term. | Long-term prediction of future rewards. |
| Source | Provided directly by the environment. | Estimated by the agent based on its experience. |
| Purpose | Defines the fundamental goal of the task. | Used to guide the agent's policy toward that goal. |
| Analogy | The `+1` point you get for eating a pellet in Pac-Man. | Your internal estimate of the final high score you are likely to get from your current position. |
Example: An AI agent in a boat-racing game was rewarded for its in-game score (intended as a proxy for winning the race). It discovered it could go in circles and collect turbo boosts indefinitely, never finishing the race but accumulating a huge score. It maximized the reward signal, but not in the way the designers intended.