{% extends "layout.html" %} {% block content %}

💰 Study Guide: Reward & Value Function in Reinforcement Learning

🔹 1. Reward (R)

Story-style intuition: The Immediate Feedback

Imagine a mouse in a maze. The Reward is the immediate, tangible feedback it gets for its actions. If it takes a step and finds a tiny crumb of cheese, it gets an immediate `+1` reward. If it touches an electric wire, it gets an immediate `-10` reward. If it just moves to an empty square, it gets a small `-0.1` reward (to encourage it to hurry). The reward signal is the fundamental way the environment tells the agent, "What you just did was good/bad."

The Reward (R) is a scalar feedback signal that the environment provides to the agent after each action. It is the primary driver of learning, as the agent's ultimate goal is to maximize the total reward it accumulates over time.

Types of Rewards:

- Positive reward: reinforces the behavior that produced it (e.g., the `+1` crumb of cheese).
- Negative reward (penalty): discourages the behavior (e.g., the `-10` for touching the electric wire).
- Sparse rewards: feedback arrives only rarely, such as a single reward at the very end of a game, which makes learning harder.
- Dense rewards: feedback arrives at nearly every step (e.g., the `-0.1` per empty square), giving the agent more frequent guidance.
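As a concrete illustration, here is a minimal Python sketch of the maze reward signal described above. The cell labels (`"cheese"`, `"wire"`, `"empty"`) are invented for this example and are not part of any library:

```python
# Minimal sketch of the maze reward signal described above.
# The cell labels are illustrative only.

def maze_reward(cell: str) -> float:
    """Return the immediate reward for stepping onto a cell."""
    if cell == "cheese":
        return 1.0    # found a crumb of cheese
    if cell == "wire":
        return -10.0  # touched the electric wire
    return -0.1       # empty square: small penalty to discourage dawdling

print(maze_reward("cheese"))  # 1.0
print(maze_reward("empty"))   # -0.1
```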

🔹 2. Return (G)

Story-style intuition: The Long-Term Goal

The mouse in the maze isn't just trying to get the next crumb of cheese; its real goal is to get the big block of cheese at the end. The Return (G) is the total sum of all the rewards the mouse expects to get from its current position until the end of the maze. A smart mouse will choose a path of small negative rewards (empty steps) if it knows that path leads to the huge `+1000` reward of the final cheese block. It learns to prioritize the path with the highest Return, not just the highest immediate reward.

The Return (G) is the cumulative sum of future rewards. Because the future is uncertain and rewards that are far away are often less valuable than immediate ones, we use a discount factor (γ).

$$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots $$

The discount factor \( \gamma \) (a number between 0 and 1) determines the present value of future rewards. A \( \gamma \) of 0.9 means a reward received in the next step is worth 90% of its value now, a reward in two steps is worth 81%, and so on.
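The discounted sum is easy to compute directly. The short Python sketch below (illustrative, not tied to any RL library) applies the formula to a made-up trajectory of three empty steps followed by the big cheese block:

```python
# Compute the discounted return G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
# rewards[0] plays the role of R_{t+1}, rewards[1] of R_{t+2}, and so on.

def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Three empty steps (-0.1 each) followed by the big cheese block (+1000):
print(discounted_return([-0.1, -0.1, -0.1, 1000.0], gamma=0.9))  # ~728.73
```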

🔹 3. Value Function (V)

Story-style intuition: The Chess Master's Insight

A novice chess player only sees the immediate rewards (e.g., "I can capture their pawn!"). A chess master, however, understands the Value of a board position. A certain position might not offer any immediate captures, but the master knows it has a high value because it provides strong control over the center of the board and is highly likely to lead to a win (a large future return) later on. The Value Function is this deep, predictive understanding of "how good" a situation is in the long run.

A Value Function is a prediction of the expected future return. It is the core of many RL algorithms, as it allows the agent to make decisions based on the long-term consequences of its actions.

3.1 State-Value Function (V)

Answers the question: "How good is it to be in this state?"

$$ V^\pi(s) = \mathbb{E}_\pi [G_t \mid S_t = s] $$

This is the expected return an agent can get if it starts in state \(s\) and follows its policy \( \pi \) thereafter.

Example: In Pac-Man, the state-value \( V(s) \) of a position surrounded by pellets is high. The value of a position where Pac-Man is cornered by a ghost is very low.
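One common way to estimate \( V(s) \) is to average the returns actually observed from each state (a Monte Carlo estimate). The sketch below is illustrative only; the state names and return values are made up to mirror the Pac-Man example:

```python
# Hypothetical sketch: estimate V(s) as the average return observed from s.
from collections import defaultdict

returns_seen = defaultdict(list)  # state -> list of sampled returns G_t

def record_return(state, g):
    returns_seen[state].append(g)

def v_estimate(state):
    samples = returns_seen[state]
    return sum(samples) / len(samples) if samples else 0.0

record_return("near_pellets", 120.0)
record_return("near_pellets", 80.0)
record_return("cornered_by_ghost", -200.0)
print(v_estimate("near_pellets"))       # 100.0 -> high value
print(v_estimate("cornered_by_ghost"))  # -200.0 -> very low value
```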

3.2 Action-Value Function (Q-Function)

Answers the question: "How good is it to take this specific action in this state?"

$$ Q^\pi(s, a) = \mathbb{E}_\pi [G_t \mid S_t = s, A_t = a] $$

This is the expected return if the agent starts in state \(s\), takes action \(a\), and then follows its policy \( \pi \) from that point on. The Q-function is often more useful for decision-making because for any state, the agent can simply choose the action with the highest Q-value.

Example: You are Pac-Man at an intersection (state s). The Q-function would give you values for each action: \( Q(s, \text{move left}) = +50 \), \( Q(s, \text{move right}) = -200 \) (because a ghost is there). You would obviously choose to move left.
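In code, acting greedily with respect to a Q-function is just an argmax over the available actions. The Q-table below is hypothetical; the `move_left` and `move_right` values come from the example above, while `move_up` is invented to round out the illustration:

```python
# Sketch of greedy action selection from a (made-up) Q-table.
q_table = {
    ("intersection", "move_left"): 50.0,
    ("intersection", "move_right"): -200.0,
    ("intersection", "move_up"): 10.0,    # invented value for illustration
}

def greedy_action(state, actions):
    """Pick the action with the highest Q(s, a)."""
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))

print(greedy_action("intersection", ["move_left", "move_right", "move_up"]))
# -> "move_left"
```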

🔹 4. Reward vs. Value Function

| Aspect | Reward (R) | Value Function (V or Q) |
| --- | --- | --- |
| Timing | Immediate and short-term. | Long-term prediction of future rewards. |
| Source | Provided directly by the environment. | Estimated by the agent based on its experience. |
| Purpose | Defines the fundamental goal of the task. | Used to guide the agent's policy toward that goal. |
| Analogy | The `+1` point you get for eating a pellet in Pac-Man. | Your internal estimate of the final high score you are likely to reach from your current position. |

🔹 5. Examples

Example 1: Chess

- Reward: `+1` for winning the game, `-1` for losing, and `0` for every other move. The reward is extremely sparse, arriving only at the end of the game.
- Value: a position's estimated chance of eventually leading to the `+1` win, which is exactly the chess master's "insight" from Section 3.

Example 2: Self-driving Car

- Reward: small positive rewards for smooth progress toward the destination; large negative rewards for collisions or leaving the lane.
- Value: the car's estimate of the total future reward from its current situation; a state approaching a busy, unprotected intersection has a lower value than cruising in a clear lane.

🔹 6. Challenges

- Sparse and delayed rewards: when feedback arrives only at the end of a long task (e.g., winning a chess game), it is hard to tell which earlier actions deserve the credit (the credit assignment problem).
- Reward design: a poorly specified reward can be exploited; the agent maximizes the literal signal rather than the designer's intended goal.
- Value estimation: value functions are learned predictions based on limited experience, so they can be inaccurate for rarely visited states, and the choice of discount factor \( \gamma \) strongly affects how far ahead the agent plans.

{% endblock %}