{% extends "layout.html" %}
{% block content %}
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Study Guide: RL Reward & Value Function</title>
<!-- MathJax for rendering mathematical formulas -->
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
<style>
/* General Body Styles */
body {
background-color: #ffffff; /* White background */
color: #000000; /* Black text */
font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, Arial, sans-serif;
font-weight: normal;
line-height: 1.8;
margin: 0;
padding: 20px;
}
/* Container for centering content */
.container {
max-width: 800px;
margin: 0 auto;
padding: 20px;
}
/* Headings */
h1, h2, h3 {
color: #000000;
border: none;
font-weight: bold;
}
h1 {
text-align: center;
border-bottom: 3px solid #000;
padding-bottom: 10px;
margin-bottom: 30px;
font-size: 2.5em;
}
h2 {
font-size: 1.8em;
margin-top: 40px;
border-bottom: 1px solid #ddd;
padding-bottom: 8px;
}
h3 {
font-size: 1.3em;
margin-top: 25px;
}
/* Key terms (strong tags) are extra bold */
strong {
font-weight: 900;
}
/* Paragraphs and List Items with a line below */
p, li {
font-size: 1.1em;
border-bottom: 1px solid #e0e0e0; /* Light gray line below each item */
padding-bottom: 10px; /* Space between text and the line */
margin-bottom: 10px; /* Space below the line */
}
/* Remove bottom border from the last item in a list for cleaner look */
li:last-child {
border-bottom: none;
}
/* Ordered lists */
ol {
list-style-type: decimal;
padding-left: 20px;
}
ol li {
padding-left: 10px;
}
/* Unordered Lists */
ul {
list-style-type: none;
padding-left: 0;
}
ul li::before {
content: "•";
color: #000;
font-weight: bold;
display: inline-block;
width: 1em;
margin-left: 0;
}
/* Code block styling */
pre {
background-color: #f4f4f4;
border: 1px solid #ddd;
border-radius: 5px;
padding: 15px;
white-space: pre-wrap;
word-wrap: break-word;
font-family: "Courier New", Courier, monospace;
font-size: 0.95em;
font-weight: normal;
color: #333;
border-bottom: none;
}
/* RL Specific Styling */
.story-rl {
background-color: #f0faf5;
border-left: 4px solid #198754; /* Green accent */
margin: 15px 0;
padding: 10px 15px;
font-style: italic;
color: #555;
font-weight: normal;
border-bottom: none;
}
.story-rl p, .story-rl li {
border-bottom: none;
}
.example-rl {
background-color: #e9f7f1;
padding: 15px;
margin: 15px 0;
border-radius: 5px;
border-left: 4px solid #20c997; /* Lighter Green accent */
}
.example-rl p, .example-rl li {
border-bottom: none !important;
}
/* Table Styling */
table {
width: 100%;
border-collapse: collapse;
margin: 25px 0;
}
th, td {
border: 1px solid #ddd;
padding: 12px;
text-align: left;
}
th {
background-color: #f2f2f2;
font-weight: bold;
}
/* --- Mobile Responsive Styles --- */
@media (max-width: 768px) {
body, .container {
padding: 10px;
}
h1 { font-size: 2em; }
h2 { font-size: 1.5em; }
h3 { font-size: 1.2em; }
p, li { font-size: 1em; }
pre { font-size: 0.85em; }
table, th, td { font-size: 0.9em; }
}
</style>
</head>
<body>
<div class="container">
<h1>💰 Study Guide: Reward & Value Function in Reinforcement Learning</h1>
<h2>🔹 1. Reward (R)</h2>
<div class="story-rl">
<p><strong>Story-style intuition: The Immediate Feedback</strong></p>
<p>Imagine a mouse in a maze. The <strong>Reward</strong> is the immediate, tangible feedback it gets for its actions. If it takes a step and finds a tiny crumb of cheese, it gets an immediate <code>+1</code> reward. If it touches an electric wire, it gets an immediate <code>-10</code> reward. If it just moves onto an empty square, it gets a small <code>-0.1</code> reward (to encourage it to hurry). The reward signal is the fundamental way the environment tells the agent how good or bad its last action was.</p>
</div>
<p>The <strong>Reward (R)</strong> is a scalar feedback signal that the environment provides to the agent after each action. It is the primary driver of learning, as the agent's ultimate goal is to maximize the total reward it accumulates over time.</p>
<h3>Types of Rewards:</h3>
<ul>
<li><strong>Positive Reward:</strong> Encourages the agent to repeat the action that led to it.
<div class="example-rl"><p><strong>Example:</strong> In a video game, picking up a health pack gives a <code>+25</code> reward.</p></div>
</li>
<li><strong>Negative Reward (Penalty):</strong> Discourages the agent from repeating an action.
<div class="example-rl"><p><strong>Example:</strong> A self-driving car receives a <code>-100</code> reward for a collision.</p></div>
</li>
<li><strong>Zero Reward:</strong> A neutral signal, common for actions that don't have an immediate, obvious consequence.
<div class="example-rl"><p><strong>Example:</strong> In chess, most moves don't immediately win or lose the game, so they receive a reward of <code>0</code>.</p></div>
</li>
</ul>
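<p>To make the reward signal concrete, here is a minimal Python sketch of a reward rule for the mouse-in-the-maze story. It is purely illustrative: the cell labels and helper function are made up for this guide, not taken from any RL library.</p>
<pre>
# Hypothetical reward rule for the maze story: the environment
# returns one scalar after every action the mouse takes.
CHEESE_CRUMB = "crumb"
ELECTRIC_WIRE = "wire"

def reward(cell_type):
    """Scalar feedback for the cell the mouse steps onto."""
    if cell_type == CHEESE_CRUMB:
        return 1.0     # positive reward: encourages repeating the action
    if cell_type == ELECTRIC_WIRE:
        return -10.0   # penalty: discourages the action
    return -0.1        # empty square: small step cost to keep the mouse moving

print(reward("crumb"))  # 1.0
print(reward("wire"))   # -10.0
print(reward("empty"))  # -0.1
</pre>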
<h2>🔹 2. Return (G)</h2>
<div class="story-rl">
<p><strong>Story-style intuition: The Long-Term Goal</strong></p>
<p>The mouse in the maze isn't just trying to get the next crumb of cheese; its real goal is to get the big block of cheese at the end. The <strong>Return (G)</strong> is the total sum of all the rewards the mouse expects to get from its current position until the end of the maze. A smart mouse will choose a path of small negative rewards (empty steps) if it knows that path leads to the huge <code>+1000</code> reward of the final cheese block. It learns to prioritize the path with the highest <strong>Return</strong>, not just the highest immediate reward.</p>
</div>
<p>The <strong>Return (G)</strong> is the cumulative sum of future rewards. Because the future is uncertain and rewards that are far away are often less valuable than immediate ones, we use a <strong>discount factor (γ)</strong>.</p>
<p>$$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots $$</p>
<p>The discount factor \( \gamma \) (a number between 0 and 1) determines the present value of future rewards: each extra step of delay multiplies a reward's contribution by \( \gamma \). With \( \gamma = 0.9 \), the reward \( R_{t+2} \) counts at 90% of its face value, \( R_{t+3} \) at 81% (\( 0.9^2 \)), and so on.</p>
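<p>As a quick check on the formula, the sketch below computes a discounted return for a hand-picked reward sequence (three empty-square steps followed by the big cheese block from the story). The helper function is illustrative, not part of any library:</p>
<pre>
# G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
def discounted_return(rewards, gamma=0.9):
    """Sum rewards[k] weighted by gamma**k, where rewards[0] plays the role of R_{t+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([-0.1, -0.1, -0.1, 1000.0], gamma=0.9))
# ~728.729: the -0.1 steps barely matter next to 0.9**3 * 1000 = 729
</pre>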
<h2>🔹 3. Value Function (V)</h2>
<div class="story-rl">
<p><strong>Story-style intuition: The Chess Master's Insight</strong></p>
<p>A novice chess player only sees the immediate rewards (e.g., "I can capture their pawn!"). A chess master, however, understands the <strong>Value</strong> of a board position. A certain position might not offer any immediate captures, but the master knows it has a high value because it provides strong control over the center of the board and is highly likely to lead to a win (a large future return) later on. The <strong>Value Function</strong> is this deep, predictive understanding of "how good" a situation is in the long run.</p>
</div>
<p>A <strong>Value Function</strong> is a prediction of the expected future return. It is the core of many RL algorithms, as it allows the agent to make decisions based on the long-term consequences of its actions.</p>
<h3>3.1 State-Value Function (V)</h3>
<p>Answers the question: "How good is it to be in this state?"</p>
<p>$$ V^\pi(s) = \mathbb{E}_\pi [G_t \mid S_t = s] $$</p>
<p>This is the expected return an agent can get if it starts in state \(s\) and follows its policy \( \pi \) thereafter.</p>
<div class="example-rl">
<p><strong>Example:</strong> In Pac-Man, the state-value \( V(s) \) of a position surrounded by pellets is high. The value of a position where Pac-Man is cornered by a ghost is very low.</p>
</div>
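<p>Because \( V^\pi(s) \) is an expectation, the simplest way to estimate it is to average the returns actually observed from \( s \) while following \( \pi \) (a Monte Carlo estimate). The sketch below assumes the returns have already been collected; the state name and numbers are made up for illustration:</p>
<pre>
# Monte Carlo estimate of a state value: average the returns observed from that state.
from collections import defaultdict

returns_seen = defaultdict(list)  # state -> list of returns G_t sampled from that state

def record_return(state, g):
    returns_seen[state].append(g)

def estimate_value(state):
    samples = returns_seen[state]
    return sum(samples) / len(samples) if samples else 0.0

# Three episodes passed through the "surrounded by pellets" state:
for g in [120.0, 95.0, 110.0]:
    record_return("pellet_cluster", g)
print(estimate_value("pellet_cluster"))  # ~108.3
</pre>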
<h3>3.2 Action-Value Function (Q-Function)</h3>
<p>Answers the question: "How good is it to take this specific action in this state?"</p>
<p>$$ Q^\pi(s, a) = \mathbb{E}_\pi [G_t \mid S_t = s, A_t = a] $$</p>
<p>This is the expected return if the agent starts in state \(s\), takes action \(a\), and then follows its policy \( \pi \) from that point on. The Q-function is often more useful for decision-making because for any state, the agent can simply choose the action with the highest Q-value.</p>
<div class="example-rl">
<p><strong>Example:</strong> You are Pac-Man at an intersection (state s). The Q-function would give you values for each action: \( Q(s, \text{move left}) = +50 \), \( Q(s, \text{move right}) = -200 \) (because a ghost is there). You would obviously choose to move left.</p>
</div>
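<p>This is why the Q-function makes acting straightforward: given the Q-values for the current state, the greedy choice is just an argmax over actions. A tiny sketch using the Q-values from the Pac-Man example above, plus one extra made-up action:</p>
<pre>
# Greedy action selection from a table of Q-values for the current state.
q_values = {
    "move left":  50.0,
    "move right": -200.0,  # a ghost is waiting there
    "move up":    5.0,     # illustrative extra action
}

best_action = max(q_values, key=q_values.get)
print(best_action)  # move left
</pre>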
<h2>🔹 4. Reward vs. Value Function</h2>
<table>
<thead>
<tr>
<th>Aspect</th>
<th>Reward (R)</th>
<th>Value Function (V or Q)</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Timing</strong></td>
<td><strong>Immediate</strong> and short-term.</td>
<td><strong>Long-term</strong> prediction of future rewards.</td>
</tr>
<tr>
<td><strong>Source</strong></td>
<td>Provided directly by the <strong>environment</strong>.</td>
<td><strong>Estimated by the agent</strong> based on its experience.</td>
</tr>
<tr>
<td><strong>Purpose</strong></td>
<td>Defines the fundamental goal of the task.</td>
<td>Used to guide the agent's policy toward that goal.</td>
</tr>
<tr>
<td><strong>Analogy</strong></td>
<td>The <code>+1</code> point you get for eating a pellet in Pac-Man.</td>
<td>Your internal estimate of the final high score you are likely to get from your current position.</td>
</tr>
</tbody>
</table>
<h2>🔹 5. Examples</h2>
<div class="example-rl">
<h3>Example 1: Chess</h3>
<ul>
<li><strong>Reward:</strong> Sparse. +1 for a win, -1 for a loss, 0 for all other moves.</li>
<li><strong>Value Function:</strong> A high-value state is a board position where you have a strategic advantage (e.g., controlling the center, having more valuable pieces). The agent learns that these states, while not immediately rewarding, are valuable because they lead to a higher probability of winning.</li>
</ul>
</div>
<div class="example-rl">
<h3>Example 2: Self-driving Car</h3>
<ul>
<li><strong>Reward:</strong> A carefully shaped function: +1 for moving forward, -0.1 for jerky movements, -100 for a collision (a small sketch of such a shaped reward follows this example).</li>
<li><strong>Value Function:</strong> A high-value state is one that is "safe" and making progress (e.g., driving in the center of the lane with no obstacles nearby). A low-value state is one that is dangerous (e.g., being too close to the car in front), even if no negative reward has been received yet.</li>
</ul>
</div>
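<p>A "carefully shaped" reward like the one in the car example is usually written as a weighted sum of separate signals. The sketch below is a toy illustration of that idea, with made-up inputs and weights rather than a real autonomous-driving reward function:</p>
<pre>
# Toy shaped reward: several signals combined into one scalar per time step.
def shaped_reward(progress_m, jerk, collided):
    r = 1.0 * progress_m   # reward forward progress (metres this step)
    r -= 0.1 * jerk        # penalise jerky movements
    if collided:
        r -= 100.0         # large penalty for a collision
    return r

print(shaped_reward(progress_m=2.0, jerk=0.5, collided=False))  # ~1.95
print(shaped_reward(progress_m=0.0, jerk=0.0, collided=True))   # -100.0
</pre>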
<h2>🔹 6. Challenges</h2>
<ul>
<li><strong>Reward Shaping:</strong> Designing a good reward function is one of the hardest parts of applied RL. A poorly designed reward can lead to unintended "reward hacking."
<div class="example-rl"><p><strong>Example:</strong> An AI agent rewarded for winning a boat race discovered a bug where it could go in circles and collect turbo boosts infinitely, never finishing the race but accumulating a huge score. It maximized the reward signal, but not in the way the designers intended.</p></div>
</li>
<li><strong>Sparse Rewards:</strong> In many real-world problems, rewards are infrequent (like winning a long game). This makes it very difficult for the agent to figure out which of its thousands of actions were actually responsible for the final outcome.</li>
</ul>
</div>
</body>
</html>
{% endblock %}