Open to Work
Tarun Reddi (PRO)
Teen-Different
4 followers · 17 following
https://redditarun.github.io/
_TeenDifferent
REDDITARUN
tarunreddi
AI & ML interests
Generative AI, Modular AI Systems, Reinforcement Learning
Recent Activity
posted an update · 3 days ago
Adaptive Attention at Inference Time: Does It Actually Work?

A hypernetwork that rewires GPT's value heads on every forward pass. The answer: not a clean win, but not a failure either.

Blog post: https://teendifferent.substack.com/p/adaptive-attention-at-inference-time
Code: https://github.com/REDDITARUN/a-gpt
Weights: https://huggingface.co/Teen-Different/adaptive-gpts

What This Is
Five small language model variants trained for 12k steps on a 300M-token mixed corpus, answering one question: can the residual stream be used to slightly rewrite the model's own computation while it is running? Instead of a fixed W_v for every context, a TinyHeadTransformer hypernetwork generates low-rank (LoRA-style) updates to the value projection of each attention head, conditioned on the current residual stream. Each token gets a dynamically adapted value transformation.

The Five Models
Base GPT: 28.9M params, 139 tok/s, val loss ~3.82
Matched GPT (+2 layers): 30.5M params, 204 tok/s, val loss ~3.80
Adaptive GPT: 30.5M params, 38.7 tok/s, val loss ~3.88–3.92
Diffusion GPT: 28.9M params, 110 tok/s, val loss ~5.0–5.2
Adaptive Diffusion GPT: 30.5M params, 40.4 tok/s, val loss ~5.0–5.2

Architecture: 4 layers, 4 heads, d_model=256, context=256, RoPE, GPT-2 tokenizer.

How the Hypernetwork Works
For each attention head, a TinyHeadTransformer encodes the head's residual-stream slice, mean-pools it to a conditioning vector, then projects it into low-rank factors A (d×r) and B (r×d) at rank=8 (a minimal sketch appears below). The dynamic value update follows LoRA conventions with alpha/r scaling. B is zero-initialized so the adaptive path starts inert and the model begins as a vanilla GPT, which is critical for training stability. The diffusion variant uses bidirectional attention, RMSNorm, squared ReLU, and a learned timestep embedding.
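To make the mechanism concrete, here is a minimal PyTorch sketch of the per-head dynamic value update described above, assuming head_dim=64 (d_model=256 / 4 heads) and the mean-pooled conditioning from the post. The class name TinyHeadHyper, the single-linear encoder stand-in, and alpha=16 are illustrative assumptions, not the actual TinyHeadTransformer from REDDITARUN/a-gpt.

```python
# Sketch only: the real TinyHeadTransformer in REDDITARUN/a-gpt may differ in detail.
import torch
import torch.nn as nn

class TinyHeadHyper(nn.Module):
    """Generates a rank-r LoRA-style update to one head's value projection,
    conditioned on that head's slice of the residual stream."""
    def __init__(self, head_dim: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.rank = rank
        self.scale = alpha / rank                          # LoRA alpha/r scaling
        self.encoder = nn.Linear(head_dim, head_dim)       # stand-in for the tiny transformer encoder
        self.to_A = nn.Linear(head_dim, head_dim * rank)   # produces A (d x r)
        self.to_B = nn.Linear(head_dim, rank * head_dim)   # produces B (r x d)
        nn.init.zeros_(self.to_B.weight)                   # B starts at zero...
        nn.init.zeros_(self.to_B.bias)                     # ...so the adaptive path is inert at init

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, head_dim), this head's residual-stream slice
        c = self.encoder(h).mean(dim=1)                    # mean-pool to a conditioning vector
        d = h.size(-1)
        A = self.to_A(c).view(-1, d, self.rank)            # (batch, d, r)
        B = self.to_B(c).view(-1, self.rank, d)            # (batch, r, d)
        return self.scale * (A @ B)                        # delta W_v: (batch, d, d)

# Usage inside attention: values come from W_v plus the per-pass update,
# i.e. v = x @ (W_v + delta_Wv) rather than a fixed projection.
hyper = TinyHeadHyper(head_dim=64)
x = torch.randn(2, 256, 64)                                # (batch, context, head_dim)
delta_Wv = hyper(x)                                        # (2, 64, 64)
v_adaptive = torch.einsum("bsd,bde->bse", x, delta_Wv)     # adaptive value contribution
```

Because B is zero-initialized, delta_Wv is exactly zero at the start of training, matching the post's claim that the model begins as a vanilla GPT.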
updated a model · 3 days ago · Teen-Different/adaptive-gpts
published a model · 4 days ago · Teen-Different/adaptive-gpts
Organizations
Teen-Different
Teen-Different's datasets (13)
Teen-Different/arc-agi-2-encoded-sft • Viewer • Updated 15 days ago • 112k • 5
Teen-Different/arc-agi-1-encoded-sft • Viewer • Updated 15 days ago • 106k • 9
Teen-Different/arc-agi-2-decoded-sft • Viewer • Updated 15 days ago • 112k • 7
Teen-Different/arc-agi-1-decoded-sft • Viewer • Updated 15 days ago • 106k • 5
Teen-Different/latex-handwritten-sft • Updated Dec 23, 2025 • 11
Teen-Different/grpo-oumi-anli-subset • Viewer • Updated Apr 25, 2025 • 21.1k • 37
Teen-Different/grpo-oumi-synthetic-claims • Viewer • Updated Apr 24, 2025 • 19.2k • 12
Teen-Different/grpo-oumi-c2d-d2c-subset • Viewer • Updated Apr 24, 2025 • 14.4k • 25
Teen-Different/grpo-oumi-synthetic-document-claims • Viewer • Updated Apr 24, 2025 • 8.4k • 28
Teen-Different/Facial-Expression • Preview • Updated Apr 6, 2025 • 27
Teen-Different/Food-Ingredient • Viewer • Updated Mar 30, 2025 • 5k • 31
Teen-Different/Code_Opt_Triton_Shuffled • Viewer • Updated Mar 27, 2025 • 36.3k • 19 • 1
Teen-Different/Code_Opt_Triton • Viewer • Updated Mar 27, 2025 • 36.3k • 54 • 1