# Nemotron-3-Super-120B-A12B-tq3a-tq2e-g32 (48 GB hybrid)
TurboQuant hybrid quantization of `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16` — 3-bit attention + 2-bit experts at `group_size=32` — using TurboQuant-MLX.
This is the 48 GB-RAM variant of the Nemotron-3 Super 120B quantization. The standard 3-bit (~50 GB) needs ~55 GB peak and only fits a 64 GB Mac after raising `iogpu.wired_limit_mb`. This hybrid keeps attention at 3-bit (where precision matters) and pushes experts to 2-bit (where the bulk of the weights live), dropping peak memory to ~40 GB so the model fits comfortably on a 48 GB or 64 GB Apple Silicon MacBook with headroom for other apps.
## Model Details
- Base Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 (hybrid Mamba + Sparse Attention + MoE, 120 B params total, ~12 B active per token)
- Architecture: 88 layers, hybrid override pattern `MEMEMEM*EMEMEMEM*…` (M = Mamba, E = MoE, * = Attention)
- Experts: 512 routed experts + 1 shared expert, latent MoE with `moe_latent_size = 1024`
- Quantization: TurboQuant hybrid (Hadamard rotation + Lloyd-Max codebook)
  - Attention (`q/k/v/o_proj`): 3-bit
  - MoE experts and shared expert: 2-bit
  - Group size: 32 (per-group scaling)
  - Calibration data: none — TurboQuant is data-free
- Size: ~36 GB on disk (vs ~240 GB BF16, ~6.7× smaller; vs the standard tq3 ~50 GB, 28% smaller)
- Peak memory at decode: ~40 GB — fits the default `iogpu.wired_limit_mb=49152` (48 GB) on a 64 GB Mac
- Runs on: Apple Silicon (M1/M2/M3/M4) with 48 GB or more unified memory
## Requirements
```bash
pip install "turboquant-mlx-full>=0.1.6" "mlx-lm>=0.31.3"
```
> ⚠️ Use `turboquant-mlx-full` 0.1.6 or newer — earlier versions lack the per-layer `--attn-bits`/`--mlp-bits` plumbing required to load this hybrid model and the long-context kernel fix needed for prompts that span more than a few thousand tokens.
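If you want to fail fast before loading a ~36 GB checkpoint, a small version guard can check the installed package at runtime. This is a sketch using only the standard library; the helper name is ours, and it assumes the installed distribution is named `turboquant-mlx-full` as in the pip command above:

```python
from importlib.metadata import PackageNotFoundError, version

def new_enough(pkg: str = "turboquant-mlx-full", minimum=(0, 1, 6)) -> bool:
    """Return True if `pkg` is installed at version `minimum` or newer."""
    try:
        # Compare the first three numeric components, e.g. "0.1.6" -> (0, 1, 6)
        parts = tuple(int(p) for p in version(pkg).split(".")[:3])
    except (PackageNotFoundError, ValueError):
        return False
    return parts >= minimum

if not new_enough():
    print("Install turboquant-mlx-full>=0.1.6 to load this hybrid model")
```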
## Quick Start
### Download the model
```bash
hf download manjunathshiva/Nemotron-3-Super-120B-A12B-tq3a-tq2e-g32 \
  --local-dir ~/models/nemotron-3-super-120b-tq3a-tq2e-g32
```
### Generate text — recommended config
For prose, code, formatted output, and long-context tasks, use the empirically validated decode config (see the Phase-1 known limitation below for math/numeric prompts):
```bash
python -m turboquant_mlx.generate \
  --model ~/models/nemotron-3-super-120b-tq3a-tq2e-g32 \
  --prompt "Why is the sky blue? Explain in detail." \
  --max-tokens 4096 --min-tokens 50 \
  --temp 0.7 --rep-penalty 1.04 --rep-ctx 256
```
The `--min-tokens 50` flag is required for Nemotron-3 Super — the model emits a `<think>` reasoning trace before its final answer, and the chat template primes EOS as the top-1 logit at the start of the assistant turn.

The small repetition penalty (`--rep-penalty 1.04 --rep-ctx 256`) prevents long-form generation from collapsing into degenerate tail loops past ~1500 tokens. Without it, you may see em-dash runs or repeated phrases at the tail of long essays.
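To see what that penalty does mechanically, here is a minimal sketch of the standard CTRL-style repetition-penalty rule, which we assume mlx-lm follows: positive logits of tokens seen in the recent context window are divided by the penalty, negative ones are multiplied by it. The function name is ours, for illustration only:

```python
def apply_rep_penalty(logits, context_ids, penalty=1.04, ctx=256):
    """Down-weight tokens that already appeared in the last `ctx` positions."""
    out = list(logits)
    for t in set(context_ids[-ctx:]):
        # Dividing a positive logit (or scaling up a negative one) lowers
        # that token's probability after softmax.
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

logits = [2.0, -1.0, 0.5]
penalized = apply_rep_penalty(logits, context_ids=[0, 1], penalty=2.0)
# token 0: 2.0 -> 1.0, token 1: -1.0 -> -2.0, token 2 untouched
```

A penalty of 1.04 is a gentle nudge: it breaks degenerate loops without distorting the distribution enough to hurt ordinary prose.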
### From Python (mlx-lm)
```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

model, tokenizer = load("manjunathshiva/Nemotron-3-Super-120B-A12B-tq3a-tq2e-g32")

# Match the recommended CLI config: temp 0.7, rep-penalty 1.04 over 256 tokens
sampler = make_sampler(temp=0.7)
processors = make_logits_processors(repetition_penalty=1.04, repetition_context_size=256)

response = generate(
    model, tokenizer,
    prompt="Why is the sky blue? Explain in simple terms.",
    max_tokens=200,
    sampler=sampler,
    logits_processors=processors,
)
print(response)
```
## Phase-1 known limitation: math accuracy
Step-by-step arithmetic on this hybrid is degraded under any non-zero `--rep-penalty`. The 2-bit experts cause small slips in numeric reasoning that the repetition penalty doesn't compensate for. For numeric/math prompts in this Phase-1 release, omit `--rep-penalty`:
```bash
# Math/numeric prompt — omit rep-penalty
python -m turboquant_mlx.generate \
  --model ~/models/nemotron-3-super-120b-tq3a-tq2e-g32 \
  --prompt "A train leaves Boston at 9:00 AM going 60 mph..." \
  --max-tokens 2048 --min-tokens 50 \
  --temp 0.7
```
The trade-off: without the penalty you may see tail loops on prompts with very long outputs, but the arithmetic will land correctly more often. For serious numeric work, prefer the standard tq3 model: `manjunathshiva/Nemotron-3-Super-120B-A12B-tq3`.
A permanent fix is planned for Phase 2 of TurboQuant-MLX — likely first/last-layer bit protection, a calibration-data codebook, or a fused QJL Metal kernel.
## Long-context support
The fused MoE decode kernel transparently chunks expert routings on long prompts (`K_CHUNK=4096`), so this hybrid handles long-context retrieval over 4000+ tokens of context without the kernel argument-validation crash that affected earlier builds.
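The chunking itself is just fixed-size slicing of the routed token stream. A plain-Python sketch of the idea (illustrative only, not the Metal kernel):

```python
K_CHUNK = 4096  # chunk size used by the fused MoE decode kernel

def chunks(seq, size=K_CHUNK):
    """Yield consecutive slices of at most `size` elements."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

# A 10,000-token prompt becomes three kernel launches: 4096 + 4096 + 1808
sizes = [len(c) for c in chunks(list(range(10_000)))]
```

Each slice stays within the kernel's argument limits, which is what prevents the validation crash on long prompts.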
A 4000+ token "needle in a haystack" prompt (recall a password buried in 2000 words of filler on each side) recovers the password reliably.
## Results
Measured on a 64 GB M-series MacBook running macOS, with MLX and `turboquant-mlx-full` 0.1.6.
| Configuration | Size | Peak RAM | Fits 48 GB? | Speed |
|---|---|---|---|---|
| BF16 (original) | ~240 GB | — | ❌ | n/a |
| TurboQuant 3-bit (standard) | ~50 GB | ~55 GB | ❌ (needs sysctl) | ~19 tok/s |
| TurboQuant hybrid (this repo) | ~36 GB | ~40 GB | ✅ | ~22.5 tok/s |
### Stress test summary (sampler-B config: `temp=0.7 rep_penalty=1.04 rep_ctx=256`)
| Test | Result |
|---|---|
| 1500-word essay (3500-tok budget) | ✅ clean — proper conclusion + references, no degenerate tail |
| Step-by-step math (train-meeting problem) | ⚠️ Phase-1 limitation — final number off |
| Python code generation (`merge_intervals` + 3 unit tests) | ✅ clean |
| Long-context needle (4000-tok password recall) | ✅ password recovered |
| Numbered-list format (5 benefits, ≤15 words each) | ✅ clean — exits `<think>`, exactly 5 lines |
| Open-ended explanation (4096-tok budget) | ✅ clean — terminates at ~1.5K tokens with proper structure |
## How It Works
TurboQuant applies, in one shot with no calibration data:
- Hadamard rotation — a reversible orthogonal transform that flattens weight outliers, so all values land in a narrow range that 2-bit/3-bit quantization can represent without large error.
- Lloyd-Max codebook — optimal scalar values (4 levels at 2-bit, 8 levels at 3-bit) chosen to minimize total quantization error. Codebooks are fixed and embedded in `config.json`.
- Group-wise scaling — per-group float16 scales (group size 32) preserve per-channel dynamic range. Smaller groups improve per-group fit at the cost of slightly larger storage.
- Hybrid bit allocation — attention precision matters more for next-token coherence; experts dominate storage. Splitting attention to 3-bit and experts to 2-bit recovers most of the standard-tq3 quality at ~28% smaller size.
- Latent-MoE quantization — Nemotron-3 Super's 512 experts share a 1024-dim latent space. Quantizing that shared space compresses every expert at once.
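The rotate-then-quantize pipeline can be sketched end to end on a single 32-weight group. This is a NumPy toy under stated assumptions, not the Metal implementation: the codebook initialization and Lloyd-Max iteration here are illustrative, and the real tool uses fixed precomputed codebooks:

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction orthonormal Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max_levels(x, n_levels, iters=50):
    """1-D Lloyd-Max: alternate nearest-level assignment and centroid update."""
    levels = np.quantile(x, np.linspace(0.1, 0.9, n_levels))
    for _ in range(iters):
        idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(n_levels):
            if np.any(idx == k):
                levels[k] = x[idx == k].mean()
    return np.sort(levels)

def quantize_group(w, n_levels=4):
    """Rotate the group, scale to unit range, snap each value to the codebook."""
    H = hadamard(len(w))
    r = H @ w                          # rotation flattens outliers
    scale = np.abs(r).max() + 1e-12    # per-group scale
    codebook = lloyd_max_levels(r / scale, n_levels)
    idx = np.abs((r / scale)[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook[idx] * scale, H, scale

rng = np.random.default_rng(0)
w = rng.normal(size=32)
w[3] = 8.0                             # an outlier the rotation will spread out
q, H, scale = quantize_group(w, n_levels=4)   # 4 levels = 2-bit
w_hat = H.T @ q                        # inverse rotation (H is orthonormal)
err = np.abs(w - w_hat).mean()
```

Because the rotation is orthogonal, dequantization is just the transpose; the outlier's energy is spread across all 32 rotated coordinates, so 4 levels suffice where the raw group would clip.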
## Architecture Notes
Nemotron-3 Super is a hybrid model — the layer pattern alternates between:
- M — Mamba state-space layers (cheap for long context)
- E — Mixture-of-Experts (512 routed experts, latent-MoE design)
- * — Sparse attention (used only where it helps)
Plus 1 MTP (multi-token-prediction) layer. There are no dense MLP layers — all FFN compute goes through the MoE.
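As a toy illustration of reading the override pattern, the string maps one character per layer to a block type. The full 88-character pattern is elided in this card, so we only expand the prefix shown above:

```python
from collections import Counter

# Legend from the card: M = Mamba, E = MoE, * = attention
BLOCKS = {"M": "mamba", "E": "moe", "*": "attention"}

def expand(pattern: str) -> list[str]:
    """Map each pattern character to its layer type."""
    return [BLOCKS[c] for c in pattern]

# Prefix of the pattern shown in Model Details
counts = Counter(expand("MEMEMEM*"))
```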
`turboquant-mlx-full` 0.1.6 quantizes:
- Mamba `in_proj`/`out_proj` linears (2-bit)
- Attention QKV / O linears (3-bit — the hybrid distinction)
- The latent-MoE projections (`fc1_latent_proj`, `fc2_latent_proj`) — the shared expert pantry (2-bit)
- The shared expert and MTP layer linears (2-bit)
Embeddings, layer norms, and small bias-style tensors stay in BF16 / FP16.
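How a loader might route each tensor to its precision can be sketched with simple name matching. The tensor names and the helper below are illustrative assumptions for this card, not the actual turboquant-mlx API:

```python
# Hypothetical name fragments; real checkpoint tensor names may differ.
ATTN_KEYS = ("q_proj", "k_proj", "v_proj", "o_proj")
SKIP_KEYS = ("embed", "norm", "bias")  # kept in BF16/FP16

def bits_for(name: str):
    """Return the bit-width for a weight tensor, or None to keep full precision."""
    if any(k in name for k in SKIP_KEYS):
        return None
    if any(k in name for k in ATTN_KEYS):
        return 3   # attention linears get the higher precision
    return 2       # Mamba, experts, latent projections, MTP

example = bits_for("layers.7.attn.q_proj.weight")  # attention linear -> 3-bit
```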
## Roadmap
- Phase 1 (this release, v0.1.6) — Hybrid quantization for 48 GB target + long-context kernel fix + recommended decode config. Math accuracy at long generation is a known limitation.
- Phase 2 (planned) — Permanent math accuracy fix. Candidates under evaluation: first/last-layer bit protection (architectural prior), calibration-data Lloyd-Max codebook (algorithmic), or a fused QJL Metal kernel (kernel-level).
## License
Released under the NVIDIA Nemotron Open Model License (same as the base model). See https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/
## Acknowledgements
- NVIDIA — for releasing Nemotron-3-Super-120B-A12B-BF16 openly
- Google Research — for the original TurboQuant algorithm
- Apple — for MLX and the unified-memory architecture that makes this fit
- `mlx-lm` maintainers — for landing Nemotron-H + latent-MoE + MTP support in 0.31.3
## Citation
```bibtex
@article{zandieh2025turboquant,
  title  = {TurboQuant: A Unified Framework for Extremely Low-Bit Weight and KV Cache Quantization},
  author = {Zandieh, Amir and Han, Minsik and Dalca, Andre and Shin, Jungwoo and Wang, Brian and Zhang, Yichao and Bordegoni, Matteo and Tian, Yuan and others},
  year   = {2025}
}
```
## Repository
- TurboQuant-MLX (the conversion tool): https://github.com/manjunathshiva/turboquant-mlx
- Standard tq3 variant (~50 GB, needs sysctl bump on 64 GB): `manjunathshiva/Nemotron-3-Super-120B-A12B-tq3`
- Issues / questions: https://github.com/manjunathshiva/turboquant-mlx/issues