Nemotron-3-Super-120B-A12B-tq3a-tq2e-g32 (48 GB hybrid)

TurboQuant hybrid quantization of nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 — 3-bit attention + 2-bit experts at group_size=32 — using TurboQuant-MLX.

This is the 48 GB-RAM variant of the Nemotron-3 Super 120B quantization. The standard 3-bit model (~50 GB) needs ~55 GB peak and only fits on a 64 GB Mac after raising iogpu.wired_limit_mb. This hybrid keeps attention at 3-bit (where precision matters) and pushes the experts to 2-bit (where the bulk of the weights live), dropping peak memory to ~40 GB so the model fits comfortably on a 48 GB or 64 GB Apple Silicon MacBook with headroom for other apps.

Model Details

  • Base Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 (hybrid Mamba + Sparse Attention + MoE, 120 B params total, ~12 B active per token)
  • Architecture: 88 layers, hybrid override pattern MEMEMEM*EMEMEMEM*… (M = Mamba, E = MoE, * = Attention)
  • Experts: 512 routed experts + 1 shared expert, latent MoE with moe_latent_size = 1024
  • Quantization: TurboQuant hybrid (Hadamard rotation + Lloyd-Max codebook)
    • Attention (q/k/v/o_proj): 3-bit
    • MoE experts and shared expert: 2-bit
    • Group size: 32 (per-group scaling)
  • Calibration data: none — TurboQuant is data-free
  • Size: ~36 GB on disk (vs ~240 GB BF16, ~6.7× smaller; vs the standard tq3 ~50 GB, 28% smaller; see the back-of-envelope check after this list)
  • Peak memory at decode: ~40 GB — fits the default iogpu.wired_limit_mb=49152 (48 GB) on a 64 GB Mac
  • Runs on: Apple Silicon (M1/M2/M3/M4) with 48 GB or more unified memory
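
As a rough back-of-envelope check on the sizes quoted above (illustrative only; it ignores metadata, the BF16 embeddings and norms, and the exact attention/expert split):

# Back-of-envelope for the sizes quoted above (not an exact tensor accounting).
params_total = 120e9                          # 120 B total parameters
bf16_gb   = params_total * 2 / 1e9            # 2 bytes per param -> ~240 GB
hybrid_gb = 36                                # size on disk quoted above
avg_bits  = hybrid_gb * 1e9 * 8 / params_total
print(f"BF16 ≈ {bf16_gb:.0f} GB")
print(f"implied average ≈ {avg_bits:.1f} bits/param")   # mostly 2-bit experts plus group scales
print(f"compression ≈ {bf16_gb / hybrid_gb:.1f}×")      # ≈ 6.7×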

Requirements

pip install "turboquant-mlx-full>=0.1.6" "mlx-lm>=0.31.3"

⚠️ Use turboquant-mlx-full 0.1.6 or newer — earlier versions don't have the per-layer --attn-bits / --mlp-bits plumbing required to load this hybrid model, and don't have the long-context kernel fix needed for prompts that span more than a few thousand tokens.

Quick Start

Download the model

hf download manjunathshiva/Nemotron-3-Super-120B-A12B-tq3a-tq2e-g32 \
    --local-dir ~/models/nemotron-3-super-120b-tq3a-tq2e-g32

Generate text — recommended config

For prose, code, formatting, and long-context tasks, use the empirically validated decode config (see the Phase-1 known limitation below for math/numeric prompts):

python -m turboquant_mlx.generate \
    --model ~/models/nemotron-3-super-120b-tq3a-tq2e-g32 \
    --prompt "Why is the sky blue? Explain in detail." \
    --max-tokens 4096 --min-tokens 50 \
    --temp 0.7 --rep-penalty 1.04 --rep-ctx 256

The --min-tokens 50 flag is required for Nemotron-3 Super — the model emits a <think> reasoning trace before its final answer, and the chat template primes EOS as the top-1 logit at the start of the assistant turn, so without a minimum token count generation can terminate before any answer is produced.

The small repetition penalty (--rep-penalty 1.04 --rep-ctx 256) prevents long-form generation from collapsing into degenerate tail loops past ~1500 tokens. Without it, you may see em-dash runs or repeated phrases at the tail of long essays.

From Python (mlx-lm)

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

model, tokenizer = load("manjunathshiva/Nemotron-3-Super-120B-A12B-tq3a-tq2e-g32")

sampler = make_sampler(temp=0.7)
processors = make_logits_processors(repetition_penalty=1.04, repetition_context_size=256)

response = generate(
    model, tokenizer,
    prompt="Why is the sky blue? Explain in simple terms.",
    max_tokens=200,
    sampler=sampler,
    logits_processors=processors,
)
print(response)
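
If you want the model's chat template applied from Python as well (so the <think> trace and EOS priming behave as described above), one option, assuming the tokenizer returned by load exposes the standard Hugging Face apply_chat_template, is:

# Reuses model, tokenizer, sampler, and processors from the snippet above.
messages = [{"role": "user", "content": "Why is the sky blue? Explain in simple terms."}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

response = generate(
    model, tokenizer,
    prompt=prompt,
    max_tokens=200,
    sampler=sampler,
    logits_processors=processors,
)
print(response)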

Phase-1 known limitation: math accuracy

Step-by-step arithmetic on this hybrid is degraded when a repetition penalty above 1.0 is applied. The 2-bit experts introduce small slips in numeric reasoning, and the repetition penalty makes those slips more likely rather than correcting them. For numeric/math prompts in this Phase-1 release, omit --rep-penalty:

# Math/numeric prompt — omit rep-penalty
python -m turboquant_mlx.generate \
    --model ~/models/nemotron-3-super-120b-tq3a-tq2e-g32 \
    --prompt "A train leaves Boston at 9:00 AM going 60 mph..." \
    --max-tokens 2048 --min-tokens 50 \
    --temp 0.7

The trade-off: without the penalty you may see degenerate tail loops on prompts with very long outputs, but the arithmetic will land correctly more often. For serious numeric work, prefer the standard tq3 model: manjunathshiva/Nemotron-3-Super-120B-A12B-tq3.

A permanent fix is planned for Phase 2 of TurboQuant-MLX — likely first/last-layer bit protection, a calibration-data codebook, or a fused QJL Metal kernel.

Long-context support

The fused MoE decode kernel transparently chunks expert routings on long prompts (K_CHUNK=4096), so this hybrid handles retrieval over 4000+ token prompts without the kernel argument-validation crash that affected earlier builds.

On a 4000+ token "needle in a haystack" prompt (a password buried in roughly 2000 words of filler on each side), the model recovers the password reliably.
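
Conceptually, the chunking amounts to the sketch below: instead of handing the kernel one huge block of routed tokens, the prompt is processed in fixed-size slices. This is only a schematic of the idea; the real work happens inside the fused Metal kernel, and process_chunk here is a hypothetical stand-in.

K_CHUNK = 4096  # chunk size used by the fused MoE kernel, per the note above

def chunked_expert_pass(routed_tokens, process_chunk):
    # Process routed tokens in K_CHUNK-sized slices so no single kernel
    # launch exceeds the kernel's argument-size limits.
    outputs = []
    for start in range(0, len(routed_tokens), K_CHUNK):
        outputs.append(process_chunk(routed_tokens[start:start + K_CHUNK]))
    return outputs

# Toy usage: a 10,000-"token" prompt gets split into chunks of 4096, 4096, 1808.
print([len(c) for c in chunked_expert_pass(list(range(10_000)), lambda xs: xs)])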

Results

Measured on an M-series MacBook with 64 GB of unified memory, running macOS with MLX and turboquant-mlx-full 0.1.6.

| Configuration | Size | Peak RAM | Fits 48 GB? | Speed |
|---|---|---|---|---|
| BF16 (original) | ~240 GB | n/a | n/a | n/a |
| TurboQuant 3-bit (standard) | ~50 GB | ~55 GB | ❌ (needs sysctl) | ~19 tok/s |
| TurboQuant hybrid (this repo) | ~36 GB | ~40 GB | ✅ | ~22.5 tok/s |

Stress test summary (sampler-B config: temp=0.7 rep_penalty=1.04 rep_ctx=256)

| Test | Result |
|---|---|
| 1500-word essay (3500-tok budget) | ✅ clean — proper conclusion + references, no degenerate tail |
| Step-by-step math (train-meeting problem) | ⚠️ Phase-1 limitation — final number off |
| Python code generation (merge_intervals + 3 unit tests) | ✅ clean |
| Long-context needle (4000-tok password recall) | ✅ password recovered |
| Numbered-list format (5 benefits, ≤15 words each) | ✅ clean — exits <think>, exactly 5 lines |
| Open-ended explanation (4096-tok budget) | ✅ clean — terminates at ~1.5K tokens with proper structure |

How It Works

TurboQuant applies the following steps in one shot, with no calibration data (a minimal sketch of steps 1-3 follows the list):

  1. Hadamard rotation — a reversible orthogonal transform that flattens weight outliers, so all values land in a narrow range that 2-bit/3-bit quantization can represent without large error.
  2. Lloyd-Max codebook — optimal scalar values (4 levels at 2-bit, 8 levels at 3-bit) chosen to minimize total quantization error. Codebooks are fixed and embedded in config.json.
  3. Group-wise scaling — per-group float16 scales (group size 32) preserve per-channel dynamic range. Smaller groups improve per-group fit at the cost of slightly larger storage.
  4. Hybrid bit allocation — attention precision matters more for next-token coherence; experts dominate storage. Splitting attention to 3-bit and experts to 2-bit recovers most of the standard-tq3 quality at ~28% smaller size.
  5. Latent-MoE quantization — Nemotron-3 Super's 512 experts share a 1024-dim latent space. Quantizing that shared space compresses every expert at once.
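
The snippet below is a minimal NumPy sketch of steps 1 to 3 on a single weight matrix. It is not the turboquant-mlx implementation: the Sylvester Hadamard construction, the simple alternating Lloyd-Max fit, and the per-group max-abs scaling are simplifications chosen for readability.

import numpy as np

def hadamard(n):
    # Sylvester construction of an orthonormal Hadamard matrix; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max_codebook(x, levels, iters=50):
    # 1-D Lloyd-Max fit (alternating assignment / centroid update) of `levels` values to x.
    codebook = np.quantile(x, np.linspace(0.05, 0.95, levels))
    for _ in range(iters):
        idx = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
        for k in range(levels):
            if np.any(idx == k):
                codebook[k] = x[idx == k].mean()
    return codebook

def quantize_groupwise(W, bits=2, group_size=32):
    # 1. Hadamard rotation along the input dimension flattens weight outliers.
    n = W.shape[1]
    H = hadamard(n)
    W_rot = W @ H
    # 2./3. Per-group fp scale + shared Lloyd-Max codebook with 2**bits levels.
    levels = 2 ** bits
    groups = W_rot.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) + 1e-8
    normed = (groups / scales).ravel()
    codebook = lloyd_max_codebook(normed, levels)
    codes = np.abs(normed[:, None] - codebook[None, :]).argmin(axis=1)
    # Dequantize: look up codebook values, rescale, undo the rotation (H is orthogonal).
    W_hat = (codebook[codes].reshape(-1, group_size) * scales).reshape(W_rot.shape) @ H.T
    return codes, scales, codebook, W_hat

W = np.random.randn(64, 128).astype(np.float32)
codes, scales, codebook, W_hat = quantize_groupwise(W, bits=2, group_size=32)
print("reconstruction RMSE:", np.sqrt(((W - W_hat) ** 2).mean()))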

Architecture Notes

Nemotron-3 Super is a hybrid model — the layer pattern alternates between:

  • M — Mamba state-space layers (cheap for long context)
  • E — Mixture-of-Experts (512 routed experts, latent-MoE design)
  • * — Sparse attention (used only where it helps)

Plus 1 MTP (multi-token-prediction) layer. There are no dense MLP layers — all FFN compute goes through the MoE.

turboquant-mlx-full 0.1.6 quantizes the following (an illustrative bit-allocation sketch follows):

  • Mamba in_proj / out_proj linears (2-bit)
  • Attention QKV / O linears (3-bit — the hybrid distinction)
  • The latent-MoE projections (fc1_latent_proj, fc2_latent_proj) — the latent space shared by every expert (2-bit)
  • The shared expert and MTP layer linears (2-bit)

Embeddings, layer norms, and small bias-style tensors stay in BF16 / FP16.
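
As an illustration of this bit allocation (not the actual turboquant-mlx code; the module-name patterns below are assumptions made for the sketch), a routing function could look like:

from fnmatch import fnmatch

# Hypothetical hybrid bit-width policy: 3-bit attention projections, 2-bit for
# everything else that gets quantized, None (keep BF16/FP16) for the rest.
ATTN_PATTERNS = ["*attn*.q_proj", "*attn*.k_proj", "*attn*.v_proj", "*attn*.o_proj"]
SKIP_PATTERNS = ["*embed*", "*norm*", "*bias*"]

def bits_for(name: str) -> int | None:
    if any(fnmatch(name, p) for p in SKIP_PATTERNS):
        return None    # embeddings, norms, bias-style tensors stay BF16/FP16
    if any(fnmatch(name, p) for p in ATTN_PATTERNS):
        return 3       # attention q/k/v/o: precision matters for coherence
    return 2           # Mamba in/out proj, latent-MoE, shared expert, MTP linears

for name in ["model.layers.7.attn.q_proj",         # hypothetical module names
             "model.layers.3.mixer.in_proj",
             "model.layers.10.mlp.fc1_latent_proj",
             "model.embed_tokens"]:
    print(f"{name} -> {bits_for(name)}")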

Roadmap

  • Phase 1 (this release, v0.1.6) — Hybrid quantization for 48 GB target + long-context kernel fix + recommended decode config. Math accuracy at long generation is a known limitation.
  • Phase 2 (planned) — Permanent math accuracy fix. Candidates under evaluation: first/last-layer bit protection (architectural prior), calibration-data Lloyd-Max codebook (algorithmic), or a fused QJL Metal kernel (kernel-level).

License

Released under the NVIDIA Nemotron Open Model License (same as the base model). See https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/

Acknowledgements

  • NVIDIA — for releasing Nemotron-3-Super-120B-A12B-BF16 openly
  • Google Research — for the original TurboQuant algorithm
  • Apple — for MLX and the unified-memory architecture that makes this fit
  • mlx-lm maintainers — for landing Nemotron-H + latent-MoE + MTP support in 0.31.3

Citation

@article{zandieh2025turboquant,
  title  = {TurboQuant: A Unified Framework for Extremely Low-Bit Weight and KV Cache Quantization},
  author = {Zandieh, Amir and Han, Minsik and Dalca, Andre and Shin, Jungwoo and Wang, Brian and Zhang, Yichao and Bordegoni, Matteo and Tian, Yuan and others},
  year   = {2025}
}
