Qwen3.5-122B-A10B — PrismaQuant 4.76 bpp

PrismaQuant source · License: Apache-2.0 · vLLM native

Mixed-precision quantization of Qwen/Qwen3.5-122B-A10B produced by PrismaQuant — a per-Linear sensitivity-driven allocator that chooses each Linear module's format individually under a total-bit budget.

Why "every layer refracts into a different format": a naive uniform NVFP4 either leaves disk on the table (keeping everything BF16 "to be safe") or loses quality (quantizing sensitive layers to 4-bit). PrismaQuant measures the actual Fisher-weighted MSE for every (Linear, format) pair and runs a multi-choice knapsack under a total-bit budget, so every bit lives where it buys the most likelihood.


At a glance

| Metric | BF16 source | This artifact | Delta |
|---|---|---|---|
| Size on disk | 244 GB | 72 GB | −70 % |
| Fraction of original weights | 100 % | 29.5 % | |
| Average bits per param | 16 | 4.76 | |
| Multimodal (vision + text) | ✓ | ✓ | |
| MTP speculative decoding heads | ✓ | ✓ | |
| Loads in vLLM (stock compressed-tensors) | ✓ | ✓ | |
| Runtime backend | any | vLLM only | |

Precision mix

This checkpoint uses three precisions, selected per-Linear by the allocator from measured sensitivity — not chosen uniformly:

| Format | W | A | Use | Count |
|---|---|---|---|---|
| NVFP4 | 4-bit (FP4, group_size=16 with per-group FP8 scale + per-tensor global) | 4-bit (dynamic) | Bulk MoE experts + medium-sensitivity dense Linears + full visual encoder | 72 dense + 96 per-expert + 2 MTP per-expert + visual NVFP4s = 170+ |
| MXFP8 | 8-bit (E4M3, group_size=32 with per-group E8M0 scale) | 8-bit (dynamic) | High-sensitivity dense Linears the allocator won't risk at 4-bit | 12 Linears |
| BF16 | 16-bit | 16-bit | Router, norms, biases, embed / lm_head, pos_embed, layer passthrough | 404 entries |

The allocator couples MoE gate_up_proj / down_proj so siblings share a scheme (vLLM's FusedMoE requires this), and fused attention siblings (q_proj/k_proj/v_proj) share one per-tensor global scale so the packed qkv_proj loads without the "accuracy mismatch" warning.
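
A sketch of what "share one per-tensor global scale" amounts to for fused siblings, assuming the common NVFP4 convention of global_scale = (E4M3 max × FP4 max) / amax; the function name is illustrative:

```python
import torch

FP4_MAX, FP8_E4M3_MAX = 6.0, 448.0   # FP4 (E2M1) and E4M3 dynamic-range maxima

def shared_global_scale(*sibling_weights):
    """Compute one per-tensor global scale over q_proj/k_proj/v_proj jointly,
    so the packed qkv_proj loads with a single consistent scale. Each sibling
    still gets its own per-group FP8 scales; only the global factor is shared."""
    amax = max(w.abs().max().item() for w in sibling_weights)
    return (FP8_E4M3_MAX * FP4_MAX) / amax
```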

Activation-aware passes applied during export

On every NVFP4 weight the exporter runs, in order:

  1. GPTQ-OBS one-shot rounding — block-wise error propagation along the group-quant structure using the calibration Hessian. Closed-form, not iterative. Handles cross-column activation coupling. (A minimal sketch follows this list.)
  2. Closed-form per-group scale sweep — for each 16-weight NVFP4 group, enumerate grid=32 candidate scales spanning [0.5·s₀, 1.5·s₀], round each weight to its nearest codebook neighbor at every candidate scale, and pick the (scale, rounding-set) configuration minimizing the activation-weighted per-group MSE sum_j a_j² · (w_orig,j − w_q,j)². An improve-or-keep gate compares the result against the post-GPTQ weight. Row-chunked to keep peak memory <2 GB regardless of Linear shape.
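
For reference, a minimal sketch of the pass-1 update in plain column order, with no blocking or lazy batching; `rtn` stands in for the NVFP4 group quantizer:

```python
import torch

def gptq_oneshot(W, H, rtn, damp=0.01):
    """One-shot OBS rounding sketch: quantize columns left-to-right and fold
    each column's rounding error into the not-yet-quantized columns through
    the Cholesky factor of the inverse calibration Hessian.
    W: (rows, cols) weight; H: (cols, cols) Hessian; rtn: per-column quantizer."""
    cols = W.shape[1]
    H = H + damp * H.diag().mean() * torch.eye(cols, device=H.device)  # dampening
    Hinv = torch.linalg.cholesky(torch.linalg.inv(H), upper=True)
    W, Q = W.clone(), torch.empty_like(W)
    for j in range(cols):
        Q[:, j] = rtn(W[:, j])
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]
        W[:, j + 1:] -= err[:, None] * Hinv[j, j + 1:][None, :]  # compensate
    return Q
```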

Scale_sweep is the closed-form analog of Intel's AutoRound — where AutoRound learns per-weight continuous rounding offsets V via 200 SGD iterations on a relaxed loss, scale_sweep enumerates the discrete scale dimension directly and lets RTN pick rounding conditional on scale. No gradient descent, sub-second per Linear after the row-chunked fix.
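
A minimal sketch of the sweep for a single group, assuming the FP4 (E2M1) magnitude set {0, 0.5, 1, 1.5, 2, 3, 4, 6}; function and argument names are illustrative:

```python
import torch

def scale_sweep_group(w, a2, s0, grid=32):
    """Sweep one 16-weight NVFP4 group: enumerate candidate scales in
    [0.5*s0, 1.5*s0], RTN onto the FP4 codebook at each, keep the scale with
    the lowest activation-weighted MSE. w: (16,) weights; a2: (16,) cached
    squared activation weights; s0: initial abs-max scale."""
    pos = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
    codebook = torch.cat([-pos.flip(0)[:-1], pos])          # symmetric FP4 levels
    best_q, best_mse = None, float("inf")
    for s in torch.linspace(0.5 * s0, 1.5 * s0, grid):
        idx = (w[:, None] / s - codebook[None, :]).abs().argmin(dim=1)
        q = codebook[idx] * s                               # dequantized candidate
        mse = (a2 * (w - q) ** 2).sum().item()
        if mse < best_mse:
            best_q, best_mse = q, mse
    return best_q   # the improve-or-keep gate vs the post-GPTQ weight sits outside
```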

Measured per-Linear output-MSE vs RTN baseline (Qwen3.6-35B, mixed visual + MTP Linears, geomean — 122B shape class is similar):

| Pipeline variant | out_mse ratio vs RTN |
|---|---|
| RTN (no passes) | 1.00 |
| GPTQ only | 0.41 |
| GPTQ + act_round polish (prior pipeline) | 0.99 (act_round undid GPTQ) |
| scale_sweep only | 0.33 |
| GPTQ + scale_sweep (this artifact) | 0.33 |

The prior pipeline's act_round polish turned out to systematically undo GPTQ's cross-column error propagation — its per-weight metric minima don't respect GPTQ's compensation structure. scale_sweep replaces it as a strict improvement.

AWQ's γ-fold is not applied. On NVFP4's 16-channel groups, AWQ's per-channel rescaling pushes mixed-scale values into the same group and inflates per-group quant noise rather than reducing it.
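
A toy illustration of that failure mode, with hypothetical values: fold a γ=4 rescale onto one channel of a 16-channel group quantized with a shared abs-max scale, and the other 15 channels get coarser.

```python
import torch

torch.manual_seed(0)
w = torch.randn(16)                        # one 16-channel group (toy values)
gamma = torch.ones(16); gamma[0] = 4.0     # AWQ-style per-channel fold

def group_err(w, gamma, qmax=6):
    """Quantize the folded group with a shared abs-max scale (integer grid
    standing in for the FP4 codebook), unfold, measure the effective error."""
    wg = w * gamma
    s = wg.abs().max() / qmax
    q = (wg / s).round().clamp(-qmax, qmax) * s
    return ((q / gamma - w) ** 2).mean().item()

print(group_err(w, torch.ones(16)))   # baseline group error
print(group_err(w, gamma))            # folded: channel 0 inflates the shared
                                      # scale, so the other 15 channels coarsen
```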


Which layers are quantized

Text body (DeltaNet linear-attention + dense MoE, 48 layers)

  • Full attention Linears (q_proj, k_proj, v_proj, o_proj): mixed NVFP4 / MXFP8 / BF16 per-Linear by sensitivity
  • DeltaNet linear-attention Linears (in_proj_qkv, in_proj_z, in_proj_a, in_proj_b, out_proj): same
  • MoE experts (gate_up_proj, down_proj, 64 experts per MoE layer): per-expert NVFP4 with joint per-tensor scale across the gate_up pair so vLLM FusedMoE loads them
  • Shared expert MLP: same per-Linear policy
  • Router (mlp.gate): always BF16 (tiny, sensitive)

Multi-token-prediction (MTP) head

  • Speculative-decoding head (1 layer) + its own MoE block: same per-Linear policy, so --speculative-config method=mtp drafts at the same precision as the body.

Visual encoder (27 blocks — Qwen3.5-VL vision tower)

  • Fisher-driven per-Linear allocation: 108 of 110 visual Linears got placed by the full DP allocator on the basis of per-Linear activation-weighted cost (8 multimodal calibration samples, 110 Linears tracked via the model.visual.* module tree).
  • Remaining 2 un-probed visual Linears (patch_embed.proj edges the probe didn't tap) stamped at NVFP4 uniformly.
  • model.visual.pos_embed stays BF16 — it's a learnable parameter, not an nn.Linear, and vLLM's compressed-tensors loader cannot consume a quantized Parameter layout. The allocator's discover pass excludes it explicitly.
  • This is the same treatment body Linears get. There is ONE incremental code path: the streaming multimodal probe keeps the visual tower (~2 GB) resident on GPU while it streams the 244 GB body layer-by-layer, capturing Fisher through inputs_embeds.backward(grad) that propagates into visual weights.
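
A sketch of that single incremental path, with hypothetical names: the body's streaming backward hands over the gradient w.r.t. the visual slice of inputs_embeds, and the resident tower turns it into per-parameter Fisher mass.

```python
import torch

def visual_fisher_step(visual_tower, pixel_values, grad_wrt_embeds, fisher):
    """Accumulate diagonal empirical Fisher for the resident visual tower.
    grad_wrt_embeds: gradient of the calibration loss w.r.t. the visual
    token slice of inputs_embeds, produced by the streaming body pass.
    All names are illustrative, not the repo's API."""
    embeds = visual_tower(pixel_values)       # visual tokens (carry grad_fn)
    embeds.backward(grad_wrt_embeds)          # chain rule into visual weights
    for name, p in visual_tower.named_parameters():
        if p.grad is not None:
            fisher[name] = fisher.get(name, 0.0) + p.grad.detach().float().pow(2)
            p.grad = None                     # keep memory flat across samples
```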

Passthrough (unquantized)

  • lm_head — kept at BF16 because vLLM's ParallelLMHead module only accepts a single weight parameter. The allocator measures lm_head's Fisher sensitivity and would pick NVFP4 for it (saving ~1.1 GB), but the compressed-tensors runtime rejects a compressed lm_head with KeyError: lm_head.input_global_scale because its scheme registry doesn't include ParallelLMHead. This is a vLLM runtime limitation, not a PrismaQuant design decision.
  • RMSNorm weights (all layers + MTP + visual)
  • All biases
  • embed_tokens
  • model.visual.pos_embed (Parameter/Embedding, see above)

Serving (vLLM only)

This artifact is only runnable via vLLM's stock compressed-tensors support — there is no transformers-native runtime path for mixed NVFP4 + MXFP8 with packed-MoE experts today. vLLM 0.11+ or equivalent is required.

```bash
vllm serve rdtand/Qwen3.5-122B-A10B-PrismaQuant-4.75bit-vllm \
    --trust-remote-code \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
```
  • FlashInfer NVFP4 attention is picked up automatically; set VLLM_USE_FLASHINFER_NVFP4=1 to make the preference explicit.
  • MTP speculative decoding at n=3 is the measured optimum for this family on DGX Spark (n=2 leaves ~10 % tok/s on the table, n=4 regresses).
  • Visual inputs work via vLLM's standard image-text-to-text chat API — no special flags.
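
For example, a minimal client call against the server started above (endpoint and image URL are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="rdtand/Qwen3.5-122B-A10B-PrismaQuant-4.75bit-vllm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.png"}},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }],
)
print(resp.choices[0].message.content)
```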

Reproducing this artifact

Full pipeline is in the PrismaQuant repo:

  1. Sensitivity probe — streaming per-shard empirical-Fisher trace (diagonal) across body + MTP + visual Linears. Each shard holds only its ~2 layers resident; the rest of the model is on disk or meta. 8 multimodal calibration samples drive visual Fisher through one unified streaming context.
  2. Per-(Linear, format) cost measurement — for each Linear and each candidate format, the per-group RTN error weighted by cached input activations (a minimal sketch of this metric follows the list). Incremental: same per-shard streaming as the probe.
  3. Multi-choice knapsack allocator — picks one format per Linear minimizing total predicted Δloss under the bit budget. Target 4.75 bpp; achieved 4.758 bpp here. Known-non-Linear rank-2 tensors (pos_embed, rotary_emb) are excluded from the visual pool.
  4. Export — streams each body / visual / MTP shard, applies GPTQ + activation-weighted rounding to its NVFP4 entries, writes the compressed-tensors format. lm_head passthrough at BF16 enforced at this stage (see known issues).
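
A minimal sketch of step 2's cost metric, assuming the cached activation statistic is the per-input-channel mean of squared calibration inputs; names are illustrative:

```python
import torch

def format_cost(W, act_sq_mean, rtn_fn):
    """Predicted delta-loss proxy for quantizing one Linear to one candidate
    format: RTN error weighted by cached input statistics.
    W: (out, in) weight; act_sq_mean: (in,) mean of squared calibration
    inputs per channel; rtn_fn: the candidate format's RTN quantizer."""
    Wq = rtn_fn(W)
    return ((W - Wq) ** 2 * act_sq_mean[None, :]).sum().item()
```

These per-(Linear, format) numbers are what step 3's knapsack consumes.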

Wall-clock on a DGX Spark (128 GB unified memory): ~1 hour when the probe, cost, and activation shards are already cached (body shards are invariant across export-pass flag changes, so only the final export stage reruns when you change a flag).


Known issues / limitations

  • vLLM only at serve time. No transformers-runtime path for this precision mix today.
  • lm_head stays BF16 because vLLM's ParallelLMHead does not register the NVFP4/MXFP8 compressed-tensors schemes. Allocator measured it and would have picked NVFP4; the runtime limitation forces BF16. Costs ~1.1 GB on the disk footprint.
  • MTP n=4 regresses on this family. Stick to n=3 unless you verify against the draft-head acceptance-rate trace.
  • Peak VRAM residency on DGX Spark (unified memory) is ~86 GB with FP8 KV cache at 32 k context; tune --gpu-memory-utilization and --max-model-len if the machine is shared.

Citation

```bibtex
@software{prismaquant2026,
  title        = {PrismaQuant: per-Linear sensitivity-driven mixed-precision
                  quantization for LLMs},
  author       = {Tand, Rob},
  year         = 2026,
  url          = {https://github.com/RobTand/prismaquant},
}
```