# Qwen3.6-35B-A3B — PrismaQuant 4.9 bpp

Mixed-precision quantization of Qwen/Qwen3.6-35B-A3B produced by
PrismaQuant — a per-Linear, sensitivity-driven allocator that chooses each
Linear module's format individually under a total-bit budget.

Why "every layer refracts into a different format": a naive uniform policy
either leaves disk on the table (keeping everything BF16 "to be safe") or
loses quality (quantizing sensitive layers to 4-bit NVFP4). PrismaQuant
measures the actual Fisher-weighted MSE for every (Linear, format) pair and
runs a multi-choice knapsack under a total-bit budget, so every bit lives
where it buys the most likelihood.
## At a glance

| Metric | BF16 source | This artifact | Delta |
|---|---|---|---|
| Size on disk | 70 GB | 22 GB | −69 % |
| Fraction of original weights | 100 % | 31 % | |
| Average bits per param | 16 | 4.907 | |
| Multimodal (vision + text) | ✓ | ✓ | |
| MTP speculative decoding heads | ✓ | ✓ | |
| Loads in vLLM (stock compressed-tensors) | ✓ | ✓ | |
| Runtime backend | any | vLLM only | |
## Precision mix

This checkpoint uses three precisions selected by PrismaQuant's allocator and exporter:

- Allocator result at `target=4.9`: achieved `4.907` across costed Linears
  - BF16: 294 layers
  - NVFP4: 90 layers
  - MXFP8_E4M3: 17 layers
- Export recipe over all emitted entries:
  - BF16: 404
  - NVFP4: 90
  - MXFP8: 17
The allocator also produced a Pareto curve and reported a suggested knee at
`target=5.0` (achieved 4.971, predicted Δloss = 1.541e+02):
| target | achieved | Δloss (pred) | NVFP4 | MXFP8_E4M3 | BF16 |
|---|---|---|---|---|---|
| 4.500 | 4.613 | 5.5531e+02 | 183 | 1 | 217 |
| 4.600 | 4.636 | 4.3781e+02 | 170 | 10 | 221 |
| 4.700 | 4.707 | 3.0143e+02 | 136 | 28 | 237 |
| 4.750 | 4.755 | 2.5822e+02 | 119 | 34 | 248 |
| 4.850 | 4.860 | 1.9177e+02 | 98 | 24 | 279 |
| 5.000 | 4.971 | 1.5405e+02 | 84 | 0 | 317 |
| 5.250 | 5.141 | 1.3768e+02 | 82 | 0 | 319 |
| 5.500 | 5.406 | 1.3168e+02 | 80 | 0 | 321 |
| 6.000 | 5.935 | 1.2179e+02 | 76 | 0 | 325 |
| 7.000 | 6.995 | 9.8870e+01 | 68 | 0 | 333 |
| 8.250 | 8.584 | 7.0943e+01 | 56 | 0 | 345 |
The allocator couples MoE `gate_up_proj` / `down_proj` so siblings share
a scheme (vLLM's FusedMoE requires this), and fused attention siblings
(`q_proj`/`k_proj`/`v_proj`) share one per-tensor global scale so the
packed `qkv_proj` loads without the "accuracy mismatch" warning. A sketch
of the shared-scale computation follows.
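A minimal sketch of that coupling, assuming vLLM-style NVFP4 two-level scaling (per-16-weight FP8-E4M3 group scales under one FP32 per-tensor global scale); the `448 × 6` constant (E4M3 max times E2M1 max) and the joint-amax computation are assumptions about the scheme, not PrismaQuant's exact code:

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite float8-e4m3 magnitude (assumption)
FP4_E2M1_MAX = 6.0    # largest float4-e2m1 magnitude (assumption)

def joint_global_scale(*siblings: torch.Tensor) -> torch.Tensor:
    # Fused siblings must share one per-tensor global scale, so take the
    # amax jointly over ALL siblings rather than per tensor.
    amax = torch.stack([w.abs().max() for w in siblings]).max()
    return (FP8_E4M3_MAX * FP4_E2M1_MAX) / amax

# Usage: compute once, then quantize each sibling against the shared scale
# so the packed qkv_proj entry loads without a scale mismatch, e.g.
#   gs = joint_global_scale(q_proj.weight, k_proj.weight, v_proj.weight)
```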
## Activation-aware passes applied during export
On every NVFP4 weight the exporter runs, in order:
- GPTQ-OBS one-shot rounding — block-wise error propagation along the group-quant structure using the calibration Hessian. Closed-form, not iterative. Handles cross-column activation coupling.
- Closed-form per-group scale sweep — for each 16-weight NVFP4 group,
  enumerate `grid=32` candidate scales spanning `[0.5·s₀, 1.5·s₀]`, round
  each weight to its nearest codebook neighbor at every candidate scale,
  and pick the (scale, rounding-set) configuration minimizing the
  activation-weighted per-group MSE `∑_j a_j² · (w_orig,j − w_q,j)²`.
  Improve-or-keep gate against the post-GPTQ weight. Handles within-group
  weight-distribution variation that GPTQ takes as fixed. (Sketches of
  both passes follow below.)
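To make the first pass concrete, here is a textbook GPTQ column loop in Cholesky form — a minimal sketch under stated assumptions, not PrismaQuant's exporter code; `quantize_rtn` is a placeholder for per-group NVFP4 RTN:

```python
import torch

def gptq_rounding(W: torch.Tensor, H: torch.Tensor, quantize_rtn,
                  damp: float = 0.01) -> torch.Tensor:
    """One-shot GPTQ: quantize columns left-to-right, propagating each
    column's rounding error into the not-yet-quantized columns via the
    inverse calibration Hessian. W: (out, in); H: (in, in) = E[x x^T]."""
    W, H = W.clone(), H.clone()
    # Dampen for numerical stability, then factor H^-1.
    H += damp * H.diagonal().mean() * torch.eye(H.shape[0], device=H.device)
    Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))
    U = torch.linalg.cholesky(Hinv, upper=True)  # upper-triangular factor
    Q = torch.empty_like(W)
    for j in range(W.shape[1]):
        q = quantize_rtn(W[:, j])                # round one column
        Q[:, j] = q
        err = (W[:, j] - q) / U[j, j]
        W[:, j + 1:] -= torch.outer(err, U[j, j + 1:])  # compensate the rest
    return Q
```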
`scale_sweep` is the closed-form analog of Intel's AutoRound — where AutoRound learns per-weight continuous rounding offsets `V` via 200 SGD iterations on a relaxed loss, `scale_sweep` enumerates the discrete scale dimension directly and lets RTN pick the rounding conditional on the scale. No gradient descent; sub-second per Linear.
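And a minimal sketch of the second pass under the same caveat — the symmetric E2M1 codebook and the `a_sq` vector of cached per-column `E[a²]` statistics are assumptions, not the exporter's actual data structures:

```python
import torch

# Assumed symmetric FP4-E2M1 codebook (positive magnitudes mirrored).
POS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
CODEBOOK = torch.cat([-POS.flip(0)[:-1], POS])

def rtn(w: torch.Tensor, scale: float) -> torch.Tensor:
    # Round each weight to its nearest codebook neighbor at this scale.
    idx = (w.unsqueeze(-1) / scale - CODEBOOK).abs().argmin(-1)
    return CODEBOOK[idx] * scale

def scale_sweep_group(w_orig, w_gptq, a_sq, grid=32):
    """Sweep per-group scales in [0.5*s0, 1.5*s0]; keep whichever
    (scale, rounding) minimizes sum_j a_j^2 * (w_orig_j - w_q_j)^2.
    All inputs are (16,) tensors for one NVFP4 group (assumed non-zero)."""
    s0 = float(w_gptq.abs().max() / CODEBOOK.max())
    best_w = rtn(w_gptq, s0)                          # improve-or-keep baseline
    best_err = (a_sq * (w_orig - best_w) ** 2).sum()
    for s in torch.linspace(0.5 * s0, 1.5 * s0, grid):
        w_q = rtn(w_gptq, float(s))
        err = (a_sq * (w_orig - w_q) ** 2).sum()
        if err < best_err:
            best_w, best_err = w_q, err
    return best_w
```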
Measured per-Linear output-MSE vs RTN baseline (Qwen3.6-35B, mixed visual + MTP Linears, geomean):
| Pipeline variant | out_mse ratio vs RTN |
|---|---|
| RTN (no passes) | 1.00 |
| GPTQ only | 0.41 |
| GPTQ + act_round polish (prior pipeline) | 0.99 (act_round undid GPTQ) |
| scale_sweep only | 0.33 |
| GPTQ + scale_sweep (this artifact) | 0.33 |
The prior pipeline's act_round polish was a closed-form per-weight
Δw²·E[a²] minimization at the fixed group scale. It turned out to
systematically undo GPTQ's cross-column error propagation — the
per-weight metric minima don't respect GPTQ's compensation structure.
scale_sweep replaces it as a strict improvement.
AWQ's γ-fold is not applied. On NVFP4's 16-channel groups, AWQ's per-channel rescaling pushes mixed-scale values into the same group and inflates per-group quant noise rather than reducing it (measured: baseline PPL 4.97, AWQ-only 16.44 — +230 %).
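A toy illustration of that failure mode (numbers are made up for illustration, not from the measurement above): folding a large per-channel γ into the weights inflates one group's amax, coarsening the quantization step for every other weight in that group.

```python
import torch

w = torch.full((16,), 0.1)              # one NVFP4 group, uniform magnitudes
gamma = torch.ones(16); gamma[0] = 8.0  # AWQ-style boost for one salient channel

step = lambda v: (v.abs().max() / 6.0).item()  # quant step ≈ group amax / E2M1 max

print(step(w))          # ~0.017 — fine-grained step for the plain group
print(step(w * gamma))  # ~0.133 — 8x coarser for the 15 non-boosted weights
```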
## Which layers are quantized

### Text body (DeltaNet linear-attention + dense MoE, 40 layers)

- Full attention Linears (`q_proj`, `k_proj`, `v_proj`, `o_proj`): mixed NVFP4 / MXFP8 / BF16 per-Linear by sensitivity
- DeltaNet linear-attention Linears (`in_proj_qkv`, `in_proj_z`, `in_proj_a`, `in_proj_b`, `out_proj`): same
- MoE experts (`gate_up_proj`, `down_proj`): per-expert NVFP4 with joint per-tensor scale across the `gate_up` pair so vLLM FusedMoE loads them
- Shared expert MLP: same per-Linear policy
- Router (`mlp.gate`): always BF16 (tiny, sensitive)
### Multi-token-prediction (MTP) head

- Speculative-decoding head (1 layer) + its own MoE block: same per-Linear policy, so `--speculative-config method=mtp` drafts at the same precision as the body.
### Visual encoder (27 blocks — Qwen3.6-VL vision tower)

- Visual Linears were forced to BF16 with `--visual-format=BF16`.
- 110 visual Linears were assigned BF16 uniformly.
- `model.visual.pos_embed` remains BF16 passthrough as a Parameter.
### Passthrough (unquantized)

- `lm_head` — kept at BF16 because vLLM's `ParallelLMHead` module only accepts a single `weight` parameter. The allocator measures lm_head's Fisher sensitivity and would pick NVFP4 for it (saving ~770 MB), but the compressed-tensors runtime rejects a compressed lm_head with `KeyError: lm_head.input_global_scale` because its scheme registry doesn't include `ParallelLMHead`. This is a vLLM runtime limitation, not a PrismaQuant design decision.
- RMSNorm weights (all layers + MTP + visual)
- All biases
- `embed_tokens`
- `model.visual.pos_embed` (Parameter/Embedding, see above)
## Serving (vLLM only)

This artifact is only runnable via vLLM's stock compressed-tensors
support — there is no transformers-native runtime path for mixed NVFP4 +
MXFP8 with packed-MoE experts today. vLLM 0.11+ or equivalent is
required.

```bash
vllm serve rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.9bit-vllm \
  --trust-remote-code \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
```
- FlashInfer NVFP4 attention is picked up automatically; set `VLLM_USE_FLASHINFER_NVFP4=1` to make the preference explicit.
- MTP speculative decoding at `n=3` is the measured optimum for this family on DGX Spark (`n=2` leaves ~10 % tok/s on the table, `n=4` regresses).
- Visual inputs work via vLLM's standard `image-text-to-text` chat API — no special flags.
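For example, a minimal multimodal request against the server above through the OpenAI-compatible endpoint (the image URL is a placeholder):

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible server on port 8000 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.9bit-vllm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/cat.png"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```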
## Reproducing this artifact

Full pipeline is in the PrismaQuant repo:

- Sensitivity probe — streaming per-shard empirical-Fisher trace (diagonal) across body + MTP + visual Linears. Each shard holds only its ~2 layers resident; the rest of the model is on disk or meta. 8 multimodal calibration samples drive visual Fisher through one unified streaming context. (See the Fisher-trace sketch after this list.)
- Per-(Linear, format) cost measurement — for each Linear and each candidate format, the per-group RTN error weighted by cached input activations. Incremental: same per-shard streaming as the probe.
- Multi-choice knapsack allocator — picks one format per Linear minimizing total predicted Δloss under the bit budget. Target 4.9 bpp; achieved 4.907 bpp here. The same run reported a knee near target 5.0 (achieved 4.971). Known-non-Linear rank-2 tensors (`pos_embed`, `rotary_emb`) are excluded from the visual pool.
- Export — streams each body / visual / MTP shard, applies GPTQ + scale_sweep to NVFP4 entries, and writes compressed-tensors shards. Export recipe for this run: 511 entries with mix `{NVFP4: 90, BF16: 404, MXFP8: 17}`. `lm_head` passthrough at BF16 is enforced at this stage (see known issues).
Wall-clock on a DGX Spark (128 GB unified memory): ~15 min on cached probe + cost + activation shards (body shards are invariant across export-pass flag changes, so only the final export stage reruns when you change a flag).
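To make the sensitivity probe concrete, here is a minimal diagonal empirical-Fisher accumulator — a sketch only, without the per-shard streaming; `loss_fn` and the calibration batches are assumed inputs:

```python
import torch
import torch.nn as nn

def fisher_diag_trace(model: nn.Module, calib_batches, loss_fn):
    """Accumulate E[grad^2] per Linear weight — the diagonal empirical
    Fisher — then reduce to one scalar sensitivity per module (its trace)."""
    fisher = {n: torch.zeros_like(m.weight, device="cpu")
              for n, m in model.named_modules() if isinstance(m, nn.Linear)}
    for batch in calib_batches:
        model.zero_grad(set_to_none=True)
        loss_fn(model, batch).backward()  # log-likelihood gradient
        for n, m in model.named_modules():
            if isinstance(m, nn.Linear) and m.weight.grad is not None:
                fisher[n] += m.weight.grad.detach().to("cpu") ** 2
    return {n: f.sum().item() / len(calib_batches) for n, f in fisher.items()}
```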
## Known issues / limitations

- vLLM only at serve time. No transformers-runtime path for this precision mix today.
- lm_head stays BF16 because vLLM's `ParallelLMHead` does not register the NVFP4/MXFP8 compressed-tensors schemes. The allocator measured it and would have picked NVFP4; the runtime limitation forces BF16. Costs ~770 MB on the disk footprint.
- MTP n=4 regresses on this family. Stick to `n=3` unless you verify against the draft-head acceptance-rate trace.
- Export fast path was unavailable in this run. The exporter fell back to the torch implementation because flash-linear-attention and/or causal-conv1d dependencies were missing.
- PyTorch CUDA capability warning on GB10. The environment printed a warning that GPU capability 12.1 is outside the declared max (12.0) for that specific torch build.
## Links

- Source: github.com/RobTand/prismaquant
- Base model: Qwen/Qwen3.6-35B-A3B
## Citation

```bibtex
@software{prismaquant2026,
  title  = {PrismaQuant: per-Linear sensitivity-driven mixed-precision
            quantization for LLMs},
  author = {Tand, Rob},
  year   = 2026,
  url    = {https://github.com/RobTand/prismaquant},
}
```