Qwen3.6-35B-A3B — PrismaQuant 4.9 bpp

PrismaQuant source · License: Apache-2.0 · vLLM native

Mixed-precision quantization of Qwen/Qwen3.6-35B-A3B produced by PrismaQuant — a per-Linear sensitivity-driven allocator that chooses each Linear module's format individually under a total-bit budget.

Why "every layer refracts into a different format": a naive uniform NVFP4 either leaves disk on the table (keeping everything BF16 "to be safe") or loses quality (quantizing sensitive layers to 4-bit). PrismaQuant measures the actual Fisher-weighted MSE for every (Linear, format) pair and runs a multi-choice knapsack under a total-bit budget, so every bit lives where it buys the most likelihood.


At a glance

| Metric | BF16 source | This artifact | Delta |
|---|---|---|---|
| Size on disk | 70 GB | 22 GB | −69 % |
| Fraction of original weights | 100 % | 31 % | |
| Average bits per param | 16 | 4.907 | |
| Multimodal (vision + text) | ✓ | ✓ | |
| MTP speculative decoding heads | ✓ | ✓ | |
| Loads in vLLM (stock compressed-tensors) | n/a | ✓ | |
| Runtime backend | any | vLLM only | |

Precision mix

This checkpoint uses three precisions selected by PrismaQuant's allocator and exporter:

  • Allocator result at target=4.9: achieved 4.907 across costed Linears
    • BF16: 294 layers
    • NVFP4: 90 layers
    • MXFP8_E4M3: 17 layers
  • Export recipe over all emitted entries:
    • BF16: 404
    • NVFP4: 90
    • MXFP8: 17

The allocator also produced a Pareto curve and reported a suggested knee at target=5.0, achieved 4.971, with predicted Δloss=1.541e+02.

| Target bpp | Achieved bpp | Δloss (pred.) | NVFP4 | MXFP8_E4M3 | BF16 |
|---|---|---|---|---|---|
| 4.500 | 4.613 | 5.5531e+02 | 183 | 1 | 217 |
| 4.600 | 4.636 | 4.3781e+02 | 170 | 10 | 221 |
| 4.700 | 4.707 | 3.0143e+02 | 136 | 28 | 237 |
| 4.750 | 4.755 | 2.5822e+02 | 119 | 34 | 248 |
| 4.850 | 4.860 | 1.9177e+02 | 98 | 24 | 279 |
| 5.000 | 4.971 | 1.5405e+02 | 84 | 0 | 317 |
| 5.250 | 5.141 | 1.3768e+02 | 82 | 0 | 319 |
| 5.500 | 5.406 | 1.3168e+02 | 80 | 0 | 321 |
| 6.000 | 5.935 | 1.2179e+02 | 76 | 0 | 325 |
| 7.000 | 6.995 | 9.8870e+01 | 68 | 0 | 333 |
| 8.250 | 8.584 | 7.0943e+01 | 56 | 0 | 345 |
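
The knee itself can be read off such a curve mechanically. The sketch below uses a common farthest-from-chord heuristic; it is an illustration rather than necessarily the allocator's own rule:

```python
import numpy as np

def knee_index(bpp, dloss):
    """Index of the point farthest below the chord joining the curve's endpoints."""
    bpp, dloss = np.asarray(bpp, float), np.asarray(dloss, float)
    x = (bpp - bpp[0]) / (bpp[-1] - bpp[0])          # normalize both axes to [0, 1]
    y = (dloss - dloss[-1]) / (dloss[0] - dloss[-1])
    # deviation from the chord x + y = 1 (the 1/√2 factor does not change the argmax)
    return int(np.argmax(np.abs(x + y - 1.0)))
```

On the achieved-bpp and Δloss columns above, this heuristic happens to pick the target=5.0 row (achieved 4.971), agreeing with the reported knee.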

The allocator couples MoE gate_up_proj / down_proj so siblings share a scheme (vLLM's FusedMoE requires this), and fused attention siblings (q_proj/k_proj/v_proj) share one per-tensor global scale so the packed qkv_proj loads without the "accuracy mismatch" warning.
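
For the fused-attention coupling specifically, a shared NVFP4 global scale can be derived from the joint amax of the three siblings. A minimal sketch, assuming NVFP4's usual two-level scaling (FP4-E2M1 code max 6.0, per-group FP8-E4M3 scales with max 448); the function name is hypothetical:

```python
import torch

FP4_MAX, FP8_E4M3_MAX = 6.0, 448.0   # assumed NVFP4 two-level scaling constants

def shared_qkv_global_scale(q_w, k_w, v_w):
    """One per-tensor global scale reused by q/k/v so the packed qkv_proj sees a single value."""
    amax = torch.stack([w.abs().max() for w in (q_w, k_w, v_w)]).max()
    # map the worst-case per-group scale into FP8-E4M3 range using the joint amax
    return (FP4_MAX * FP8_E4M3_MAX) / amax.float()
```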

Activation-aware passes applied during export

On every NVFP4 weight the exporter runs, in order:

  1. GPTQ-OBS one-shot rounding — block-wise error propagation along the group-quant structure using the calibration Hessian. Closed-form, not iterative. Handles cross-column activation coupling.
  2. Closed-form per-group scale sweep — for each 16-weight NVFP4 group, enumerate grid=32 candidate scales spanning [0.5·s₀, 1.5·s₀], round every weight to its nearest codebook value at each candidate scale, and pick the (scale, rounding-set) pair that minimizes the activation-weighted per-group MSE Σ_j a_j²·(w_orig,j − w_q,j)². An improve-or-keep gate compares against the post-GPTQ weight and keeps whichever is better. This handles within-group weight-distribution variation that GPTQ treats as fixed. (Both passes are sketched below.)
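
A compact sketch of the GPTQ-OBS pass in item 1, written column by column with a plain damped inverse for clarity; the real exporter's block-wise, group-aware formulation (and its Cholesky arithmetic) is more involved, and `quantize_col` stands in for the NVFP4 group quantizer:

```python
import torch

def gptq_obs(W, H, quantize_col):
    """One-shot OBS rounding: quantize columns left to right and fold each column's
    quantization error into the not-yet-quantized columns via the inverse Hessian.
    W: (rows, cols) weight; H: (cols, cols) calibration Hessian, roughly E[x xᵀ]."""
    W = W.clone()
    damp = 1e-2 * H.diag().mean() * torch.eye(H.shape[0], device=H.device)
    Hinv = torch.linalg.inv(H + damp)
    for i in range(W.shape[1]):
        q = quantize_col(W[:, i])                     # e.g. NVFP4 RTN on this column
        err = (W[:, i] - q) / Hinv[i, i]
        W[:, i] = q
        if i + 1 < W.shape[1]:                        # propagate error to remaining columns
            W[:, i + 1:] -= err.unsqueeze(1) * Hinv[i, i + 1:].unsqueeze(0)
    return W
```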

Scale_sweep is the closed-form analog of Intel's AutoRound — where AutoRound learns per-weight continuous rounding offsets V via 200 SGD iterations on a relaxed loss, scale_sweep enumerates the discrete scale dimension directly and lets RTN pick rounding conditional on scale. No gradient descent, sub-second per Linear.
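
And a matching sketch of the scale sweep in item 2; function names are illustrative, and the FP8 quantization of the group scale itself is omitted for brevity:

```python
import torch

FP4_MAGS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])   # E2M1 magnitudes
FP4_GRID = torch.cat([-FP4_MAGS.flip(0), FP4_MAGS])                  # signed codebook

def rtn_group(w, scale):
    """Round each weight in the group to its nearest FP4 code at the given scale."""
    idx = (w[:, None] / scale - FP4_GRID).abs().argmin(dim=1)
    return FP4_GRID[idx] * scale

def sweep_group(w_orig, w_gptq, act_sq, grid=32):
    """w_orig / w_gptq: one 16-weight group before / after GPTQ; act_sq: per-column E[a²]."""
    s0 = float(w_orig.abs().max()) / 6.0 + 1e-12       # default RTN scale (amax / max code)
    err = lambda w_q: float((act_sq * (w_orig - w_q) ** 2).sum())
    best_q = rtn_group(w_gptq, s0)                      # improve-or-keep baseline at s₀
    best_err = err(best_q)
    for s in torch.linspace(0.5 * s0, 1.5 * s0, grid):
        cand = rtn_group(w_gptq, float(s))
        if err(cand) < best_err:                        # keep only strict improvements
            best_q, best_err = cand, err(cand)
    return best_q
```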

Measured per-Linear output-MSE vs RTN baseline (Qwen3.6-35B, mixed visual + MTP Linears, geomean):

| Pipeline variant | out_mse ratio vs RTN |
|---|---|
| RTN (no passes) | 1.00 |
| GPTQ only | 0.41 |
| GPTQ + act_round polish (prior pipeline) | 0.99 (act_round undid GPTQ) |
| scale_sweep only | 0.33 |
| GPTQ + scale_sweep (this artifact) | 0.33 |

The prior pipeline's act_round polish was a closed-form per-weight Δw²·E[a²] minimization at the fixed group scale. It turned out to systematically undo GPTQ's cross-column error propagation — the per-weight metric minima don't respect GPTQ's compensation structure. scale_sweep replaces it as a strict improvement.
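
To see why, note that at a fixed scale the per-weight objective a_j²·(w_j − q)² is minimized by the nearest grid point no matter what a_j is, so any non-nearest rounding GPTQ chose in order to compensate error elsewhere gets flipped back. A toy illustration with hypothetical numbers:

```python
import torch

grid = torch.tensor([-1.0, 0.0, 1.0])   # toy grid at some fixed scale
w_orig, w_gptq = 0.4, 1.0                # suppose GPTQ rounded 0.4 *up* to compensate elsewhere
for a_sq in (0.01, 1.0, 100.0):          # E[a²] is a constant factor per weight ...
    q = grid[(a_sq * (w_orig - grid) ** 2).argmin()]
    print(a_sq, float(q))                # ... so the argmin is always the nearest code, 0.0
```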

AWQ's γ-fold is not applied. On NVFP4's 16-channel groups, AWQ's per-channel rescaling pushes mixed-scale values into the same group and inflates per-group quant noise rather than reducing it (measured: baseline PPL 4.97, AWQ-only 16.44 — +230 %).


Which layers are quantized

Text body (DeltaNet linear-attention + dense MoE, 40 layers)

  • Full attention Linears (q_proj, k_proj, v_proj, o_proj): mixed NVFP4 / MXFP8 / BF16 per-Linear by sensitivity
  • DeltaNet linear-attention Linears (in_proj_qkv, in_proj_z, in_proj_a, in_proj_b, out_proj): same
  • MoE experts (gate_up_proj, down_proj): per-expert NVFP4 with joint per-tensor scale across the gate_up pair so vLLM FusedMoE loads them
  • Shared expert MLP: same per-Linear policy
  • Router (mlp.gate): always BF16 (tiny, sensitive)

Multi-token-prediction (MTP) head

  • Speculative-decoding head (1 layer) + its own MoE block: same per-Linear policy, so --speculative-config method=mtp drafts at the same precision as the body.

Visual encoder (27 blocks — Qwen3.6-VL vision tower)

  • Visual Linears were forced with --visual-format=BF16.
  • 110 visual Linears were assigned BF16 uniformly.
  • model.visual.pos_embed remains BF16 passthrough as a Parameter.

Passthrough (unquantized)

  • lm_head — kept at BF16 because vLLM's ParallelLMHead module only accepts a single weight parameter. The allocator measures lm_head's Fisher sensitivity and would pick NVFP4 for it (saving ~770 MB), but the compressed-tensors runtime rejects a compressed lm_head with KeyError: lm_head.input_global_scale because its scheme registry doesn't include ParallelLMHead. This is a vLLM runtime limitation, not a PrismaQuant design decision.
  • RMSNorm weights (all layers + MTP + visual)
  • All biases
  • embed_tokens
  • model.visual.pos_embed (Parameter/Embedding, see above)

Serving (vLLM only)

This artifact is only runnable via vLLM's stock compressed-tensors support — there is no transformers-native runtime path for mixed NVFP4 + MXFP8 with packed-MoE experts today. vLLM 0.11+ or equivalent is required.

vllm serve rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.9bit-vllm \
    --trust-remote-code \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
  • FlashInfer NVFP4 attention is picked up automatically; set VLLM_USE_FLASHINFER_NVFP4=1 to make the preference explicit.
  • MTP speculative decoding at n=3 is the measured optimum for this family on DGX Spark (n=2 leaves ~10 % tok/s on the table, n=4 regresses).
  • Visual inputs work via vLLM's standard image-text-to-text chat API, with no special flags (see the example below).
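
For example, once the server above is running, a standard OpenAI-compatible client call with an image part exercises the vision tower (host, port, and image URL below are placeholders):

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; host/port here are the defaults.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.9bit-vllm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/some_image.jpg"}},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }],
)
print(resp.choices[0].message.content)
```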

Reproducing this artifact

Full pipeline is in the PrismaQuant repo:

  1. Sensitivity probe — streaming per-shard empirical-Fisher trace (diagonal) across body + MTP + visual Linears. Each shard holds only its ~2 layers resident; the rest of the model is on disk or meta. 8 multimodal calibration samples drive visual Fisher through one unified streaming context.
  2. Per-(Linear, format) cost measurement — for each Linear and each candidate format, the per-group RTN error weighted by cached input activations. Incremental: same per-shard streaming as the probe. (A condensed sketch of steps 1 and 2 follows this list.)
  3. Multi-choice knapsack allocator — picks one format per Linear minimizing total predicted Δloss under the bit budget. Target 4.9 bpp; achieved 4.907 bpp here. The same run reported a knee near target 5.0 (achieved 4.971). Known-non-Linear rank-2 tensors (pos_embed, rotary_emb) are excluded from the visual pool.
  4. Export — streams each body / visual / MTP shard, applies GPTQ + scale_sweep to NVFP4 entries, and writes compressed-tensors shards. Export recipe for this run: 511 entries with mix {NVFP4: 90, BF16: 404, MXFP8: 17}. lm_head passthrough at BF16 is enforced at this stage (see known issues).
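
A condensed, non-streaming sketch of the idea behind steps 1 and 2 (the real probe keeps only a couple of layers resident at a time and accumulates per shard; names here are illustrative, and PrismaQuant's exact cost metric may differ from this textbook diagonal approximation):

```python
import torch

def fisher_diagonal(model, calib_batches, loss_fn):
    """Empirical-Fisher diagonal: mean of squared per-weight gradients over calibration data."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    for batch in calib_batches:
        model.zero_grad()
        loss_fn(model, batch).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / len(calib_batches) for n, f in fisher.items()}

def predicted_dloss(weight, fisher_diag, quantize_fn):
    """Diagonal second-order Δloss estimate for quantizing one Linear with one candidate format."""
    dw = weight - quantize_fn(weight)                   # per-format RTN reconstruction error
    return 0.5 * float((fisher_diag * dw ** 2).sum())   # ½ Σᵢ Fᵢᵢ Δwᵢ²
```

The allocator in step 3 then consumes one such predicted Δloss per (Linear, format) pair, as in the knapsack sketch near the top of this card.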

Wall-clock on a DGX Spark (128 GB unified memory): ~15 min on cached probe + cost + activation shards (body shards are invariant across export-pass flag changes, so only the final export stage reruns when you change a flag).


Known issues / limitations

  • vLLM only at serve time. No transformers-runtime path for this precision mix today.
  • lm_head stays BF16 because vLLM's ParallelLMHead does not register the NVFP4/MXFP8 compressed-tensors schemes. Allocator measured it and would have picked NVFP4; the runtime limitation forces BF16. Costs ~770 MB on the disk footprint.
  • MTP n=4 regresses on this family. Stick to n=3 unless you verify against the draft-head acceptance-rate trace.
  • Export fast path was unavailable in this run. The exporter fell back to the torch implementation because flash-linear-attention and/or causal-conv1d dependencies were missing.
  • PyTorch CUDA capability warning on GB10. The environment printed a warning that GPU capability 12.1 is outside the declared max (12.0) for that specific torch build.

Citation

@software{prismaquant2026,
  title        = {PrismaQuant: per-Linear sensitivity-driven mixed-precision
                  quantization for LLMs},
  author       = {Tand, Rob},
  year         = 2026,
  url          = {https://github.com/RobTand/prismaquant},
}