Qwen3.6-35B-A3B — PrismaQuant 4.9 bpp

PrismaQuant source · License: Apache-2.0 · vLLM native

Mixed-precision quantization of Qwen/Qwen3.6-35B-A3B produced by PrismaQuant — a per-Linear sensitivity-driven allocator that chooses each Linear module's format individually under a total-bit budget.

Why "every layer refracts into a different format": a naive uniform NVFP4 either leaves disk on the table (keeping everything BF16 "to be safe") or loses quality (quantizing sensitive layers to 4-bit). PrismaQuant measures the actual Fisher-weighted MSE for every (Linear, format) pair and runs a multi-choice knapsack under a total-bit budget, so every bit lives where it buys the most likelihood.


At a glance

| Metric | BF16 source | This artifact | Delta |
|---|---|---|---|
| Size on disk | 70 GB | 22 GB | −69 % |
| Fraction of original weights | 100 % | 31 % | |
| Average bits per param | 16 | 4.907 | |
| Multimodal (vision + text) | ✓ | ✓ | |
| MTP speculative decoding heads | ✓ | ✓ | |
| Loads in vLLM (stock compressed-tensors) | n/a | ✓ | |
| Runtime backend | any | vLLM only | |

Precision mix

This checkpoint uses three precisions selected by PrismaQuant's allocator and exporter:

  • Allocator result at target=4.9: achieved 4.907 across costed Linears
    • BF16: 294 layers
    • NVFP4: 90 layers
    • MXFP8_E4M3: 17 layers
  • Export recipe over all emitted entries:
    • BF16: 404
    • NVFP4: 90
    • MXFP8: 17

The allocator also produced a Pareto curve and reported a suggested knee at target=5.0, achieved 4.971, with predicted Δloss=1.541e+02.

| Target bpp | Achieved bpp | Δloss (pred.) | NVFP4 | MXFP8_E4M3 | BF16 |
|---|---|---|---|---|---|
| 4.500 | 4.613 | 5.5531e+02 | 183 | 1 | 217 |
| 4.600 | 4.636 | 4.3781e+02 | 170 | 10 | 221 |
| 4.700 | 4.707 | 3.0143e+02 | 136 | 28 | 237 |
| 4.750 | 4.755 | 2.5822e+02 | 119 | 34 | 248 |
| 4.850 | 4.860 | 1.9177e+02 | 98 | 24 | 279 |
| 5.000 | 4.971 | 1.5405e+02 | 84 | 0 | 317 |
| 5.250 | 5.141 | 1.3768e+02 | 82 | 0 | 319 |
| 5.500 | 5.406 | 1.3168e+02 | 80 | 0 | 321 |
| 6.000 | 5.935 | 1.2179e+02 | 76 | 0 | 325 |
| 7.000 | 6.995 | 9.8870e+01 | 68 | 0 | 333 |
| 8.250 | 8.584 | 7.0943e+01 | 56 | 0 | 345 |
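
The knee itself can be read off such a curve mechanically. The sketch below uses a common farthest-from-chord heuristic; it is an illustration rather than necessarily the allocator's own rule:

```python
import numpy as np

def knee_index(bpp, dloss):
    """Index of the point farthest below the chord joining the curve's endpoints."""
    bpp, dloss = np.asarray(bpp, float), np.asarray(dloss, float)
    x = (bpp - bpp[0]) / (bpp[-1] - bpp[0])          # normalize both axes to [0, 1]
    y = (dloss - dloss[-1]) / (dloss[0] - dloss[-1])
    # deviation from the chord x + y = 1 (the 1/√2 factor does not change the argmax)
    return int(np.argmax(np.abs(x + y - 1.0)))
```

On the achieved-bpp and Δloss columns above, this heuristic happens to pick the target=5.0 row (achieved 4.971), agreeing with the reported knee.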

The allocator couples MoE gate_up_proj / down_proj so siblings share a scheme (vLLM's FusedMoE requires this), and fused attention siblings (q_proj/k_proj/v_proj) share one per-tensor global scale so the packed qkv_proj loads without the "accuracy mismatch" warning.
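
For the fused-attention coupling specifically, a shared NVFP4 global scale can be derived from the joint amax of the three siblings. A minimal sketch, assuming NVFP4's usual two-level scaling (FP4-E2M1 code max 6.0, per-group FP8-E4M3 scales with max 448); the function name is hypothetical:

```python
import torch

FP4_MAX, FP8_E4M3_MAX = 6.0, 448.0   # assumed NVFP4 two-level scaling constants

def shared_qkv_global_scale(q_w, k_w, v_w):
    """One per-tensor global scale reused by q/k/v so the packed qkv_proj sees a single value."""
    amax = torch.stack([w.abs().max() for w in (q_w, k_w, v_w)]).max()
    # map the worst-case per-group scale into FP8-E4M3 range using the joint amax
    return (FP4_MAX * FP8_E4M3_MAX) / amax.float()
```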

Activation-aware passes applied during export

On every NVFP4 weight the exporter runs, in order:

  1. GPTQ-OBS one-shot rounding — block-wise error propagation along the group-quant structure using the calibration Hessian. Closed-form, not iterative. Handles cross-column activation coupling.
  2. Closed-form per-group scale sweep — for each 16-weight NVFP4 group, enumerate grid=32 candidate scales spanning [0.5·s₀, 1.5·s₀], round every weight to its nearest codebook value at each candidate scale, and pick the (scale, rounding-set) pair that minimizes the activation-weighted per-group MSE Σ_j a_j²·(w_orig,j − w_q,j)². An improve-or-keep gate compares against the post-GPTQ weight and keeps whichever is better. This handles within-group weight-distribution variation that GPTQ treats as fixed. (Both passes are sketched below.)
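
A compact sketch of the GPTQ-OBS pass in item 1, written column by column with a plain damped inverse for clarity; the real exporter's block-wise, group-aware formulation (and its Cholesky arithmetic) is more involved, and `quantize_col` stands in for the NVFP4 group quantizer:

```python
import torch

def gptq_obs(W, H, quantize_col):
    """One-shot OBS rounding: quantize columns left to right and fold each column's
    quantization error into the not-yet-quantized columns via the inverse Hessian.
    W: (rows, cols) weight; H: (cols, cols) calibration Hessian, roughly E[x xᵀ]."""
    W = W.clone()
    damp = 1e-2 * H.diag().mean() * torch.eye(H.shape[0], device=H.device)
    Hinv = torch.linalg.inv(H + damp)
    for i in range(W.shape[1]):
        q = quantize_col(W[:, i])                     # e.g. NVFP4 RTN on this column
        err = (W[:, i] - q) / Hinv[i, i]
        W[:, i] = q
        if i + 1 < W.shape[1]:                        # propagate error to remaining columns
            W[:, i + 1:] -= err.unsqueeze(1) * Hinv[i, i + 1:].unsqueeze(0)
    return W
```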

Scale_sweep is the closed-form analog of Intel's AutoRound — where AutoRound learns per-weight continuous rounding offsets V via 200 SGD iterations on a relaxed loss, scale_sweep enumerates the discrete scale dimension directly and lets RTN pick rounding conditional on scale. No gradient descent, sub-second per Linear.
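
And a matching sketch of the scale sweep in item 2; function names are illustrative, and the FP8 quantization of the group scale itself is omitted for brevity:

```python
import torch

FP4_MAGS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])   # E2M1 magnitudes
FP4_GRID = torch.cat([-FP4_MAGS.flip(0), FP4_MAGS])                  # signed codebook

def rtn_group(w, scale):
    """Round each weight in the group to its nearest FP4 code at the given scale."""
    idx = (w[:, None] / scale - FP4_GRID).abs().argmin(dim=1)
    return FP4_GRID[idx] * scale

def sweep_group(w_orig, w_gptq, act_sq, grid=32):
    """w_orig / w_gptq: one 16-weight group before / after GPTQ; act_sq: per-column E[a²]."""
    s0 = float(w_orig.abs().max()) / 6.0 + 1e-12       # default RTN scale (amax / max code)
    err = lambda w_q: float((act_sq * (w_orig - w_q) ** 2).sum())
    best_q = rtn_group(w_gptq, s0)                      # improve-or-keep baseline at s₀
    best_err = err(best_q)
    for s in torch.linspace(0.5 * s0, 1.5 * s0, grid):
        cand = rtn_group(w_gptq, float(s))
        if err(cand) < best_err:                        # keep only strict improvements
            best_q, best_err = cand, err(cand)
    return best_q
```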

Measured per-Linear output-MSE vs RTN baseline (Qwen3.6-35B, mixed visual + MTP Linears, geomean):

| Pipeline variant | out_mse ratio vs RTN |
|---|---|
| RTN (no passes) | 1.00 |
| GPTQ only | 0.41 |
| GPTQ + act_round polish (prior pipeline) | 0.99 (act_round undid GPTQ) |
| scale_sweep only | 0.33 |
| GPTQ + scale_sweep (this artifact) | 0.33 |

The prior pipeline's act_round polish was a closed-form per-weight Δw²·E[a²] minimization at the fixed group scale. It turned out to systematically undo GPTQ's cross-column error propagation — the per-weight metric minima don't respect GPTQ's compensation structure. scale_sweep replaces it as a strict improvement.
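
To see why, note that at a fixed scale the per-weight objective a_j²·(w_j − q)² is minimized by the nearest grid point no matter what a_j is, so any non-nearest rounding GPTQ chose in order to compensate error elsewhere gets flipped back. A toy illustration with hypothetical numbers:

```python
import torch

grid = torch.tensor([-1.0, 0.0, 1.0])   # toy grid at some fixed scale
w_orig, w_gptq = 0.4, 1.0                # suppose GPTQ rounded 0.4 *up* to compensate elsewhere
for a_sq in (0.01, 1.0, 100.0):          # E[a²] is a constant factor per weight ...
    q = grid[(a_sq * (w_orig - grid) ** 2).argmin()]
    print(a_sq, float(q))                # ... so the argmin is always the nearest code, 0.0
```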

AWQ's γ-fold is not applied. On NVFP4's 16-channel groups, AWQ's per-channel rescaling pushes mixed-scale values into the same group and inflates per-group quant noise rather than reducing it (measured: baseline PPL 4.97, AWQ-only 16.44 — +230 %).


Which layers are quantized

Text body (DeltaNet linear-attention + dense MoE, 40 layers)

  • Full attention Linears (q_proj, k_proj, v_proj, o_proj): mixed NVFP4 / MXFP8 / BF16 per-Linear by sensitivity
  • DeltaNet linear-attention Linears (in_proj_qkv, in_proj_z, in_proj_a, in_proj_b, out_proj): same
  • MoE experts (gate_up_proj, down_proj): per-expert NVFP4 with joint per-tensor scale across the gate_up pair so vLLM FusedMoE loads them
  • Shared expert MLP: same per-Linear policy
  • Router (mlp.gate): always BF16 (tiny, sensitive)

Multi-token-prediction (MTP) head

  • Speculative-decoding head (1 layer) + its own MoE block: same per-Linear policy, so --speculative-config method=mtp drafts at the same precision as the body.

Visual encoder (27 blocks — Qwen3.6-VL vision tower)

  • Visual Linears were forced with --visual-format=BF16.
  • 110 visual Linears were assigned BF16 uniformly.
  • model.visual.pos_embed remains BF16 passthrough as a Parameter.

Passthrough (unquantized)

  • lm_head — kept at BF16 because vLLM's ParallelLMHead module only accepts a single weight parameter. The allocator measures lm_head's Fisher sensitivity and would pick NVFP4 for it (saving ~770 MB), but the compressed-tensors runtime rejects a compressed lm_head with KeyError: lm_head.input_global_scale because its scheme registry doesn't include ParallelLMHead. This is a vLLM runtime limitation, not a PrismaQuant design decision.
  • RMSNorm weights (all layers + MTP + visual)
  • All biases
  • embed_tokens
  • model.visual.pos_embed (Parameter/Embedding, see above)

Serving (vLLM only)

This artifact is only runnable via vLLM's stock compressed-tensors support — there is no transformers-native runtime path for mixed NVFP4 + MXFP8 with packed-MoE experts today. vLLM 0.11+ or equivalent is required.

vllm serve rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.9bit-vllm \
    --trust-remote-code \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
  • FlashInfer NVFP4 attention is picked up automatically; set VLLM_USE_FLASHINFER_NVFP4=1 to make the preference explicit.
  • MTP speculative decoding at n=3 is the measured optimum for this family on DGX Spark (n=2 leaves ~10 % tok/s on the table, n=4 regresses).
  • Visual inputs work via vLLM's standard image-text-to-text chat API, with no special flags (see the example below).
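
For example, once the server above is running, a standard OpenAI-compatible client call with an image part exercises the vision tower (host, port, and image URL below are placeholders):

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; host/port here are the defaults.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.9bit-vllm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/some_image.jpg"}},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }],
)
print(resp.choices[0].message.content)
```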

Reproducing this artifact

Full pipeline is in the PrismaQuant repo:

  1. Sensitivity probe — streaming per-shard empirical-Fisher trace (diagonal) across body + MTP + visual Linears. Each shard holds only its ~2 layers resident; the rest of the model is on disk or meta. 8 multimodal calibration samples drive visual Fisher through one unified streaming context.
  2. Per-(Linear, format) cost measurement — for each Linear and each candidate format, the per-group RTN error weighted by cached input activations. Incremental: same per-shard streaming as the probe. (A condensed sketch of steps 1 and 2 follows this list.)
  3. Multi-choice knapsack allocator — picks one format per Linear minimizing total predicted Δloss under the bit budget. Target 4.9 bpp; achieved 4.907 bpp here. The same run reported a knee near target 5.0 (achieved 4.971). Known-non-Linear rank-2 tensors (pos_embed, rotary_emb) are excluded from the visual pool.
  4. Export — streams each body / visual / MTP shard, applies GPTQ + scale_sweep to NVFP4 entries, and writes compressed-tensors shards. Export recipe for this run: 511 entries with mix {NVFP4: 90, BF16: 404, MXFP8: 17}. lm_head passthrough at BF16 is enforced at this stage (see known issues).
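
A condensed, non-streaming sketch of the idea behind steps 1 and 2 (the real probe keeps only a couple of layers resident at a time and accumulates per shard; names here are illustrative, and PrismaQuant's exact cost metric may differ from this textbook diagonal approximation):

```python
import torch

def fisher_diagonal(model, calib_batches, loss_fn):
    """Empirical-Fisher diagonal: mean of squared per-weight gradients over calibration data."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    for batch in calib_batches:
        model.zero_grad()
        loss_fn(model, batch).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / len(calib_batches) for n, f in fisher.items()}

def predicted_dloss(weight, fisher_diag, quantize_fn):
    """Diagonal second-order Δloss estimate for quantizing one Linear with one candidate format."""
    dw = weight - quantize_fn(weight)                   # per-format RTN reconstruction error
    return 0.5 * float((fisher_diag * dw ** 2).sum())   # ½ Σᵢ Fᵢᵢ Δwᵢ²
```

The allocator in step 3 then consumes one such predicted Δloss per (Linear, format) pair, as in the knapsack sketch near the top of this card.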

Wall-clock on a DGX Spark (128 GB unified memory): ~15 min on cached probe + cost + activation shards (body shards are invariant across export-pass flag changes, so only the final export stage reruns when you change a flag).


Known issues / limitations

  • vLLM only at serve time. No transformers-runtime path for this precision mix today.
  • lm_head stays BF16 because vLLM's ParallelLMHead does not register the NVFP4/MXFP8 compressed-tensors schemes. Allocator measured it and would have picked NVFP4; the runtime limitation forces BF16. Costs ~770 MB on the disk footprint.
  • MTP n=4 regresses on this family. Stick to n=3 unless you verify against the draft-head acceptance-rate trace.
  • Export fast path was unavailable in this run. The exporter fell back to the torch implementation because flash-linear-attention and/or causal-conv1d dependencies were missing.
  • PyTorch CUDA capability warning on GB10. The environment printed a warning that GPU capability 12.1 is outside the declared max (12.0) for that specific torch build.

Citation

@software{prismaquant2026,
  title        = {PrismaQuant: per-Linear sensitivity-driven mixed-precision
                  quantization for LLMs},
  author       = {Tand, Rob},
  year         = 2026,
  url          = {https://github.com/RobTand/prismaquant},
}