# huginnfork/Qwen3.6-27B-NVFP4A16
NVFP4A16 quantisation of Qwen/Qwen3.6-27B, with the MTP head and vision tower preserved in bf16.
## Provenance
- Base: `Qwen/Qwen3.6-27B` (bf16)
- MTP head: re-grafted from `Qwen/Qwen3.6-27B` (15 tensors, ~810 MB bf16) so that SGLang/vLLM speculative decoding (`--speculative-algo NEXTN`) works
- Vision tower: `model.visual.*` preserved in bf16 (333 tensors)
## Quantisation
- Format: NVFP4A16 W4A16 (4-bit FP4 weights with FP8 E4M3 per-group scales, group_size=16; bf16 activations) via `llm-compressor` 0.10 + `compressed-tensors` 0.14
- Scheme: see `recipe.yaml` (an illustrative equivalent is sketched below). NVFP4A16 keeps activations in bf16, which is the dominant KLD source under W4A4 on this hybrid stack.
- Kept in bf16 (`quantization_config.ignore`): `lm_head`, all `model.visual.*`, all `mtp.*`, and the entire `linear_attn` Mamba/SSM block (`in_proj_*`, `out_proj`, `conv1d`)
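For orientation, a minimal sketch of an equivalent `llm-compressor` one-shot run. The modifier arguments and ignore patterns below are illustrative assumptions; the `recipe.yaml` shipped with this repo is authoritative.

```python
# Illustrative sketch approximating the scheme in recipe.yaml (not the exact recipe).
import torch
from transformers import AutoModelForImageTextToText
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.6-27B", dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
)

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4A16",  # FP4 weights + FP8 E4M3 per-group scales (group_size=16), bf16 activations
    ignore=[
        "lm_head",
        "re:model\\.visual\\..*",   # vision tower stays bf16
        "re:mtp\\..*",              # MTP head stays bf16
        "re:.*linear_attn.*",       # Mamba/SSM block stays bf16
    ],
)

# NVFP4A16 is weight-only, so no calibration dataset is passed here.
oneshot(model=model, recipe=recipe, output_dir="Qwen3.6-27B-NVFP4A16")
```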
## KL divergence measurements
KLD computed with `eval_kld.py`: per-token KLD averaged over 8 samples from `neuralmagic/calibration` (LLM split), max_seq=1024. Max sample KLD is the highest single-sample mean (it catches outliers that the overall mean hides).
| Comparison | Mean KLD (nats) | Max sample KLD | Samples | max_seq |
|---|---|---|---|---|
| vs Qwen3.6-27B base | 0.0747 | 0.1947 | 8 | 1024 |
Note: this pipeline always uploads the resulting checkpoint. Consult the KL divergence numbers above to judge whether the result is acceptable for your use case.
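For reference, a minimal sketch of how such a per-token KLD can be computed. This is not the actual `eval_kld.py`; the loading details mirror the Inference section below and are assumptions.

```python
# Illustrative per-token KLD sketch (not the eval_kld.py used for the table above).
import torch
import torch.nn.functional as F
from transformers import AutoModelForImageTextToText, AutoTokenizer

base_id = "Qwen/Qwen3.6-27B"
quant_id = "huginnfork/Qwen3.6-27B-NVFP4A16"

tok = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
base = AutoModelForImageTextToText.from_pretrained(
    base_id, dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)
quant = AutoModelForImageTextToText.from_pretrained(
    quant_id, dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)

@torch.no_grad()
def mean_kld(text: str, max_seq: int = 1024) -> float:
    ids = tok(text, return_tensors="pt", truncation=True, max_length=max_seq).input_ids
    # Log-probabilities from both models over the same token positions
    p = F.log_softmax(base(input_ids=ids.to(base.device)).logits.float(), dim=-1)
    q = F.log_softmax(quant(input_ids=ids.to(quant.device)).logits.float(), dim=-1)
    # KL(base || quant) per token, averaged over the sequence, in nats
    return (p.exp() * (p - q.to(p.device))).sum(-1).mean().item()
```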
## Perplexity (wikitext-2-raw)
Wikitext-2-raw test split, non-overlapping chunks of 2048 tokens, computed with `eval_ppl.py`. The same tokenizer is used for every row, so the numbers are directly comparable.
| Model | Perplexity | Tokens scored | Dataset | seq |
|---|---|---|---|---|
| Qwen3.6-27B base (bf16) | 7.3057 | 296907 | wikitext/wikitext-2-raw-v1/test | 2048 |
| this checkpoint | 7.6520 | 296907 | wikitext/wikitext-2-raw-v1/test | 2048 |
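A minimal sketch of the non-overlapping-chunk measurement (illustrative; not the actual `eval_ppl.py`). It assumes a model and tokenizer loaded as in the Inference section.

```python
# Illustrative non-overlapping-chunk perplexity sketch (not the actual eval_ppl.py).
import math
import torch
from datasets import load_dataset

@torch.no_grad()
def wikitext2_ppl(model, tokenizer, seq_len: int = 2048) -> float:
    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids
    nll, n_tokens = 0.0, 0
    for start in range(0, ids.shape[1] - seq_len + 1, seq_len):  # non-overlapping chunks
        chunk = ids[:, start:start + seq_len].to(model.device)
        # labels=chunk gives the mean next-token cross-entropy over seq_len - 1 positions
        loss = model(input_ids=chunk, labels=chunk).loss
        nll += loss.float().item() * (seq_len - 1)
        n_tokens += seq_len - 1
    return math.exp(nll / n_tokens)
```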
## Inference
### transformers (text + vision; MTP not exercised)
```python
from transformers import AutoModelForImageTextToText, AutoProcessor
import torch

repo = "huginnfork/Qwen3.6-27B-NVFP4A16"
proc = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    repo, dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
)
```
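A short text-only generation call against the objects above (a sketch; the chat-template arguments assume a recent transformers release):

```python
# Minimal text-only generation sketch using the processor's chat template.
messages = [{"role": "user", "content": [{"type": "text", "text": "One sentence about FP4 quantisation."}]}]
inputs = proc.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens
print(proc.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```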
vLLM (NVFP4A16 + MTP speculative decoding)
vllm serve huginnfork/Qwen3.6-27B-NVFP4A16 \
--trust-remote-code \
--gpu-memory-utilization 0.85 \
--max-model-len 8192 \
--quantization compressed-tensors \
--speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":1}'
Native FP4 GEMM requires Blackwell (SM100+); on older GPUs vLLM dequantises the FP4 weights to bf16 at compute time.
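Once the server is up, it exposes vLLM's OpenAI-compatible API (default port 8000). A minimal query sketch:

```python
# Query the OpenAI-compatible endpoint exposed by `vllm serve` (default port 8000).
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "huginnfork/Qwen3.6-27B-NVFP4A16",
        "messages": [{"role": "user", "content": "Summarise NVFP4A16 in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```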