huginnfork/Qwen3.6-27B-NVFP4A16

NVFP4A16 quantisation of Qwen/Qwen3.6-27B, with the MTP head and vision tower preserved in bf16.

Provenance

  • Base: Qwen/Qwen3.6-27B (bf16)
  • MTP head: re-grafted from Qwen/Qwen3.6-27B (15 tensors, ~810 MB bf16) so SGLang/vLLM speculative decoding (--speculative-algo NEXTN) works
  • Vision tower: model.visual.* preserved in bf16 (333 tensors)

Quantisation

  • Format: NVFP4A16 W4A16 (4-bit FP4 weights with FP8 E4M3 per-group scales, group_size=16; bf16 activations) via llm-compressor 0.10 + compressed-tensors 0.14
  • Scheme: see recipe.yaml. NVFP4A16 keeps activations in bf16, avoiding the activation quantisation that is the dominant KLD source under W4A4 on this hybrid stack.
  • Kept in bf16 (quantization_config.ignore): lm_head, all model.visual.*, all mtp.*, and the entire linear_attn Mamba/SSM block (in_proj_*, out_proj, conv1d)
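To verify what actually stayed in bf16 in the uploaded checkpoint, the ignore list can be read straight from config.json. A minimal sketch, assuming the key names follow the usual compressed-tensors layout (a quantization_config block with format and ignore entries):

import json
from huggingface_hub import hf_hub_download

# Download only config.json and print the compressed-tensors quantisation metadata.
cfg_path = hf_hub_download("huginnfork/Qwen3.6-27B-NVFP4A16", "config.json")
with open(cfg_path) as f:
    qcfg = json.load(f)["quantization_config"]

print(qcfg.get("format"))               # compressed-tensors packing format
for pattern in qcfg.get("ignore", []):  # module patterns excluded from FP4 (kept in bf16)
    print(pattern)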

KL divergence measurements

KLD computed with eval_kld.py — per-token KLD averaged over 8 samples from neuralmagic/calibration (LLM split), max_seq=1024. Max sample KLD is the highest single-sample mean (catches outliers that the overall mean hides).

Comparison            Mean KLD (nats)  Max sample KLD  Samples  max_seq
vs Qwen3.6-27B base   0.0747           0.1947          8        1024

Note: this pipeline always uploads the resulting checkpoint. Consult the KL divergence numbers above to judge whether the result is acceptable for your use case.
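eval_kld.py is not reproduced here, but the metric itself is straightforward. A minimal sketch of the per-token KLD (nats) between the bf16 base logits and this checkpoint's logits, assuming both have already been computed for the same token sequence:

import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_token_kld(base_logits: torch.Tensor, quant_logits: torch.Tensor) -> float:
    """KL(P_base || P_quant) per token position, averaged. Shapes: (seq_len, vocab)."""
    logp = F.log_softmax(base_logits.float(), dim=-1)
    logq = F.log_softmax(quant_logits.float(), dim=-1)
    kld = (logp.exp() * (logp - logq)).sum(dim=-1)  # per-token KLD in nats
    return kld.mean().item()

The "Mean KLD" column is the average of this value over the 8 calibration samples; "Max sample KLD" is its maximum.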

Perplexity (wikitext-2-raw)

Wikitext-2-raw test split, non-overlapping chunks of 2048 tokens, computed with eval_ppl.py. Same tokenizer for every row so the numbers compare apples-to-apples.

Model                      Perplexity  Tokens scored  Dataset                          seq
Qwen3.6-27B base (bf16)    7.3057      296907         wikitext/wikitext-2-raw-v1/test  2048
this checkpoint            7.6520      296907         wikitext/wikitext-2-raw-v1/test  2048
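eval_ppl.py is likewise not included; the setup can be approximated with standard chunked scoring. A rough sketch, assuming the checkpoint loads through the same AutoModelForImageTextToText path as in the Inference section and accepts text-only labels= for the loss (both assumptions, untested here):

import math
import torch
from datasets import load_dataset
from transformers import AutoModelForImageTextToText, AutoTokenizer

repo = "huginnfork/Qwen3.6-27B-NVFP4A16"  # or Qwen/Qwen3.6-27B for the base row
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    repo, dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
)

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids[0]

seq, nll_sum, n_scored = 2048, 0.0, 0
for start in range(0, ids.numel() - seq + 1, seq):            # non-overlapping 2048-token chunks
    chunk = ids[start:start + seq].unsqueeze(0).to(model.device)
    with torch.no_grad():
        loss = model(input_ids=chunk, labels=chunk).loss      # mean NLL over seq-1 targets
    nll_sum += loss.item() * (seq - 1)
    n_scored += seq - 1
print(f"perplexity: {math.exp(nll_sum / n_scored):.4f}")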

Inference

transformers (text + vision; MTP not exercised)

from transformers import AutoModelForImageTextToText, AutoProcessor
import torch

repo = "huginnfork/Qwen3.6-27B-NVFP4A16"
# The processor bundles the tokenizer and the image preprocessor for the vision tower.
proc = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
# bf16 matches the preserved modules; device_map="auto" shards across available GPUs.
model = AutoModelForImageTextToText.from_pretrained(
    repo, dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
)
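Continuing the snippet above (reuses proc and model), a text-only chat turn as a hedged usage sketch; whether this checkpoint's remote code exposes apply_chat_template exactly like stock Qwen VL processors is an assumption:

# Text-only prompt; image inputs would be added as {"type": "image", ...} entries.
messages = [{"role": "user", "content": [{"type": "text", "text": "Give a one-line summary of NVFP4A16."}]}]
inputs = proc.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(proc.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])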

vLLM (NVFP4A16 + MTP speculative decoding)

vllm serve huginnfork/Qwen3.6-27B-NVFP4A16 \
    --trust-remote-code \
    --gpu-memory-utilization 0.85 \
    --max-model-len 8192 \
    --quantization compressed-tensors \
    --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":1}'

FP4 weights are unpacked to bf16 at compute time: native FP4 GEMM requires Blackwell (SM100+), so on older GPUs vLLM dequantises the weights at runtime.
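Once the server is up, it speaks the standard OpenAI-compatible API. A client-side sketch assuming vLLM's defaults (port 8000, served model name equal to the repo path):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key
resp = client.chat.completions.create(
    model="huginnfork/Qwen3.6-27B-NVFP4A16",
    messages=[{"role": "user", "content": "Summarise this model in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)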
