huginnfork/Qwen3.6-27B-NVFP4A16

NVFP4A16 quantisation of Qwen/Qwen3.6-27B, with the MTP head and vision tower preserved in bf16.

Provenance

  • Base: Qwen/Qwen3.6-27B (bf16)
  • MTP head: re-grafted from Qwen/Qwen3.6-27B (15 tensors, ~810 MB bf16) so SGLang/vLLM speculative decoding (--speculative-algo NEXTN) works
  • Vision tower: model.visual.* preserved in bf16 (333 tensors)

Quantisation

  • Format: NVFP4A16 W4A16 (4-bit FP4 weights with FP8 E4M3 per-group scales, group_size=16; bf16 activations) via llm-compressor 0.10 + compressed-tensors 0.14
  • Scheme: see recipe.yaml. NVFP4A16 keeps activations in bf16, avoiding the activation quantisation that is the dominant KLD source under W4A4 on this hybrid stack.
  • Kept in bf16 (quantization_config.ignore): lm_head, all model.visual.*, all mtp.*, and the entire linear_attn Mamba/SSM block (in_proj_*, out_proj, conv1d)
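To verify what actually stayed in bf16 in the uploaded checkpoint, the ignore list can be read straight from config.json. A minimal sketch, assuming the key names follow the usual compressed-tensors layout (a quantization_config block with format and ignore entries):

import json
from huggingface_hub import hf_hub_download

# Download only config.json and print the compressed-tensors quantisation metadata.
cfg_path = hf_hub_download("huginnfork/Qwen3.6-27B-NVFP4A16", "config.json")
with open(cfg_path) as f:
    qcfg = json.load(f)["quantization_config"]

print(qcfg.get("format"))               # compressed-tensors packing format
for pattern in qcfg.get("ignore", []):  # module patterns excluded from FP4 (kept in bf16)
    print(pattern)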

KL divergence measurements

KLD computed with eval_kld.py — per-token KLD averaged over 8 samples from neuralmagic/calibration (LLM split), max_seq=1024. Max sample KLD is the highest single-sample mean (catches outliers that the overall mean hides).

Comparison            Mean KLD (nats)  Max sample KLD  Samples  max_seq
vs Qwen3.6-27B base   0.0747           0.1947          8        1024

Note: this pipeline always uploads the resulting checkpoint. Consult the KL divergence numbers above to judge whether the result is acceptable for your use case.
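eval_kld.py is not reproduced here, but the metric itself is straightforward. A minimal sketch of the per-token KLD (nats) between the bf16 base logits and this checkpoint's logits, assuming both have already been computed for the same token sequence:

import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_token_kld(base_logits: torch.Tensor, quant_logits: torch.Tensor) -> float:
    """KL(P_base || P_quant) per token position, averaged. Shapes: (seq_len, vocab)."""
    logp = F.log_softmax(base_logits.float(), dim=-1)
    logq = F.log_softmax(quant_logits.float(), dim=-1)
    kld = (logp.exp() * (logp - logq)).sum(dim=-1)  # per-token KLD in nats
    return kld.mean().item()

The "Mean KLD" column is the average of this value over the 8 calibration samples; "Max sample KLD" is its maximum.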

Perplexity (wikitext-2-raw)

Wikitext-2-raw test split, non-overlapping chunks of 2048 tokens, computed with eval_ppl.py. Same tokenizer for every row so the numbers compare apples-to-apples.

Model                      Perplexity  Tokens scored  Dataset                          seq
Qwen3.6-27B base (bf16)    7.3057      296907         wikitext/wikitext-2-raw-v1/test  2048
this checkpoint            7.6520      296907         wikitext/wikitext-2-raw-v1/test  2048
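eval_ppl.py is likewise not included; the setup can be approximated with standard chunked scoring. A rough sketch, assuming the checkpoint loads through the same AutoModelForImageTextToText path as in the Inference section and accepts text-only labels= for the loss (both assumptions, untested here):

import math
import torch
from datasets import load_dataset
from transformers import AutoModelForImageTextToText, AutoTokenizer

repo = "huginnfork/Qwen3.6-27B-NVFP4A16"  # or Qwen/Qwen3.6-27B for the base row
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    repo, dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
)

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids[0]

seq, nll_sum, n_scored = 2048, 0.0, 0
for start in range(0, ids.numel() - seq + 1, seq):            # non-overlapping 2048-token chunks
    chunk = ids[start:start + seq].unsqueeze(0).to(model.device)
    with torch.no_grad():
        loss = model(input_ids=chunk, labels=chunk).loss      # mean NLL over seq-1 targets
    nll_sum += loss.item() * (seq - 1)
    n_scored += seq - 1
print(f"perplexity: {math.exp(nll_sum / n_scored):.4f}")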

Inference

transformers (text + vision; MTP not exercised)

from transformers import AutoModelForImageTextToText, AutoProcessor
import torch

repo = "huginnfork/Qwen3.6-27B-NVFP4A16"
# The processor bundles the tokenizer and the image preprocessor for the vision tower.
proc = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
# bf16 matches the preserved modules; device_map="auto" shards across available GPUs.
model = AutoModelForImageTextToText.from_pretrained(
    repo, dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
)
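Continuing the snippet above (reuses proc and model), a text-only chat turn as a hedged usage sketch; whether this checkpoint's remote code exposes apply_chat_template exactly like stock Qwen VL processors is an assumption:

# Text-only prompt; image inputs would be added as {"type": "image", ...} entries.
messages = [{"role": "user", "content": [{"type": "text", "text": "Give a one-line summary of NVFP4A16."}]}]
inputs = proc.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(proc.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])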

vLLM (NVFP4A16 + MTP speculative decoding)

vllm serve huginnfork/Qwen3.6-27B-NVFP4A16 \
    --trust-remote-code \
    --gpu-memory-utilization 0.85 \
    --max-model-len 8192 \
    --quantization compressed-tensors \
    --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":1}'

FP4 weights are unpacked to bf16 at compute time: native FP4 GEMM requires Blackwell (SM100+), so on older GPUs vLLM dequantises the weights at runtime.
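Once the server is up, it speaks the standard OpenAI-compatible API. A client-side sketch assuming vLLM's defaults (port 8000, served model name equal to the repo path):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key
resp = client.chat.completions.create(
    model="huginnfork/Qwen3.6-27B-NVFP4A16",
    messages=[{"role": "user", "content": "Summarise this model in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)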
