# Qwen3.6-27B-W4A16-G128

W4A16 GPTQ quantization of Qwen/Qwen3.6-27B, targeted at single-GPU vLLM serving on Ada-class GPUs (L40S, RTX 6000 Ada, RTX 4090) where the FP8 checkpoint leaves too little VRAM for a usable KV cache at long context.

## TL;DR

|  | FP8 (upstream) | This (W4A16-G128) |
|---|---|---|
| Disk size | 29 GiB | 19 GiB |
| KV cache @ L40S 46 GiB, FP8 KV, MTP on, util 0.95 | 9.86 GiB → 75K tokens at 262K max | 21.79 GiB → 166K tokens at 200K max |
| Single-stream decode tok/s (random 1024/256, MTP=2) | 40.1 | 55.7 (+39 %) |
| Single-stream TPOT | 24.9 ms | 16.6 ms (-33 %) |
| Concurrent @16 output tok/s | 219 | 265 (+21 %) |
| Concurrent @16 TTFT p50 | 9.3 s | 8.0 s (-14 %) |
Quantization wins on every measured axis except single-stream prefill TTFT (+13 %, from gptq_marlin dequantization overhead during prefill; mitigated under concurrent load).

## What is quantized

Quantized (W4A16, `group_size=128`, symmetric, `desc_act` on):

- All `model.language_model.layers.{0..63}.self_attn.{q,k,v,o}_proj` (16 full-attention layers)
- All `model.language_model.layers.{0..63}.linear_attn.{in_proj_qkv,in_proj_z,out_proj}` (48 Gated-DeltaNet layers)
- All `model.language_model.layers.{0..63}.mlp.{gate,up,down}_proj`
Kept BF16:

- Vision tower (`model.visual.*` - all 27 SigLIP-style blocks + merger)
- MTP head (`mtp.*` - 15 tensors, 0.85 GiB) - required for speculative decoding correctness
- `lm_head.weight` and `model.language_model.embed_tokens.weight`
- All `*_norm` and `*.conv1d.weight` (Mamba conv)
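
To sanity-check which modules actually carry 4-bit weights after download, you can scan `model.safetensors.index.json` for GPTQ-packed tensors. A minimal sketch, assuming gptqmodel's usual tensor naming (`*.qweight` / `*.qzeros` / `*.scales`) and a hypothetical local snapshot path; not part of the official tooling:

```python
import json
from collections import Counter
from pathlib import Path

snapshot = Path("Qwen3.6-27B-W4A16-G128")  # hypothetical local download path
index = json.loads((snapshot / "model.safetensors.index.json").read_text())

quantized, bf16 = Counter(), Counter()
for name in index["weight_map"]:
    module = name.rsplit(".", 1)[0]
    if name.endswith(".qweight"):
        # gptqmodel packs 4-bit layers as <module>.qweight/.qzeros/.scales(/.g_idx)
        quantized[module.split(".")[-1]] += 1   # e.g. q_proj, in_proj_qkv, down_proj
    elif name.endswith(".weight") and "norm" not in name:
        bf16[module.split(".")[-1]] += 1        # embeddings, vision tower, MTP head, conv1d

print("quantized module types:", dict(quantized))
print("BF16 module types:", dict(bf16))
```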

## Quantization recipe

- Tool: `gptqmodel 6.0.3`
- Stack: `transformers 5.6.0`, `torch 2.11.0+cu130`, `torchvision 0.26.0`
- Calibration: 512 samples from `HuggingFaceH4/ultrachat_200k`, `max_seq_length=2048`, applied with the model's chat template.
- Hyperparameters: `bits=4, group_size=128, sym=True, desc_act=True, damp_percent=0.01`
- Hardware: single L40S 46 GiB (sm_89, Ada Lovelace), driver 595.58.03, CUDA 13.2.
- Wall clock: 53.4 minutes.
- Per-layer GPTQ losses: 1e-9 to 1e-5 (very small - clean fit on hybrid attention).
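
The production script is not published here, but the recipe above maps onto gptqmodel's standard API roughly as follows. A minimal sketch, assuming gptqmodel's `GPTQModel`/`QuantizeConfig` interface and the `train_sft` split of ultrachat_200k; the exact sample selection and sequence-length handling in the real run may differ:

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from gptqmodel import GPTQModel, QuantizeConfig

base = "Qwen/Qwen3.6-27B"
tok = AutoTokenizer.from_pretrained(base)

# 512 chat samples rendered through the model's own chat template
# (the production run capped samples at max_seq_length=2048).
rows = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft").shuffle(seed=0).select(range(512))
calibration = [tok.apply_chat_template(r["messages"], tokenize=False) for r in rows]

cfg = QuantizeConfig(bits=4, group_size=128, sym=True, desc_act=True, damp_percent=0.01)

model = GPTQModel.load(base, cfg)   # load BF16 base weights
model.quantize(calibration)         # run GPTQ against the calibration set
model.save("Qwen3.6-27B-W4A16-G128")
```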

## vLLM serving

Tested on `vllm 0.19.1` with `transformers 5.6.0+`. The GPTQ quantization config is auto-detected from the checkpoint; no `--quantization` flag is required.

```bash
vllm serve LibertAIDAI/Qwen3.6-27B-W4A16-G128 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 8 \
  --max-model-len 200000 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --limit-mm-per-prompt '{"image":4,"video":0,"audio":0}' \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
```

Drop the `--speculative-config` flag if you do not want MTP - it saves 0.5 GiB of VRAM (~4K extra KV-cache tokens) at the cost of roughly 2x slower single-stream decode.
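
Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch assuming the default vLLM endpoint on `http://localhost:8000/v1`; the model name must match the served repo id:

```python
from openai import OpenAI

# vLLM does not require a real API key unless one was configured at launch.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="LibertAIDAI/Qwen3.6-27B-W4A16-G128",
    messages=[{"role": "user", "content": "Explain W4A16 quantization in two sentences."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```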

## Caveats

- Calibration was English chat only (`ultrachat_200k`). Multilingual and code-heavy workloads have not been measured against a held-out set; expect quality on par with other community W4A16 GPTQ Qwen quants, but this is not specifically guaranteed.
- The vision tower is kept in BF16, so vision-only quality should closely match upstream Qwen3.6-27B-FP8. Not separately re-evaluated.
- The MTP head `mtp.safetensors` is shipped as a separate file referenced from `model.safetensors.index.json`. Keep both alongside the main shards or vLLM will boot without spec decoding (silently - no error, just ~2x slower decode); a pre-flight check sketch follows this list.
- Tested only on Ada (L40S, sm_89). Should also work on Hopper, Blackwell, and Ampere via gptq_marlin, but not benchmarked there.
- `desc_act=True` may slightly increase load time vs `desc_act=False`; it was chosen for the small accuracy benefit on long-tail distributions in the linear-attention layers.
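
Because a missing `mtp.safetensors` fails silently, a quick pre-flight check of the local snapshot can save debugging time. A minimal sketch, assuming a hypothetical local download path; not part of the official tooling:

```python
import json
from pathlib import Path

snapshot = Path("Qwen3.6-27B-W4A16-G128")  # hypothetical local download path
index = json.loads((snapshot / "model.safetensors.index.json").read_text())

# Every shard the index references must be present, including the MTP draft head.
expected = set(index["weight_map"].values())
missing = [f for f in expected if not (snapshot / f).exists()]

mtp_referenced = any(name.startswith("mtp.") for name in index["weight_map"])
print("mtp.* tensors referenced in index:", mtp_referenced)
print("missing shard files:", missing or "none")
```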

## Files

- `model-{1..5}-of-00005.safetensors` - main quantized weights (~18 GiB)
- `mtp.safetensors` - BF16 MTP draft head (0.85 GiB), referenced from `model.safetensors.index.json`
- `quantize_config.json` - gptqmodel config
- `quant_log.csv` - per-module quantization losses + timings
- Standard tokenizer / preprocessor / chat-template files inherited from upstream

## License & attribution

Inherits Apache 2.0 from Qwen/Qwen3.6-27B. All rights, responsibilities, and acceptable-use policies of the upstream license apply.
Quantization performed by LibertAI. This artifact is in production use serving the qwen3.6-27b model alias on the LibertAI inference platform.