Qwen3.6-27B-W4A16-G128

W4A16 GPTQ quantization of Qwen/Qwen3.6-27B, targeted at single-GPU vLLM serving on Ada-class GPUs (L40S, RTX 6000 Ada, RTX 4090) where the FP8 checkpoint leaves too little VRAM for usable KV cache at long context.

TL;DR

| Metric | FP8 (upstream) | This (W4A16-G128) |
|---|---|---|
| Disk size | 29 GiB | 19 GiB |
| KV cache (L40S 46 GiB, FP8 KV, MTP on, util 0.95) | 9.86 GiB → 75K tokens at 262K max | 21.79 GiB → 166K tokens at 200K max |
| Single-stream decode tok/s (random 1024/256, MTP=2) | 40.1 | 55.7 (+39 %) |
| Single-stream TPOT | 24.9 ms | 16.6 ms (-33 %) |
| Concurrent @16 output tok/s | 219 | 265 (+21 %) |
| Concurrent @16 TTFT p50 | 9.3 s | 8.0 s (-14 %) |

Quantization wins on every measured axis except single-stream prefill TTFT (+13 %), where the gptq_marlin kernel pays a dequantization cost during prefill; that cost is mitigated under concurrent load.
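
If you want to sanity-check the single-stream numbers without the full benchmark harness, a rough TTFT/TPOT probe against a running endpoint (see the serving section below) looks like this. A sketch assuming the openai Python client; stream-chunk count only approximates token count, so treat the output as indicative:

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first = last = None
chunks = 0
stream = client.chat.completions.create(
    model="LibertAIDAI/Qwen3.6-27B-W4A16-G128",
    messages=[{"role": "user", "content": "Explain KV-cache quantization in one paragraph."}],
    max_tokens=256,
    stream=True,
)
for _ in stream:
    last = time.perf_counter()
    if first is None:
        first = last  # time of first streamed chunk -> TTFT
    chunks += 1

print(f"TTFT: {(first - start) * 1000:.1f} ms")
print(f"TPOT: {(last - first) * 1000 / max(chunks - 1, 1):.1f} ms")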

What is quantized

Quantized (W4A16, group_size=128, symmetric, desc_act on):

  • model.language_model.layers.*.self_attn.{q,k,v,o}_proj in the 16 full-attention layers
  • model.language_model.layers.*.linear_attn.{in_proj_qkv,in_proj_z,out_proj} in the 48 Gated-DeltaNet layers
  • model.language_model.layers.{0..63}.mlp.{gate,up,down}_proj in all 64 layers

Kept BF16:

  • Vision tower (model.visual.* - all 27 SigLIP-style blocks + merger)
  • MTP head (mtp.* - 15 tensors, 0.85 GiB) - required for speculative decoding correctness
  • lm_head.weight and model.language_model.embed_tokens.weight
  • All *_norm and *.conv1d.weight (Mamba conv)
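
This split can be audited straight from the shard index: GPTQ-packed modules expose .qweight/.qzeros/.scales tensors, while everything kept in BF16 retains a plain .weight. A minimal sketch, assuming the repo files sit in the current directory and gptqmodel's standard packed-tensor naming:

import json
from collections import Counter

weight_map = json.load(open("model.safetensors.index.json"))["weight_map"]

# Modules that were GPTQ-packed carry a .qweight tensor.
qmods = {name.rsplit(".", 1)[0] for name in weight_map if name.endswith(".qweight")}

# Plain .weight tensors outside those modules are the kept-BF16 set.
kept = [name for name in weight_map
        if name.endswith(".weight") and name.rsplit(".", 1)[0] not in qmods]

print(f"{len(qmods)} quantized modules, e.g. {sorted(qmods)[:2]}")
print(Counter(name.split(".")[0] for name in kept))  # coarse grouping by first name component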

Quantization recipe

  • Tool: gptqmodel 6.0.3
  • Stack: transformers 5.6.0, torch 2.11.0+cu130, torchvision 0.26.0
  • Calibration: 512 samples from HuggingFaceH4/ultrachat_200k, max_seq_length=2048, applied with the model's chat template.
  • Hyperparameters: bits=4, group_size=128, sym=True, desc_act=True, damp_percent=0.01
  • Hardware: single L40S 46 GiB (sm_89, Ada Lovelace), driver 595.58.03, CUDA 13.2.
  • Wall clock: 53.4 minutes.
  • Per-layer GPTQ losses: 1e-9 to 1e-5 (very small - clean fit on hybrid attention).
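
For reference, the recipe maps onto the gptqmodel Python API roughly as follows. A sketch, not the exact production script; sample selection, truncation to max_seq_length=2048, and batch sizing are handled by the tool and may differ in detail:

from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-27B")
rows = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft").shuffle(seed=0).select(range(512))
# Render each conversation through the model's own chat template, as the recipe specifies.
calib = [tok.apply_chat_template(r["messages"], tokenize=False) for r in rows]

cfg = QuantizeConfig(bits=4, group_size=128, sym=True, desc_act=True, damp_percent=0.01)
model = GPTQModel.load("Qwen/Qwen3.6-27B", cfg)
model.quantize(calib)
model.save("Qwen3.6-27B-W4A16-G128")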

vLLM serving

Tested on vllm 0.19.1 with transformers 5.6.0+. The GPTQ quantization config is auto-detected (served via the gptq_marlin kernel); no --quantization flag is required.

vllm serve LibertAIDAI/Qwen3.6-27B-W4A16-G128 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 8 \
  --max-model-len 200000 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --limit-mm-per-prompt '{"image":4,"video":0,"audio":0}' \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

Drop the --speculative-config flag if you do not want MTP; this saves 0.5 GiB of VRAM (4K extra KV tokens) at the cost of ~2x slower single-stream decode.
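
Once the server is up, a minimal smoke test with the openai Python client (with --reasoning-parser qwen3, vLLM surfaces the thinking tokens as message.reasoning_content, separate from message.content):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="LibertAIDAI/Qwen3.6-27B-W4A16-G128",
    messages=[{"role": "user", "content": "What is 17 * 23?"}],
    max_tokens=512,
)
msg = resp.choices[0].message
print("reasoning:", getattr(msg, "reasoning_content", None))  # populated by the reasoning parser
print("answer:", msg.content)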

Caveats

  • Calibration was English chat only (ultrachat_200k). Multilingual and code-heavy workloads have not been measured against a held-out set; expect quality on par with other community W4A16 GPTQ Qwen quants, though this has not been specifically verified.
  • Vision is left in BF16, so vision-only quality should closely match upstream Qwen3.6-27B-FP8. It has not been separately re-evaluated.
  • MTP: mtp.safetensors is shipped as a separate file referenced from model.safetensors.index.json. Keep it alongside the main shards, or vLLM will boot without spec decoding - silently, with no error, just ~2x slower decode (a quick check follows this list).
  • Tested only on Ada (L40S, sm_89). Should also work on Hopper, Blackwell, and Ampere via gptq_marlin, but not benchmarked there.
  • desc_act=True may slightly increase load time vs desc_act=False; chosen for the small accuracy benefit on long-tail distributions in linear-attention layers.
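
A quick guard against the silent MTP fallback described above: confirm the draft head is both referenced by the index and present on disk (stdlib only, run from the model directory):

import json
import os

shards = set(json.load(open("model.safetensors.index.json"))["weight_map"].values())
assert "mtp.safetensors" in shards, "index does not reference the MTP draft head"

missing = [s for s in shards if not os.path.exists(s)]
assert not missing, f"missing shard files: {missing}"
print(f"all {len(shards)} referenced shards present, including mtp.safetensors")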

Files

  • model-{00001..00005}-of-00005.safetensors - main quantized weights (~18 GiB)
  • mtp.safetensors - BF16 MTP draft head (0.85 GiB), referenced from model.safetensors.index.json
  • quantize_config.json - gptqmodel config
  • quant_log.csv - per-module quantization losses + timings
  • Standard tokenizer / preprocessor / chat-template files inherited from upstream
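
Fetching the full snapshot keeps mtp.safetensors next to the main shards, which the MTP caveat above depends on; a sketch with huggingface_hub:

from huggingface_hub import snapshot_download

# Downloads every file in the repo, including mtp.safetensors and the index.
local_dir = snapshot_download("LibertAIDAI/Qwen3.6-27B-W4A16-G128")
print(local_dir)  # pass this path (or the repo id) to vllm serve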

License & attribution

Inherits Apache 2.0 from Qwen/Qwen3.6-27B. All rights, responsibilities, and acceptable-use policies of the upstream license apply.

Quantization performed by LibertAI. This artifact is in production use serving the qwen3.6-27b model alias on the LibertAI inference platform.
