# Qwen3.6-27B-W4A16-G128

W4A16 GPTQ quantization of Qwen/Qwen3.6-27B, targeted at single-GPU vLLM serving on Ada-class GPUs (L40S, RTX 6000 Ada, RTX 4090) where the FP8 checkpoint leaves too little VRAM for a usable KV cache at long context.

## TL;DR

|  | FP8 (upstream) | This (W4A16-G128) |
|---|---|---|
| Disk size | 29 GiB | 19 GiB |
| KV cache @ L40S 46 GiB, FP8 KV, MTP on, util 0.95 | 9.86 GiB → 75K tokens at 262K max | 21.79 GiB → 166K tokens at 200K max |
| Single-stream decode tok/s (random 1024/256, MTP=2) | 40.1 | 55.7 (+39 %) |
| Single-stream TPOT | 24.9 ms | 16.6 ms (-33 %) |
| Concurrent @16 output tok/s | 219 | 265 (+21 %) |
| Concurrent @16 TTFT p50 | 9.3 s | 8.0 s (-14 %) |
Quantization wins on every measured axis except single-stream prefill TTFT (+13 %, from gptq_marlin dequantization overhead during prefill; mitigated under concurrent load).

## What is quantized

Quantized (W4A16, `group_size=128`, symmetric, `desc_act` on):

- All `model.language_model.layers.{0..63}.self_attn.{q,k,v,o}_proj` (16 full-attention layers)
- All `model.language_model.layers.{0..63}.linear_attn.{in_proj_qkv,in_proj_z,out_proj}` (48 Gated-DeltaNet layers)
- All `model.language_model.layers.{0..63}.mlp.{gate,up,down}_proj`
Kept BF16:

- Vision tower (`model.visual.*` - all 27 SigLIP-style blocks + merger)
- MTP head (`mtp.*` - 15 tensors, 0.85 GiB) - required for speculative decoding correctness
- `lm_head.weight` and `model.language_model.embed_tokens.weight`
- All `*_norm` and `*.conv1d.weight` (Mamba conv)
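
To sanity-check which modules actually carry 4-bit weights after download, you can scan `model.safetensors.index.json` for GPTQ-packed tensors. A minimal sketch, assuming gptqmodel's usual tensor naming (`*.qweight` / `*.qzeros` / `*.scales`) and a hypothetical local snapshot path; not part of the official tooling:

```python
import json
from collections import Counter
from pathlib import Path

snapshot = Path("Qwen3.6-27B-W4A16-G128")  # hypothetical local download path
index = json.loads((snapshot / "model.safetensors.index.json").read_text())

quantized, bf16 = Counter(), Counter()
for name in index["weight_map"]:
    module = name.rsplit(".", 1)[0]
    if name.endswith(".qweight"):
        # gptqmodel packs 4-bit layers as <module>.qweight/.qzeros/.scales(/.g_idx)
        quantized[module.split(".")[-1]] += 1   # e.g. q_proj, in_proj_qkv, down_proj
    elif name.endswith(".weight") and "norm" not in name:
        bf16[module.split(".")[-1]] += 1        # embeddings, vision tower, MTP head, conv1d

print("quantized module types:", dict(quantized))
print("BF16 module types:", dict(bf16))
```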

## Quantization recipe

- Tool: `gptqmodel 6.0.3`
- Stack: `transformers 5.6.0`, `torch 2.11.0+cu130`, `torchvision 0.26.0`
- Calibration: 512 samples from `HuggingFaceH4/ultrachat_200k`, `max_seq_length=2048`, applied with the model's chat template.
- Hyperparameters: `bits=4, group_size=128, sym=True, desc_act=True, damp_percent=0.01`
- Hardware: single L40S 46 GiB (sm_89, Ada Lovelace), driver 595.58.03, CUDA 13.2.
- Wall clock: 53.4 minutes.
- Per-layer GPTQ losses: 1e-9 to 1e-5 (very small - clean fit on hybrid attention).
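
The production script is not published here, but the recipe above maps onto gptqmodel's standard API roughly as follows. A minimal sketch, assuming gptqmodel's `GPTQModel`/`QuantizeConfig` interface and the `train_sft` split of ultrachat_200k; the exact sample selection and sequence-length handling in the real run may differ:

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from gptqmodel import GPTQModel, QuantizeConfig

base = "Qwen/Qwen3.6-27B"
tok = AutoTokenizer.from_pretrained(base)

# 512 chat samples rendered through the model's own chat template
# (the production run capped samples at max_seq_length=2048).
rows = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft").shuffle(seed=0).select(range(512))
calibration = [tok.apply_chat_template(r["messages"], tokenize=False) for r in rows]

cfg = QuantizeConfig(bits=4, group_size=128, sym=True, desc_act=True, damp_percent=0.01)

model = GPTQModel.load(base, cfg)   # load BF16 base weights
model.quantize(calibration)         # run GPTQ against the calibration set
model.save("Qwen3.6-27B-W4A16-G128")
```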

## vLLM serving

Tested on `vllm 0.19.1` with `transformers 5.6.0+`. The GPTQ quantization config is auto-detected from the checkpoint; no `--quantization` flag is required.

```bash
vllm serve LibertAIDAI/Qwen3.6-27B-W4A16-G128 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 8 \
  --max-model-len 200000 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --limit-mm-per-prompt '{"image":4,"video":0,"audio":0}' \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
```

Drop the `--speculative-config` flag if you do not want MTP - it saves 0.5 GiB of VRAM (~4K extra KV-cache tokens) at the cost of roughly 2x slower single-stream decode.
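
Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch assuming the default vLLM endpoint on `http://localhost:8000/v1`; the model name must match the served repo id:

```python
from openai import OpenAI

# vLLM does not require a real API key unless one was configured at launch.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="LibertAIDAI/Qwen3.6-27B-W4A16-G128",
    messages=[{"role": "user", "content": "Explain W4A16 quantization in two sentences."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```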

## Caveats

- Calibration was English chat only (`ultrachat_200k`). Multilingual and code-heavy workloads have not been measured against a held-out set; expect quality on par with other community W4A16 GPTQ Qwen quants, but this is not specifically guaranteed.
- The vision tower is kept in BF16, so vision-only quality should closely match upstream Qwen3.6-27B-FP8. Not separately re-evaluated.
- The MTP head `mtp.safetensors` is shipped as a separate file referenced from `model.safetensors.index.json`. Keep both alongside the main shards or vLLM will boot without spec decoding (silently - no error, just ~2x slower decode); a pre-flight check sketch follows this list.
- Tested only on Ada (L40S, sm_89). Should also work on Hopper, Blackwell, and Ampere via gptq_marlin, but not benchmarked there.
- `desc_act=True` may slightly increase load time vs `desc_act=False`; it was chosen for the small accuracy benefit on long-tail distributions in the linear-attention layers.
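
Because a missing `mtp.safetensors` fails silently, a quick pre-flight check of the local snapshot can save debugging time. A minimal sketch, assuming a hypothetical local download path; not part of the official tooling:

```python
import json
from pathlib import Path

snapshot = Path("Qwen3.6-27B-W4A16-G128")  # hypothetical local download path
index = json.loads((snapshot / "model.safetensors.index.json").read_text())

# Every shard the index references must be present, including the MTP draft head.
expected = set(index["weight_map"].values())
missing = [f for f in expected if not (snapshot / f).exists()]

mtp_referenced = any(name.startswith("mtp.") for name in index["weight_map"])
print("mtp.* tensors referenced in index:", mtp_referenced)
print("missing shard files:", missing or "none")
```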

## Files

- `model-{1..5}-of-00005.safetensors` - main quantized weights (~18 GiB)
- `mtp.safetensors` - BF16 MTP draft head (0.85 GiB), referenced from `model.safetensors.index.json`
- `quantize_config.json` - gptqmodel config
- `quant_log.csv` - per-module quantization losses + timings
- Standard tokenizer / preprocessor / chat-template files inherited from upstream

## License & attribution

Inherits Apache 2.0 from Qwen/Qwen3.6-27B. All rights, responsibilities, and acceptable-use policies of the upstream license apply.
Quantization performed by LibertAI. This artifact is in production use serving the qwen3.6-27b model alias on the LibertAI inference platform.