Qwen3.6-27B GPTQ 4-bit

GPTQ 4-bit quantization of Qwen/Qwen3.6-27B, a 27B-parameter dense multimodal model. Uses FOEM (First-Order Error Matters, AAAI 2026) for outlier-aware error compensation, enabling pure 4-bit quantization without per-module FP16 exclusions.

Includes the full vision encoder and MTP (Multi-Token Prediction) module for image understanding and speculative decoding support.

Model Overview

  • Architecture: Qwen3_5ForConditionalGeneration (multimodal: text + vision; dense sibling of qwen3_5_moe)
  • Total parameters: ~27B
  • Layers: 64 (48 linear-attention + 16 full-attention, repeating 3:1 pattern)
  • Hidden size: 5120, intermediate size: 17408 (dense MLP — no MoE)
  • Context length: 262,144 tokens
  • Vision encoder: 27-block ViT, BF16 (333 tensors)
  • MTP module: 1-layer speculative decoding head, BF16 (15 tensors)

Quantization Details

All quantizable Linear modules in the text decoder are quantized to INT4 using GPTQ + FOEM. The vision encoder, MTP module, norms, embeddings, and LM head remain at BF16/FP16 for quality preservation.

| Component | Precision | Notes |
|---|---|---|
| mlp.{gate_proj, up_proj, down_proj} | INT4 (GPTQ + FOEM) | All 64 layers |
| self_attn.{q,k,v,o}_proj | INT4 (GPTQ + FOEM) | 16 full-attention layers |
| linear_attn.{in_proj_qkv, in_proj_z, out_proj} | INT4 (GPTQ + FOEM) | 48 linear-attention layers (GatedDeltaNet) |
| linear_attn.{in_proj_a, in_proj_b} | FP16 | Tiny projections, kept at full precision |
| Vision encoder (model.visual.*) | BF16 | 333 tensors, full precision |
| MTP module (mtp.*) | BF16 | 15 tensors, full precision |
| Embeddings, LM head, norms | FP16/BF16 | Full precision |
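
To double-check which modules ended up INT4 versus full precision, you can read the checkpoint's safetensors index. The sketch below assumes the standard GPTQ tensor packing (.qweight/.scales/.qzeros names); it is a convenience for inspection, not part of the quantization pipeline.

import json
from huggingface_hub import hf_hub_download

# Download only the index file and group tensor names by module.
index_path = hf_hub_download(
    "btbtyler09/Qwen3.6-27B-GPTQ-4bit", "model.safetensors.index.json"
)
weight_map = json.load(open(index_path))["weight_map"]

# GPTQ-packed modules carry .qweight tensors; everything else keeps a plain .weight.
int4_modules = sorted({k.rsplit(".", 1)[0] for k in weight_map if k.endswith(".qweight")})
fp_modules = sorted({k.rsplit(".", 1)[0] for k in weight_map
                     if k.endswith(".weight") and k.rsplit(".", 1)[0] not in int4_modules})

print(f"{len(int4_modules)} INT4 modules, e.g. {int4_modules[:2]}")
print(f"{len(fp_modules)} full-precision modules, e.g. {fp_modules[:2]}")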

GPTQ + FOEM configuration:

  • Bits: 4
  • Group size: 32
  • Symmetric: Yes
  • desc_act: No
  • true_sequential: Yes
  • act_group_aware: Yes
  • mse: 2.0 (activation-weighted MSE for outlier handling)
  • FOEM: alpha=0.25, beta=0.2 (FOEM + GPTAQ defaults)
  • Fallback: RTN at 0.5% threshold for unstable Hessians
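
For reference, the core settings map onto GPTQModel's QuantizeConfig. The sketch below covers only the standard fields (bits, group size, symmetry, desc_act); the FOEM/GPTAQ, mse, and act_group_aware options have version-dependent flag names in GPTQModel, so check the docs for your release rather than copying this verbatim.

from gptqmodel import GPTQModel, QuantizeConfig

# Core GPTQ settings from the list above; FOEM/GPTAQ, mse and
# act_group_aware flags are intentionally omitted (version-dependent).
qcfg = QuantizeConfig(
    bits=4,          # INT4 weights
    group_size=32,   # 32-column quantization groups
    sym=True,        # symmetric quantization
    desc_act=False,  # no activation-order reordering
)

# Placeholder calibration set; in practice 256 mixed code + C4 samples
# were used (see the Calibration section below).
calibration_dataset = [
    "def quicksort(xs):\n    ...",
    "The quick brown fox jumps over the lazy dog.",
]

model = GPTQModel.load("Qwen/Qwen3.6-27B", qcfg)
model.quantize(calibration_dataset)
model.save("Qwen3.6-27B-GPTQ-4bit")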

What FOEM Does

FOEM (First-Order Error Matters; Zheng et al., AAAI 2026) augments the standard GPTQ Hessian-based weight update with a first-order error compensation term. In pure-4-bit settings, this is the difference between a usable checkpoint and one that drifts on long-tail tokens — for this model it cuts perplexity degradation from +8.45% (vanilla GPTQ 4-bit) down to +1.95%, while keeping the same 4-bit format and vLLM kernel compatibility.
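
To make the mechanism concrete, here is a minimal NumPy sketch of the column-by-column error-compensation loop that GPTQ uses and FOEM extends: each column is quantized, and its quantization error is pushed onto the not-yet-quantized columns through the inverse Hessian. This shows only the baseline second-order update; FOEM's first-order correction term and the production Cholesky-based implementation live in GPTQModel and are not reproduced here.

import numpy as np

def quant4(w):
    # symmetric 4-bit round-to-nearest for a single weight column
    scale = np.abs(w).max() / 7.0 + 1e-12
    return np.clip(np.round(w / scale), -8, 7) * scale

def gptq_columns(W, H_inv):
    # Quantize columns left to right; after each column, propagate its
    # quantization error to the remaining columns via the inverse Hessian
    # (the classic GPTQ update that FOEM augments with a first-order term).
    W = W.copy()
    Q = np.zeros_like(W)
    for j in range(W.shape[1]):
        Q[:, j] = quant4(W[:, j])
        err = (W[:, j] - Q[:, j]) / H_inv[j, j]
        W[:, j + 1:] -= np.outer(err, H_inv[j, j + 1:])
    return Q

# Toy demo with random weights and a well-conditioned inverse Hessian.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
H = np.cov(rng.normal(size=(64, 16)), rowvar=False) + 0.1 * np.eye(16)
Q = gptq_columns(W, np.linalg.inv(H))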

Calibration

  • Dataset: Mixed — evol-codealpaca-v1 (code) + C4 (general English text)
  • Samples: 256, binned uniformly across context lengths 256–2048 tokens
  • Quantizer: GPTQModel v6.0.3
  • Note: this is general-purpose calibration. Calibrating on wikitext directly would yield lower wikitext perplexity but worse out-of-distribution performance; we optimized for the latter.
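
One plausible way to build such a length-binned calibration set is sketched below; the dataset slice, bin edges, and sampling logic are illustrative assumptions, not the exact script used for this checkpoint (which also mixed in evol-codealpaca-v1 code samples).

from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-27B", trust_remote_code=True)
stream = iter(load_dataset("allenai/c4", "en", split="train", streaming=True))

bins = [256, 512, 1024, 2048]   # target token lengths
per_bin = 256 // len(bins)      # 256 samples total, uniform across bins
samples = []

for target in bins:
    taken = 0
    while taken < per_bin:
        ids = tok(next(stream)["text"])["input_ids"]
        if len(ids) >= target:
            samples.append(tok.decode(ids[:target]))  # truncate to the bin length
            taken += 1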

Model Size

| Version | Size | Compression |
|---|---|---|
| BF16 (original) | ~50 GB | baseline |
| GPTQ 8-bit | 32 GB | 1.6× |
| GPTQ 4-bit FOEM | 21 GB | 2.4× |

The 21 GB total includes the BF16 vision encoder (1.2 GB) and the BF16 MTP head (0.4 GB), both kept at full precision; the quantized text decoder alone is ~19 GB.

Perplexity

Evaluated on wikitext-2-raw-v1 (test set), seq_len=2048, stride=512:

| Model | Perplexity | Degradation |
|---|---|---|
| BF16 (original) | 7.0652 | baseline |
| GPTQ 8-bit | 7.0697 | +0.07% (effectively lossless) |
| GPTQ 4-bit FOEM (this) | 7.2032 | +1.95% |

For reference, vanilla GPTQ 4-bit (no FOEM) on this same model lands at PPL 7.6626 (+8.45%) — FOEM recovers most of that gap at the same disk size.
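
For anyone reproducing these numbers, the sketch below follows the standard sliding-window perplexity recipe with window 2048 and stride 512. It assumes `model` is an already-loaded HF-style causal LM (for example this checkpoint loaded as shown under Usage) and `tok` its tokenizer; it approximates the evaluation setup rather than reproducing the exact script used.

import torch
from datasets import load_dataset

def sliding_window_ppl(model, tok, text, seq_len=2048, stride=512):
    ids = tok(text, return_tensors="pt").input_ids
    nlls, n_scored = [], 0
    for start in range(0, ids.size(1) - seq_len, stride):
        window = ids[:, start:start + seq_len].to(model.device)
        labels = window.clone()
        labels[:, :-stride] = -100   # score only the newest `stride` tokens
        with torch.no_grad():
            loss = model(input_ids=window, labels=labels).loss
        nlls.append(loss.float() * stride)
        n_scored += stride
    return torch.exp(torch.stack(nlls).sum() / n_scored).item()

wikitext = "\n\n".join(
    load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"]
)
# print(sliding_window_ppl(model, tok, wikitext))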

Usage

vLLM (Recommended for Serving)

vllm serve btbtyler09/Qwen3.6-27B-GPTQ-4bit \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 262144 \
  --dtype float16 \
  --skip-mm-profiling \
  --limit-mm-per-prompt '{"image": 2}'

| Parameter | Description |
|---|---|
| --tensor-parallel-size 4 | Shard across 4 GPUs (adjust to your setup) |
| --gpu-memory-utilization 0.95 | Use 95% of GPU VRAM for KV cache + weights |
| --max-model-len 262144 | Full 256K context window support |
| --dtype float16 | Run in FP16 (required for ROCm GPTQ kernels) |
| --skip-mm-profiling | Skip multimodal memory profiling at startup |
| --limit-mm-per-prompt '{"image": 2}' | Allow up to 2 images per request |

vLLM bug workaround (may apply): Up through at least vLLM 0.19.x, Qwen3_5TextConfig defines ignore_keys_at_rope_validation as a list instead of a set, causing a TypeError during config parsing. Apply this patch before serving if you hit the error:

python3 -c "
for f in [
    '/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/qwen3_5.py',
    '/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/qwen3_5_moe.py',
]:
    t = open(f).read()
    t = t.replace(
        'ignore_keys_at_rope_validation\"] = [\n            \"mrope_section\",\n            \"mrope_interleaved\",\n        ]',
        'ignore_keys_at_rope_validation\"] = {\n            \"mrope_section\",\n            \"mrope_interleaved\",\n        }')
    open(f,'w').write(t)
    print('Patched', f)
"

Vision Example (via OpenAI API)

import base64, requests

# encode a local image as a base64 data URL for the chat completions request
with open("image.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "btbtyler09/Qwen3.6-27B-GPTQ-4bit",
    "messages": [{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        {"type": "text", "text": "Describe what you see in this image."},
    ]}],
    "max_tokens": 1024,
})
print(response.json()["choices"][0]["message"]["content"])

GPTQModel / transformers

Loads natively under GPTQModel v6.0.3+ via the upstream Qwen3_5QModel definition (which uses AutoModelForImageTextToText and the multimodal model.language_model.layers.* weight prefix). No checkpoint patching needed.

from gptqmodel import GPTQModel
model = GPTQModel.load("btbtyler09/Qwen3.6-27B-GPTQ-4bit", trust_remote_code=True)
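
A quick text-only smoke test after loading; the tokenizer handling and device placement below are assumptions about a typical single-GPU setup, not part of the card's own recipe (GPTQModel exposes the usual generate() API).

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("btbtyler09/Qwen3.6-27B-GPTQ-4bit",
                                    trust_remote_code=True)
inputs = tok("Explain GPTQ 4-bit quantization in one sentence.",
             return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))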

Technical Notes

Qwen3.6-27B is a dense multimodal model — it shares the Qwen3_5ForConditionalGeneration wrapper with the MoE-based Qwen3.6-35B-A3B but uses a standard dense MLP in every decoder layer instead of an expert mixture. The text decoder alternates 3 linear-attention (GatedDeltaNet) layers with 1 full-attention layer, repeated 16 times for 64 total layers.
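
As a rough illustration of that layout, the snippet below reproduces the 48/16 split; the exact ordering within each group of four comes from the model config's layer-type list and is assumed here, not read from the checkpoint.

# Illustrative only: 3 linear-attention layers followed by 1 full-attention
# layer, repeated 16 times (ordering within each group of four is assumed).
layer_types = [
    "full_attention" if (i + 1) % 4 == 0 else "linear_attention"
    for i in range(64)
]
assert layer_types.count("linear_attention") == 48
assert layer_types.count("full_attention") == 16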

The vision encoder (27-block ViT) and MTP speculative decoding module are preserved at full BF16 precision from the original model. Only the text decoder's quantizable Linear modules are converted to INT4.

Credits

  • Base Model: Qwen — Qwen3.6-27B
  • Quantization: GPTQ + FOEM via GPTQModel v6.0.3
  • FOEM paper: Zheng et al., "First-Order Error Matters: Accurate Compensation for Quantized Large Language Models", AAAI 2026
  • Quantized by: btbtyler09

License

This model inherits the Apache 2.0 license from the base model.
