Qwen3.6-27B GPTQ 4-bit

GPTQ 4-bit quantization of Qwen/Qwen3.6-27B, a 27B-parameter dense multimodal model. Uses FOEM (First-Order Error Matters, AAAI 2026) for outlier-aware error compensation, enabling pure 4-bit quantization without per-module FP16 exclusions.

Includes the full vision encoder and MTP (Multi-Token Prediction) module for image understanding and speculative decoding support.

Model Overview

  • Architecture: Qwen3_5ForConditionalGeneration (multimodal: text + vision; dense sibling of qwen3_5_moe)
  • Total parameters: ~27B
  • Layers: 64 (48 linear-attention + 16 full-attention, repeating 3:1 pattern)
  • Hidden size: 5120, intermediate size: 17408 (dense MLP — no MoE)
  • Context length: 262,144 tokens
  • Vision encoder: 27-block ViT, BF16 (333 tensors)
  • MTP module: 1-layer speculative decoding head, BF16 (15 tensors)

Quantization Details

All quantizable Linear modules in the text decoder are quantized to INT4 using GPTQ + FOEM. The vision encoder, MTP module, norms, embeddings, and LM head remain at BF16/FP16 for quality preservation.

| Component | Precision | Notes |
|---|---|---|
| mlp.{gate_proj, up_proj, down_proj} | INT4 (GPTQ + FOEM) | All 64 layers |
| self_attn.{q,k,v,o}_proj | INT4 (GPTQ + FOEM) | 16 full-attention layers |
| linear_attn.{in_proj_qkv, in_proj_z, out_proj} | INT4 (GPTQ + FOEM) | 48 linear-attention layers (GatedDeltaNet) |
| linear_attn.{in_proj_a, in_proj_b} | FP16 | Tiny projections, kept at full precision |
| Vision encoder (model.visual.*) | BF16 | 333 tensors, full precision |
| MTP module (mtp.*) | BF16 | 15 tensors, full precision |
| Embeddings, LM head, norms | FP16/BF16 | Full precision |
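
To double-check which modules ended up INT4 versus full precision, you can read the checkpoint's safetensors index. The sketch below assumes the standard GPTQ tensor packing (.qweight/.scales/.qzeros names); it is a convenience for inspection, not part of the quantization pipeline.

import json
from huggingface_hub import hf_hub_download

# Download only the index file and group tensor names by module.
index_path = hf_hub_download(
    "btbtyler09/Qwen3.6-27B-GPTQ-4bit", "model.safetensors.index.json"
)
weight_map = json.load(open(index_path))["weight_map"]

# GPTQ-packed modules carry .qweight tensors; everything else keeps a plain .weight.
int4_modules = sorted({k.rsplit(".", 1)[0] for k in weight_map if k.endswith(".qweight")})
fp_modules = sorted({k.rsplit(".", 1)[0] for k in weight_map
                     if k.endswith(".weight") and k.rsplit(".", 1)[0] not in int4_modules})

print(f"{len(int4_modules)} INT4 modules, e.g. {int4_modules[:2]}")
print(f"{len(fp_modules)} full-precision modules, e.g. {fp_modules[:2]}")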

GPTQ + FOEM configuration:

  • Bits: 4
  • Group size: 32
  • Symmetric: Yes
  • desc_act: No
  • true_sequential: Yes
  • act_group_aware: Yes
  • mse: 2.0 (activation-weighted MSE for outlier handling)
  • FOEM: alpha=0.25, beta=0.2 (FOEM + GPTAQ defaults)
  • Fallback: RTN at 0.5% threshold for unstable Hessians
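
For reference, the core settings map onto GPTQModel's QuantizeConfig. The sketch below covers only the standard fields (bits, group size, symmetry, desc_act); the FOEM/GPTAQ, mse, and act_group_aware options have version-dependent flag names in GPTQModel, so check the docs for your release rather than copying this verbatim.

from gptqmodel import GPTQModel, QuantizeConfig

# Core GPTQ settings from the list above; FOEM/GPTAQ, mse and
# act_group_aware flags are intentionally omitted (version-dependent).
qcfg = QuantizeConfig(
    bits=4,          # INT4 weights
    group_size=32,   # 32-column quantization groups
    sym=True,        # symmetric quantization
    desc_act=False,  # no activation-order reordering
)

# Placeholder calibration set; in practice 256 mixed code + C4 samples
# were used (see the Calibration section below).
calibration_dataset = [
    "def quicksort(xs):\n    ...",
    "The quick brown fox jumps over the lazy dog.",
]

model = GPTQModel.load("Qwen/Qwen3.6-27B", qcfg)
model.quantize(calibration_dataset)
model.save("Qwen3.6-27B-GPTQ-4bit")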

What FOEM Does

FOEM (First-Order Error Matters; Zheng et al., AAAI 2026) augments the standard GPTQ Hessian-based weight update with a first-order error compensation term. In pure-4-bit settings, this is the difference between a usable checkpoint and one that drifts on long-tail tokens — for this model it cuts perplexity degradation from +8.45% (vanilla GPTQ 4-bit) down to +1.95%, while keeping the same 4-bit format and vLLM kernel compatibility.
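
To make the mechanism concrete, here is a minimal NumPy sketch of the column-by-column error-compensation loop that GPTQ uses and FOEM extends: each column is quantized, and its quantization error is pushed onto the not-yet-quantized columns through the inverse Hessian. This shows only the baseline second-order update; FOEM's first-order correction term and the production Cholesky-based implementation live in GPTQModel and are not reproduced here.

import numpy as np

def quant4(w):
    # symmetric 4-bit round-to-nearest for a single weight column
    scale = np.abs(w).max() / 7.0 + 1e-12
    return np.clip(np.round(w / scale), -8, 7) * scale

def gptq_columns(W, H_inv):
    # Quantize columns left to right; after each column, propagate its
    # quantization error to the remaining columns via the inverse Hessian
    # (the classic GPTQ update that FOEM augments with a first-order term).
    W = W.copy()
    Q = np.zeros_like(W)
    for j in range(W.shape[1]):
        Q[:, j] = quant4(W[:, j])
        err = (W[:, j] - Q[:, j]) / H_inv[j, j]
        W[:, j + 1:] -= np.outer(err, H_inv[j, j + 1:])
    return Q

# Toy demo with random weights and a well-conditioned inverse Hessian.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
H = np.cov(rng.normal(size=(64, 16)), rowvar=False) + 0.1 * np.eye(16)
Q = gptq_columns(W, np.linalg.inv(H))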

Calibration

  • Dataset: Mixed — evol-codealpaca-v1 (code) + C4 (general English text)
  • Samples: 256, binned uniformly across context lengths 256–2048 tokens
  • Quantizer: GPTQModel v6.0.3
  • Note: this is general-purpose calibration. Calibrating on wikitext directly would yield lower wikitext perplexity but worse out-of-distribution performance; we optimized for the latter.
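
One plausible way to build such a length-binned calibration set is sketched below; the dataset slice, bin edges, and sampling logic are illustrative assumptions, not the exact script used for this checkpoint (which also mixed in evol-codealpaca-v1 code samples).

from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-27B", trust_remote_code=True)
stream = iter(load_dataset("allenai/c4", "en", split="train", streaming=True))

bins = [256, 512, 1024, 2048]   # target token lengths
per_bin = 256 // len(bins)      # 256 samples total, uniform across bins
samples = []

for target in bins:
    taken = 0
    while taken < per_bin:
        ids = tok(next(stream)["text"])["input_ids"]
        if len(ids) >= target:
            samples.append(tok.decode(ids[:target]))  # truncate to the bin length
            taken += 1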

Model Size

| Version | Size | Compression |
|---|---|---|
| BF16 (original) | ~50 GB | baseline |
| GPTQ 8-bit | 32 GB | 1.6× |
| GPTQ 4-bit FOEM | 21 GB | 2.4× |

The 21 GB total includes the BF16 vision encoder (1.2 GB) and the BF16 MTP head (0.4 GB), both kept at full precision; the quantized text decoder alone is ~19 GB.

Perplexity

Evaluated on wikitext-2-raw-v1 (test set), seq_len=2048, stride=512:

| Model | Perplexity | Degradation |
|---|---|---|
| BF16 (original) | 7.0652 | baseline |
| GPTQ 8-bit | 7.0697 | +0.07% (effectively lossless) |
| GPTQ 4-bit FOEM (this) | 7.2032 | +1.95% |

For reference, vanilla GPTQ 4-bit (no FOEM) on this same model lands at PPL 7.6626 (+8.45%) — FOEM recovers most of that gap at the same disk size.
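
For anyone reproducing these numbers, the sketch below follows the standard sliding-window perplexity recipe with window 2048 and stride 512. It assumes `model` is an already-loaded HF-style causal LM (for example this checkpoint loaded as shown under Usage) and `tok` its tokenizer; it approximates the evaluation setup rather than reproducing the exact script used.

import torch
from datasets import load_dataset

def sliding_window_ppl(model, tok, text, seq_len=2048, stride=512):
    ids = tok(text, return_tensors="pt").input_ids
    nlls, n_scored = [], 0
    for start in range(0, ids.size(1) - seq_len, stride):
        window = ids[:, start:start + seq_len].to(model.device)
        labels = window.clone()
        labels[:, :-stride] = -100   # score only the newest `stride` tokens
        with torch.no_grad():
            loss = model(input_ids=window, labels=labels).loss
        nlls.append(loss.float() * stride)
        n_scored += stride
    return torch.exp(torch.stack(nlls).sum() / n_scored).item()

wikitext = "\n\n".join(
    load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"]
)
# print(sliding_window_ppl(model, tok, wikitext))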

Usage

vLLM (Recommended for Serving)

vllm serve btbtyler09/Qwen3.6-27B-GPTQ-4bit \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 262144 \
  --dtype float16 \
  --skip-mm-profiling \
  --limit-mm-per-prompt '{"image": 2}'

| Parameter | Description |
|---|---|
| --tensor-parallel-size 4 | Shard across 4 GPUs (adjust to your setup) |
| --gpu-memory-utilization 0.95 | Use 95% of GPU VRAM for KV cache + weights |
| --max-model-len 262144 | Full 256K context window support |
| --dtype float16 | Run in FP16 (required for ROCm GPTQ kernels) |
| --skip-mm-profiling | Skip multimodal memory profiling at startup |
| --limit-mm-per-prompt '{"image": 2}' | Allow up to 2 images per request |

vLLM bug workaround (may apply): Up through at least vLLM 0.19.x, Qwen3_5TextConfig defines ignore_keys_at_rope_validation as a list instead of a set, causing a TypeError during config parsing. Apply this patch before serving if you hit the error:

python3 -c "
for f in [
    '/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/qwen3_5.py',
    '/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/qwen3_5_moe.py',
]:
    t = open(f).read()
    t = t.replace(
        'ignore_keys_at_rope_validation\"] = [\n            \"mrope_section\",\n            \"mrope_interleaved\",\n        ]',
        'ignore_keys_at_rope_validation\"] = {\n            \"mrope_section\",\n            \"mrope_interleaved\",\n        }')
    open(f,'w').write(t)
    print('Patched', f)
"

Vision Example (via OpenAI API)

import base64, requests

# encode a local image as a base64 data URL for the chat completions request
with open("image.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "btbtyler09/Qwen3.6-27B-GPTQ-4bit",
    "messages": [{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        {"type": "text", "text": "Describe what you see in this image."},
    ]}],
    "max_tokens": 1024,
})
print(response.json()["choices"][0]["message"]["content"])

GPTQModel / transformers

Loads natively under GPTQModel v6.0.3+ via the upstream Qwen3_5QModel definition (which uses AutoModelForImageTextToText and the multimodal model.language_model.layers.* weight prefix). No checkpoint patching needed.

from gptqmodel import GPTQModel
model = GPTQModel.load("btbtyler09/Qwen3.6-27B-GPTQ-4bit", trust_remote_code=True)
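
A quick text-only smoke test after loading; the tokenizer handling and device placement below are assumptions about a typical single-GPU setup, not part of the card's own recipe (GPTQModel exposes the usual generate() API).

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("btbtyler09/Qwen3.6-27B-GPTQ-4bit",
                                    trust_remote_code=True)
inputs = tok("Explain GPTQ 4-bit quantization in one sentence.",
             return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))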

Technical Notes

Qwen3.6-27B is a dense multimodal model — it shares the Qwen3_5ForConditionalGeneration wrapper with the MoE-based Qwen3.6-35B-A3B but uses a standard dense MLP in every decoder layer instead of an expert mixture. The text decoder alternates 3 linear-attention (GatedDeltaNet) layers with 1 full-attention layer, repeated 16 times for 64 total layers.
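
As a rough illustration of that layout, the snippet below reproduces the 48/16 split; the exact ordering within each group of four comes from the model config's layer-type list and is assumed here, not read from the checkpoint.

# Illustrative only: 3 linear-attention layers followed by 1 full-attention
# layer, repeated 16 times (ordering within each group of four is assumed).
layer_types = [
    "full_attention" if (i + 1) % 4 == 0 else "linear_attention"
    for i in range(64)
]
assert layer_types.count("linear_attention") == 48
assert layer_types.count("full_attention") == 16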

The vision encoder (27-block ViT) and MTP speculative decoding module are preserved at full BF16 precision from the original model. Only the text decoder's quantizable Linear modules are converted to INT4.

Credits

  • Base Model: Qwen — Qwen3.6-27B
  • Quantization: GPTQ + FOEM via GPTQModel v6.0.3
  • FOEM paper: Zheng et al., "First-Order Error Matters: Accurate Compensation for Quantized Large Language Models", AAAI 2026
  • Quantized by: btbtyler09

License

This model inherits the Apache 2.0 license from the base model.
