Qwen3.6-27B INT8 AutoRound
This is an unofficial INT8 quantized version of Qwen3.6-27B, created with AutoRound.
Available versions
- There are two versions.
- The main branch is slightly smaller: it also quantizes the self_attn layers and disables grouping (group_size -1), at a slight cost to the model's intelligence.
- For users with 48GB VRAM, the main branch is recommended. If you have more than that, the gs128 branch might be better; the performance difference in practical use is minimal. A download sketch for either branch is shown below.
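If it helps, here is a minimal sketch for grabbing a specific branch with huggingface_hub; the repo id and branch name are taken from this page, so double-check them before running:

```python
from huggingface_hub import snapshot_download

# Main branch (smaller: self_attn quantized, group_size -1).
snapshot_download(
    repo_id="Minachist/Qwen3.6-27B-INT8-AutoRound",
    local_dir="./Qwen3.6-27B-INT8-AutoRound",
)

# gs128 branch (group_size 128, self_attn left unquantized).
snapshot_download(
    repo_id="Minachist/Qwen3.6-27B-INT8-AutoRound",
    revision="gs128",
    local_dir="./Qwen3.6-27B-INT8-AutoRound-gs128",
)
```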
Quantization details
| Field | Main branch | gs128 branch |
|---|---|---|
| Base | Qwen/Qwen3.6-27B | Qwen/Qwen3.6-27B |
| Method | AutoRound (intel/auto-round), custom recipe | AutoRound (intel/auto-round), default recipe |
| Scheme | W8A16 | W8A16 |
| Bits | 8 | 8 |
| Group size | -1 | 128 |
| Symmetric | yes | yes |
| Unquantized layers | visual, mtp, linear_attn, embed_tokens, lm_head | visual, mtp, self_attn, linear_attn, embed_tokens, lm_head |
| Calibration samples | 128 | 128 |
| Iterations | 1000 | 200 |
| Batch size | 8 | 8 |
| torch.compile | enabled | enabled |
| Size | 36.8 GB | 38.8 GB |
| GPU used for quant | 2× RTX 3090 | 2× RTX 3090 |
- For more information, please check quantize.py.
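As a rough illustration (not a copy of quantize.py), an AutoRound W8A16 run with the main-branch settings from the table could look roughly like this; argument names can differ between auto-round versions, and the layer exclusions are omitted here:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3.6-27B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Main-branch settings from the table: 8-bit symmetric weights,
# group_size -1, 1000 iterations, 128 calibration samples, batch size 8.
# The actual recipe also leaves visual/mtp/linear_attn/embed_tokens/lm_head
# unquantized; that part is not shown here.
autoround = AutoRound(
    model,
    tokenizer,
    bits=8,
    group_size=-1,
    sym=True,
    iters=1000,
    nsamples=128,
    batch_size=8,
)

autoround.quantize()
autoround.save_quantized("./Qwen3.6-27B-INT8-AutoRound", format="auto_round")
```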
Quantization log
- Please check log.txt.
How to use
This model was tested with the latest docker.io/vllm/vllm-openai:cu130-nightly image; vLLM is recommended for inference.
Example args (for 2× RTX 3090 users):
vllm serve ./Qwen3.6-27B-INT8-AutoRound \
--tensor-parallel-size 2 \
--attention-backend FLASHINFER \
--performance-mode interactivity \
--max-model-len auto \
--max-num-batched-tokens 2048 \
--max-num-seqs 1 \
--gpu-memory-utilization 0.932 \
--compilation-config '{"mode":"VLLM_COMPILE","cudagraph_capture_sizes":[3]}' \
-O3 \
--async-scheduling \
--language-model-only \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
--default-chat-template-kwargs.preserve_thinking true \
--mamba-cache-mode all \
--mamba-block-size 8 \
--enable-prefix-caching \
--enable-chunked-prefill
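Once the server is running, any OpenAI-compatible client works. A minimal sketch with the openai Python package is below; it assumes vLLM's default behavior of using the serve path as the model id (pass --served-model-name if you prefer a different id):

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API, by default on port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="./Qwen3.6-27B-INT8-AutoRound",  # matches the path given to `vllm serve`
    messages=[{"role": "user", "content": "Summarize what INT8 quantization does."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```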
- With these settings, you get around 129k tokens of context. You can also add --kv-cache-dtype fp8_e4m3 --calculate-kv-scales to reach about 252k tokens.
- To free more VRAM for the KV cache, you can add --enforce-eager (you may need to remove --compilation-config) or set the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False environment variable (requires --disable-custom-all-reduce), but tokens/s will be noticeably lower.
- Remove --speculative-config if you really want more context, but I highly recommend keeping it.
- Note: This information is based on my current understanding and testing. Optimal configurations may vary depending on your specific hardware setup. For further details, please refer to the official vLLM documentation.
Acknowledgements
- Lorbus for the README.md format
- Alibaba / Qwen team for the base Qwen3.6-27B model
- Intel AutoRound team for the quantization framework
- vLLM project for the inference engine and Qwen3_5 MTP support