# Qwen3.6-35B-A3B INT8 AutoRound

This is an unofficial INT8 quantized version of Qwen3.6-35B-A3B, created with Intel's AutoRound.
## Available versions

- There are three versions: the main branch (group size -1) and the `w8a16-gs128` and `w8a16-gs32` branches.
- The main branch (gs=-1) uses about 3.2 GB less VRAM than the gs32 branch while maintaining nearly identical quality.
- For most users, the main branch is recommended. If you prioritize maximum quality, the `w8a16-gs128` or `w8a16-gs32` branch might be better; the performance difference in practical use is minimal.
- To use another version, specify `--revision` or switch branches in your download tool, as shown below.
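For example, with `huggingface-cli`:

```bash
# Download the main branch (the default revision)
huggingface-cli download Minachist/Qwen3.6-35B-A3B-INT8-AutoRound \
  --local-dir ./Qwen3.6-35B-A3B-INT8-AutoRound

# Or pin another branch with --revision
huggingface-cli download Minachist/Qwen3.6-35B-A3B-INT8-AutoRound \
  --revision w8a16-gs32 \
  --local-dir ./Qwen3.6-35B-A3B-INT8-AutoRound-gs32
```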
## Benchmarks
- Evaluated Qwen3.6-35B-A3B-INT8-AutoRound (gs128 branch) with default generation configs. The official evaluation protocol may differ.
| Benchmark | Mine (INT8 gs128) | Official (BF16) | Δ |
|---|---|---|---|
| MMLU-Redux | 93.28% ± 0.33% | 93.3% | −0.02% |
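A run along these lines should be reproducible with lm-evaluation-harness; the sketch below assumes the `lm_eval` vLLM backend and an `mmlu_redux` task name, both of which depend on your harness version:

```bash
# Check `lm_eval --tasks list` to confirm the exact MMLU-Redux task name
lm_eval --model vllm \
  --model_args pretrained=Minachist/Qwen3.6-35B-A3B-INT8-AutoRound,revision=w8a16-gs128,tensor_parallel_size=2 \
  --tasks mmlu_redux \
  --batch_size auto
```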
## Quantization details
| Field | Main branch | w8a16-gs128 branch | w8a16-gs32 branch |
|---|---|---|---|
| Base | Qwen/Qwen3.6-35B-A3B | Qwen/Qwen3.6-35B-A3B | Qwen/Qwen3.6-35B-A3B |
| Method | AutoRound (intel/auto-round) | AutoRound (intel/auto-round) | AutoRound (intel/auto-round) |
| Scheme | W8A16 | W8A16 | W8A16 |
| Bits | 8 | 8 | 8 |
| Group size | -1 | 128 | 32 |
| Symmetric | yes | yes | yes |
| Unquantized layers | visual, mtp, linear_attn, mlp.gate, shared_expert, embed_tokens, lm_head | Main + self_attn | Main + self_attn |
| Calibration dataset | NeelNanda/pile-10k | NeelNanda/pile-10k | NeelNanda/pile-10k |
| Calibration samples | 512 | 128 | 768 |
| Iterations | 1000 | 175 | 1000 |
| Batch size | 8 | 36 | 16 |
| Sequence length | 2048 | 2048 | 4096 |
| GPU used for quant | 2× RTX 3090 | 2× RTX 3090 | 2× RTX 3090 |
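For reference, the main-branch settings in this table map onto the `auto-round` CLI roughly as follows. This is a sketch under the assumption of a recent `intel/auto-round` release; exact flag names can vary between versions, and the per-layer exclusions listed above are omitted:

```bash
# W8A16, symmetric, per-channel (group size -1), main-branch calibration settings
auto-round \
  --model Qwen/Qwen3.6-35B-A3B \
  --bits 8 \
  --group_size -1 \
  --dataset NeelNanda/pile-10k \
  --nsamples 512 \
  --iters 1000 \
  --batch_size 8 \
  --seqlen 2048 \
  --format auto_round \
  --output_dir ./Qwen3.6-35B-A3B-INT8-AutoRound
```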
## How to use
- This model was tested on the latest `docker.io/vllm/vllm-openai:cu130-nightly` image.
- vLLM is recommended; a containerized launch sketch follows.
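If you run the Docker image above, a launch could look like this minimal sketch (the mount point, port, and `--model` argument are assumptions; arguments after the image name are forwarded to the vLLM server, so the example args below apply there too):

```bash
# Assumes the model was downloaded to ./Qwen3.6-35B-A3B-INT8-AutoRound
docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
  -v "$PWD/Qwen3.6-35B-A3B-INT8-AutoRound:/model" \
  -e VLLM_FLASHINFER_MOE_BACKEND=latency \
  docker.io/vllm/vllm-openai:cu130-nightly \
  --model /model --tensor-parallel-size 2
```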
- Example args (for 2× RTX 3090 users):

```bash
vllm serve ./Qwen3.6-35B-A3B-INT8-AutoRound \
  --tensor-parallel-size 2 \
  --attention-backend FLASHINFER \
  --performance-mode interactivity \
  --max-model-len auto \
  --max-num-batched-tokens 2048 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.92 \
  --compilation-config '{"mode":"VLLM_COMPILE","cudagraph_capture_sizes":[4]}' \
  -O3 \
  --async-scheduling \
  --language-model-only \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
  --default-chat-template-kwargs.preserve_thinking true \
  --enable-prefix-caching \
  --enable-chunked-prefill
```
- With these settings, you get around 200k tokens of context at 210+ tk/s.
- Make sure to set `VLLM_FLASHINFER_MOE_BACKEND=latency` to get more tk/s.
- You can also add `--kv-cache-dtype fp8_e4m3 --calculate-kv-scales` to get more KV cache capacity.
- You can add `--enforce-eager` (you might need to remove `--compilation-config`) or set the `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False` environment variable (requires `--disable-custom-all-reduce`) to allocate more VRAM to the KV cache, but tk/s will be noticeably lower.
- Remove `--speculative-config` if you really want more context, but I highly recommend keeping it.
- Note: This information is based on my current understanding and testing. Optimal configurations may vary depending on your specific hardware setup. For further details, please refer to the official vLLM documentation.
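Once the server is up, you can exercise the OpenAI-compatible API. A minimal request sketch (port 8000 is vLLM's default; the model name mirrors the path passed to `vllm serve`):

```bash
# Basic chat completion against the local vLLM server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "./Qwen3.6-35B-A3B-INT8-AutoRound",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```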
## Acknowledgements
- Lorbus for the README.md format
- Alibaba / Qwen team for the base Qwen3.6-35B-A3B model
- Intel AutoRound team for the quantization framework
- vLLM project for the inference engine and Qwen3_5 MTP support