# Qwen3.6-35B-A3B-NVFP4

NVFP4-quantized version of Qwen/Qwen3.6-35B-A3B — the latest Qwen MoE with 256 experts, 3B active parameters, and state-of-the-art coding/agentic performance.

**67 GB → 21.9 GB. Single NVIDIA Blackwell GPU. 168 tok/s.**
## Why This Model

Qwen3.6-35B-A3B is the new king of the MoE class:
- SWE-bench Verified: 73.4 — surpasses models 10x its active parameter count
- Terminal-Bench 2.0: 51.5 — best-in-class agentic coding
- QwenWebBench: 1397 ELO — real-world web task performance
- 256 experts, 3B active — extreme sparsity = extreme speed
- 262K-1M context — native 262K, extensible to 1 million tokens
- Gated DeltaNet + Attention hybrid — next-gen architecture
At NVFP4, it runs at 168 tok/s on a single Blackwell GPU — faster than Gemma4 MoE (130 tok/s) with dramatically better benchmark scores.
## Key Specs

| Spec | Value |
|---|---|
| Base model | Qwen/Qwen3.6-35B-A3B |
| Architecture | Qwen3.5 MoE — 35B total, 3B active, 256 experts (8 routed + 1 shared) |
| Quantization | NVFP4 W4A4 (weights FP4, activations FP4, scales FP8) |
| Format | compressed-tensors (native vLLM support) |
| Tool | vllm-project/llm-compressor (main) |
| Calibration | 512 samples, ultrachat_200k, seq_len=2048, `moe_calibrate_all_experts=True` |
| Size | 21.9 GB |
| Max context | 262,144 tokens (native) |
| Requires | NVIDIA Blackwell GPU (SM 120), vLLM nightly (cu130) |
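The published sizes can be sanity-checked with simple arithmetic. NVFP4 stores FP4 weights with one FP8 scale per 16-element block (4 + 8/16 = 4.5 bits per weight), and the unquantized BF16 modules (lm_head, router gates, vision) push the observed average slightly above that. A back-of-the-envelope check using only figures from the table above:

```python
# Back-of-the-envelope check on the published checkpoint sizes. All inputs
# come from the spec table; the bits-per-parameter reading is an estimate.
GB = 1e9
total_params = 35e9        # "35B total" from the spec table
bf16_size_gb = 67.0        # original checkpoint size
nvfp4_size_gb = 21.9       # quantized checkpoint size

compression = bf16_size_gb / nvfp4_size_gb
bits_per_param = nvfp4_size_gb * GB * 8 / total_params

print(f"compression ratio: {compression:.2f}x")      # compression ratio: 3.06x
print(f"effective bits/param: {bits_per_param:.2f}")  # effective bits/param: 5.01
```

The ~5 bits/param average is consistent with 4.5-bit NVFP4 weights plus a BF16 remainder.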
## Quickstart

### vLLM

```bash
vllm serve Lna-Lab/Qwen3.6-35B-A3B-NVFP4 \
  --max-model-len 32768 \
  --reasoning-parser qwen3 \
  --kv-cache-dtype fp8
```
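Once the server is up, it speaks the OpenAI-compatible API. A minimal stdlib-only client sketch, assuming vLLM's default port 8000 (the model name must match the served path):

```python
# Minimal sketch of a chat request against the vLLM server started above.
# Assumes the default port 8000; uses only the Python standard library.
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "Lna-Lab/Qwen3.6-35B-A3B-NVFP4") -> dict:
    """Assemble the JSON body for POST /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.6,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    body = json.dumps(build_chat_request(prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```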
### With tool calling (agentic)

```bash
vllm serve Lna-Lab/Qwen3.6-35B-A3B-NVFP4 \
  --max-model-len 32768 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --kv-cache-dtype fp8
```
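With `--enable-auto-tool-choice` set, vLLM accepts OpenAI-style function-calling requests. A sketch of such a request body (the `get_weather` tool is a hypothetical example, not part of this model):

```python
# Sketch of a tool-calling request body in the OpenAI function-calling format.
# The get_weather tool is a hypothetical illustration.
def build_tool_request(prompt: str) -> dict:
    return {
        "model": "Lna-Lab/Qwen3.6-35B-A3B-NVFP4",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "tool_choice": "auto",
    }
```

When the model decides to call a tool, the response carries `choices[0].message.tool_calls`, extracted from the model output by the `qwen3_coder` parser.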
### Docker

```bash
docker run --gpus '"device=0"' -p 8016:8016 \
  -v /path/to/model:/models/current:ro \
  --shm-size 16gb \
  vllm/vllm-openai:cu130-nightly \
  vllm serve /models/current --port 8016 --max-model-len 32768 \
  --reasoning-parser qwen3 --kv-cache-dtype fp8
```
## Benchmark

Single NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM).
| Test | Speed | Tokens | Result |
|---|---|---|---|
| English (CAP theorem) | 161 tok/s | 256 | PASS |
| Code (async scheduler) | 162 tok/s | 512 | PASS |
| Math (Bayes' theorem) | 162 tok/s | 512 | PASS |
| Reasoning (architecture) | 163 tok/s | 512 | PASS |
| Container burst (x3) | 168 tok/s | 512 | PASS — stable |
### Speed Comparison (NVFP4, single GPU)
| Model | Active Params | tok/s | Relative |
|---|---|---|---|
| Qwen3.6-35B MoE | 3B | 168 | 1.0x |
| Gemma4-26B MoE | 3.8B | 130 | 0.77x |
| Qwen3.5-27B Dense | 27B | 57 | 0.34x |
| Gemma4-31B Dense | 31B | 51 | 0.30x |
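The Relative column follows directly from the tok/s figures, normalized to this model's 168 tok/s:

```python
# Recompute the "Relative" column of the comparison table above,
# normalized to this model's 168 tok/s baseline.
speeds = {
    "Qwen3.6-35B MoE": 168,
    "Gemma4-26B MoE": 130,
    "Qwen3.5-27B Dense": 57,
    "Gemma4-31B Dense": 51,
}
baseline = speeds["Qwen3.6-35B MoE"]
relative = {name: round(s / baseline, 2) for name, s in speeds.items()}
# relative == {"Qwen3.6-35B MoE": 1.0, "Gemma4-26B MoE": 0.77,
#              "Qwen3.5-27B Dense": 0.34, "Gemma4-31B Dense": 0.3}
```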
## Quantization Details

### Recipe

```python
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
)
```
### Calibration

- Dataset: HuggingFaceH4/ultrachat_200k (train_sft split)
- Samples: 512
- Max sequence length: 2048
- `moe_calibrate_all_experts=True` — ensures all 256 experts receive calibration data
### Reproduction

```python
from transformers import Qwen3_5MoeForConditionalGeneration, AutoProcessor, AutoTokenizer
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3.6-35B-A3B"

model = Qwen3_5MoeForConditionalGeneration.from_pretrained(MODEL_ID, dtype="auto", trust_remote_code=True)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# NVFP4 on all Linear layers; keep lm_head, vision tower, and router gates in BF16
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
)

# Calibration data: 512 chat samples rendered through the chat template
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:512]")
ds = ds.shuffle(seed=42)

def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(sample["text"], padding=False, max_length=2048,
                     truncation=True, add_special_tokens=False)

ds = ds.map(tokenize, remove_columns=ds.column_names)

# One-shot PTQ; route calibration activations through all 256 experts
oneshot(model=model, dataset=ds, recipe=recipe,
        max_seq_length=2048, num_calibration_samples=512,
        moe_calibrate_all_experts=True)

model.save_pretrained("Qwen3.6-35B-A3B-NVFP4", save_compressed=True)
processor.save_pretrained("Qwen3.6-35B-A3B-NVFP4")
tokenizer.save_pretrained("Qwen3.6-35B-A3B-NVFP4")
```
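After export, the checkpoint's `config.json` should carry a `quantization_config` block written by llm-compressor, with the ignored modules listed. A quick sanity check along these lines (treat the exact field layout as an assumption about the compressed-tensors format):

```python
# Quick sanity check on an exported checkpoint: confirm a quantization_config
# was written and that lm_head stayed unquantized. The field names follow the
# compressed-tensors format; treat the exact layout as an assumption.
import json
from pathlib import Path

def check_checkpoint(path: str) -> bool:
    cfg = json.loads((Path(path) / "config.json").read_text())
    qcfg = cfg.get("quantization_config")
    if qcfg is None:
        return False  # no quantization metadata at all
    return "lm_head" in qcfg.get("ignore", [])
```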
## Environment

| Package | Version |
|---|---|
| torch | 2.11.0+cu130 |
| transformers | 5.5.4 |
| llmcompressor | 0.1.dev (main @ 3084520) |
| compressed-tensors | 0.15.1a20260414 |
| CUDA | 13.0 |
## Requirements
- GPU: NVIDIA Blackwell (SM 120)
- VRAM: ~22 GB minimum (model only)
- Software: vLLM nightly (cu130)
## Notes
- Multimodal (vision) preserved in BF16.
- Gated DeltaNet layers are a hybrid attention+SSM architecture — unique to Qwen3.5/3.6.
- NVFP4 is Blackwell-specific. Will not work on Ampere/Hopper.
- Use `--kv-cache-dtype fp8` for 2x KV capacity at no quality cost.
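The 2x KV-capacity claim follows directly from element width: cache size scales linearly with bytes per element, and FP8 halves BF16's 2 bytes. A sketch with placeholder attention dimensions (this card doesn't list the model's layer/head counts, and in a hybrid DeltaNet model only the full-attention layers hold a conventional KV cache):

```python
# Why --kv-cache-dtype fp8 doubles KV capacity: cache size is linear in bytes
# per element. The layer/head dims below are hypothetical placeholders, not
# this model's actual config.
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    # K and V each store num_kv_heads * head_dim elements per layer
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

bf16 = kv_bytes_per_token(48, 4, 128, 2)  # 2-byte BF16 elements
fp8 = kv_bytes_per_token(48, 4, 128, 1)   # 1-byte FP8 elements
# bf16 == 2 * fp8, so the same VRAM holds twice as many cached tokens
```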
## Credits
- Original model: Qwen
- Quantization tool: llm-compressor
- Quantized by: Lna-Lab