# Qwen3.6-35B-A3B-NVFP4

NVFP4-quantized version of Qwen/Qwen3.6-35B-A3B — the latest Qwen MoE with 256 experts, 3B active parameters, and state-of-the-art coding/agentic performance.

**67 GB → 21.9 GB. Single NVIDIA Blackwell GPU. 168 tok/s.**
## Why This Model

Qwen3.6-35B-A3B is the new king of the MoE class:
- SWE-bench Verified: 73.4 — surpasses models 10x its active parameter count
- Terminal-Bench 2.0: 51.5 — best-in-class agentic coding
- QwenWebBench: 1397 ELO — real-world web task performance
- 256 experts, 3B active — extreme sparsity = extreme speed
- 262K-1M context — native 262K, extensible to 1 million tokens
- Gated DeltaNet + Attention hybrid — next-gen architecture
At NVFP4, it runs at 168 tok/s on a single Blackwell GPU — faster than Gemma4 MoE (130 tok/s) with dramatically better benchmark scores.
## Key Specs

| Spec | Value |
|---|---|
| Base model | Qwen/Qwen3.6-35B-A3B |
| Architecture | Qwen3.5 MoE — 35B total, 3B active, 256 experts (8 routed + 1 shared) |
| Quantization | NVFP4 W4A4 (weights FP4, activations FP4, scales FP8) |
| Format | compressed-tensors (native vLLM support) |
| Tool | vllm-project/llm-compressor (main) |
| Calibration | 512 samples, ultrachat_200k, seq_len=2048, `moe_calibrate_all_experts=True` |
| Size | 21.9 GB |
| Max context | 262,144 tokens (native) |
| Requires | NVIDIA Blackwell GPU (SM 120), vLLM nightly (cu130) |
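The published sizes can be sanity-checked with simple arithmetic. NVFP4 stores FP4 weights with one FP8 scale per 16-element block (4 + 8/16 = 4.5 bits per weight), and the unquantized BF16 modules (lm_head, router gates, vision) push the observed average slightly above that. A back-of-the-envelope check using only figures from the table above:

```python
# Back-of-the-envelope check on the published checkpoint sizes. All inputs
# come from the spec table; the bits-per-parameter reading is an estimate.
GB = 1e9
total_params = 35e9        # "35B total" from the spec table
bf16_size_gb = 67.0        # original checkpoint size
nvfp4_size_gb = 21.9       # quantized checkpoint size

compression = bf16_size_gb / nvfp4_size_gb
bits_per_param = nvfp4_size_gb * GB * 8 / total_params

print(f"compression ratio: {compression:.2f}x")      # compression ratio: 3.06x
print(f"effective bits/param: {bits_per_param:.2f}")  # effective bits/param: 5.01
```

The ~5 bits/param average is consistent with 4.5-bit NVFP4 weights plus a BF16 remainder.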
## Quickstart

### vLLM

```bash
vllm serve Lna-Lab/Qwen3.6-35B-A3B-NVFP4 \
  --max-model-len 32768 \
  --reasoning-parser qwen3 \
  --kv-cache-dtype fp8
```
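Once the server is up, it speaks the OpenAI-compatible API. A minimal stdlib-only client sketch, assuming vLLM's default port 8000 (the model name must match the served path):

```python
# Minimal sketch of a chat request against the vLLM server started above.
# Assumes the default port 8000; uses only the Python standard library.
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "Lna-Lab/Qwen3.6-35B-A3B-NVFP4") -> dict:
    """Assemble the JSON body for POST /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.6,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    body = json.dumps(build_chat_request(prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```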
### With tool calling (agentic)

```bash
vllm serve Lna-Lab/Qwen3.6-35B-A3B-NVFP4 \
  --max-model-len 32768 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --kv-cache-dtype fp8
```
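With `--enable-auto-tool-choice` set, vLLM accepts OpenAI-style function-calling requests. A sketch of such a request body (the `get_weather` tool is a hypothetical example, not part of this model):

```python
# Sketch of a tool-calling request body in the OpenAI function-calling format.
# The get_weather tool is a hypothetical illustration.
def build_tool_request(prompt: str) -> dict:
    return {
        "model": "Lna-Lab/Qwen3.6-35B-A3B-NVFP4",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "tool_choice": "auto",
    }
```

When the model decides to call a tool, the response carries `choices[0].message.tool_calls`, extracted from the model output by the `qwen3_coder` parser.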
### Docker

```bash
docker run --gpus '"device=0"' -p 8016:8016 \
  -v /path/to/model:/models/current:ro \
  --shm-size 16gb \
  vllm/vllm-openai:cu130-nightly \
  vllm serve /models/current --port 8016 --max-model-len 32768 \
  --reasoning-parser qwen3 --kv-cache-dtype fp8
```
## Benchmark

Single NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM).
| Test | Speed | Tokens | Result |
|---|---|---|---|
| English (CAP theorem) | 161 tok/s | 256 | PASS |
| Code (async scheduler) | 162 tok/s | 512 | PASS |
| Math (Bayes' theorem) | 162 tok/s | 512 | PASS |
| Reasoning (architecture) | 163 tok/s | 512 | PASS |
| Container burst (x3) | 168 tok/s | 512 | PASS — stable |
### Speed Comparison (NVFP4, single GPU)
| Model | Active Params | tok/s | Relative |
|---|---|---|---|
| Qwen3.6-35B MoE | 3B | 168 | 1.0x |
| Gemma4-26B MoE | 3.8B | 130 | 0.77x |
| Qwen3.5-27B Dense | 27B | 57 | 0.34x |
| Gemma4-31B Dense | 31B | 51 | 0.30x |
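The Relative column follows directly from the tok/s figures, normalized to this model's 168 tok/s:

```python
# Recompute the "Relative" column of the comparison table above,
# normalized to this model's 168 tok/s baseline.
speeds = {
    "Qwen3.6-35B MoE": 168,
    "Gemma4-26B MoE": 130,
    "Qwen3.5-27B Dense": 57,
    "Gemma4-31B Dense": 51,
}
baseline = speeds["Qwen3.6-35B MoE"]
relative = {name: round(s / baseline, 2) for name, s in speeds.items()}
# relative == {"Qwen3.6-35B MoE": 1.0, "Gemma4-26B MoE": 0.77,
#              "Qwen3.5-27B Dense": 0.34, "Gemma4-31B Dense": 0.3}
```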
## Quantization Details

### Recipe

```python
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
)
```
### Calibration

- Dataset: HuggingFaceH4/ultrachat_200k (train_sft split)
- Samples: 512
- Max sequence length: 2048
- `moe_calibrate_all_experts=True` — ensures all 256 experts receive calibration data
### Reproduction

```python
from transformers import Qwen3_5MoeForConditionalGeneration, AutoProcessor, AutoTokenizer
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3.6-35B-A3B"

model = Qwen3_5MoeForConditionalGeneration.from_pretrained(MODEL_ID, dtype="auto", trust_remote_code=True)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# NVFP4 on all Linear layers; keep lm_head, vision tower, and router gates in BF16
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
)

# Calibration data: 512 chat samples rendered through the chat template
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:512]")
ds = ds.shuffle(seed=42)

def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(sample["text"], padding=False, max_length=2048,
                     truncation=True, add_special_tokens=False)

ds = ds.map(tokenize, remove_columns=ds.column_names)

# One-shot PTQ; route calibration activations through all 256 experts
oneshot(model=model, dataset=ds, recipe=recipe,
        max_seq_length=2048, num_calibration_samples=512,
        moe_calibrate_all_experts=True)

model.save_pretrained("Qwen3.6-35B-A3B-NVFP4", save_compressed=True)
processor.save_pretrained("Qwen3.6-35B-A3B-NVFP4")
tokenizer.save_pretrained("Qwen3.6-35B-A3B-NVFP4")
```
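After export, the checkpoint's `config.json` should carry a `quantization_config` block written by llm-compressor, with the ignored modules listed. A quick sanity check along these lines (treat the exact field layout as an assumption about the compressed-tensors format):

```python
# Quick sanity check on an exported checkpoint: confirm a quantization_config
# was written and that lm_head stayed unquantized. The field names follow the
# compressed-tensors format; treat the exact layout as an assumption.
import json
from pathlib import Path

def check_checkpoint(path: str) -> bool:
    cfg = json.loads((Path(path) / "config.json").read_text())
    qcfg = cfg.get("quantization_config")
    if qcfg is None:
        return False  # no quantization metadata at all
    return "lm_head" in qcfg.get("ignore", [])
```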
## Environment

| Package | Version |
|---|---|
| torch | 2.11.0+cu130 |
| transformers | 5.5.4 |
| llmcompressor | 0.1.dev (main @ 3084520) |
| compressed-tensors | 0.15.1a20260414 |
| CUDA | 13.0 |
## Requirements
- GPU: NVIDIA Blackwell (SM 120)
- VRAM: ~22 GB minimum (model only)
- Software: vLLM nightly (cu130)
## Notes
- Multimodal (vision) preserved in BF16.
- Gated DeltaNet layers are a hybrid attention+SSM architecture — unique to Qwen3.5/3.6.
- NVFP4 is Blackwell-specific. Will not work on Ampere/Hopper.
- Use `--kv-cache-dtype fp8` for 2x KV capacity at no quality cost.
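The 2x KV-capacity claim follows directly from element width: cache size scales linearly with bytes per element, and FP8 halves BF16's 2 bytes. A sketch with placeholder attention dimensions (this card doesn't list the model's layer/head counts, and in a hybrid DeltaNet model only the full-attention layers hold a conventional KV cache):

```python
# Why --kv-cache-dtype fp8 doubles KV capacity: cache size is linear in bytes
# per element. The layer/head dims below are hypothetical placeholders, not
# this model's actual config.
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    # K and V each store num_kv_heads * head_dim elements per layer
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

bf16 = kv_bytes_per_token(48, 4, 128, 2)  # 2-byte BF16 elements
fp8 = kv_bytes_per_token(48, 4, 128, 1)   # 1-byte FP8 elements
# bf16 == 2 * fp8, so the same VRAM holds twice as many cached tokens
```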
## Credits
- Original model: Qwen
- Quantization tool: llm-compressor
- Quantized by: Lna-Lab