# Qwen3.6-35B-A3B-TQ3-native
A native TQ3 checkpoint of Qwen/Qwen3.6-35B-A3B (released 2026-04-16). 35B total parameters, 3B active per token, compressed to 3 bits per weight via TurboQuant — the HIGGS scalar quantization scheme (Walsh-Hadamard rotation + Lloyd-Max codebook).
## Why this checkpoint exists
The original Qwen/Qwen3.6-35B-A3B is ~70 GB at bfloat16 and will not load on a 48 GB Apple Silicon Mac. This checkpoint is ~16.5 GB and runs on a 48 GB M4 Pro with comfortable headroom.
| Format | Disk | Runtime resident | Fits on 48 GB Mac? |
|---|---|---|---|
| Qwen/Qwen3.6-35B-A3B (bfloat16) | ~70 GB | >70 GB | ❌ |
| This checkpoint (TQ3 native) | ~16 GB | ~20-22 GB | ✅ |
TQ3 and MoE compress different axes: MoE sparsifies which experts fire per token (3B active / 35B total), TQ3 compresses the bit width of every stored weight. Stacking them gives you "a 35B-class MoE that fits and runs on a laptop."
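As a back-of-envelope sanity check on those numbers, here is a rough size estimate (a sketch; the exact total depends on which tensors stay uncompressed, listed under Files below):

```python
# Rough size estimate: 35B weights at 3 bits each, plus one float32 norm
# per 128-weight group. Ignores bit-packing overhead and the uncompressed
# embeddings / layernorms / lm_head, which account for most of the
# remaining gap up to ~16 GB on disk.
total_params = 35e9
bf16_gb  = total_params * 2 / 1e9         # ~70 GB at 2 bytes per weight
codes_gb = total_params * 3 / 8 / 1e9     # ~13.1 GB of packed 3-bit codes
norms_gb = total_params / 128 * 4 / 1e9   # ~1.1 GB of per-group norms
print(f"bf16 ~{bf16_gb:.0f} GB -> TQ3 ~{codes_gb + norms_gb:.1f} GB + uncompressed tensors")
```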
## How to use

### MLX (Apple Silicon)
Install turboquant-vllm — the loader is MLX-compatible and registers with mlx_lm:
```bash
pip install git+https://github.com/varjoranta/turboquant-vllm.git
huggingface-cli download varjosoft/Qwen3.6-35B-A3B-TQ3-native --local-dir ~/models/qwen3.6-35b-a3b-tq3
```
Serve through the standard mlx_lm OpenAI-compatible server with a small shim:
```bash
python3 -m turboquant_vllm.mlx_serve --model ~/models/qwen3.6-35b-a3b-tq3 --port 8080

# or a bare generate:
python3 -c "
from turboquant_vllm.mlx_loader import load_tq3
import mlx.core as mx
from mlx_lm.generate import generate

model, tok = load_tq3('$HOME/models/qwen3.6-35b-a3b-tq3')
mx.eval(model.parameters())
print(generate(model, tok, prompt='The capital of Finland is', max_tokens=64))
"
```
### vLLM / CUDA (GPUs)
Use the turboquant-plus-vllm plugin or serve through the upstream --quantization turboquant path once vLLM PR #39970 and its MoE follow-up land.
```bash
# Plugin (available today):
pip install turboquant-plus-vllm
vllm serve varjosoft/Qwen3.6-35B-A3B-TQ3-native
```
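For offline (non-server) use, the standard vLLM Python API should also work, assuming the plugin registers its quantization method with vLLM on import (vLLM discovers plugins via entry points). A hedged sketch:

```python
# Sketch: offline generation through vLLM's Python API. Assumes the
# turboquant-plus-vllm plugin is installed and self-registers at import time.
from vllm import LLM, SamplingParams

llm = LLM(model="varjosoft/Qwen3.6-35B-A3B-TQ3-native")
params = SamplingParams(max_tokens=32, temperature=0.0)
out = llm.generate(["The capital of Finland is"], params)
print(out[0].outputs[0].text)
```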
## What TQ3 actually does
- Walsh-Hadamard rotation on each group of 128 weights decorrelates them, so a shared grid fits well. Implemented as two random sign vectors plus an in-place FWHT, O(n log n).
- Lloyd-Max codebook with 8 entries (2^3, i.e. 3 bits per weight): the MSE-optimal scalar grid for a unit Gaussian, which is what the post-rotation distribution looks like.
- Shape-gain norm per group: store original_norm / reconstruction_norm (a classical VQ technique, Gray 1984), giving roughly 2× lower reconstruction error than storing the raw L2 norm.
This is the scalar case of HIGGS (Malinovskii et al., NAACL 2025, arXiv:2411.17525). The turboquant name is kept for API / plugin-package compatibility; the implementation converged onto HIGGS during practical simplification.
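For intuition, here is a minimal NumPy sketch of the per-group encode path described above. It is not the repo's API: the codebook constants are the classical 3-bit Lloyd-Max output levels for N(0,1), and the sign-vector handling is simplified; the real loader rebuilds its exact state from tq_config.json.

```python
import numpy as np

GROUP = 128  # group_size from tq_config.json
# Classical 8-level (3-bit) Lloyd-Max grid for N(0,1) -- illustrative values.
CODEBOOK = np.array([-2.152, -1.344, -0.756, -0.245,
                      0.245,  0.756,  1.344,  2.152])

def fwht(x):
    """In-place Fast Walsh-Hadamard Transform (orthonormal), O(n log n)."""
    n = x.shape[-1]
    h = 1
    while h < n:
        y = x.reshape(-1, 2, h)          # view: blocks of 2h split into two halves
        a = y[:, 0, :].copy()
        y[:, 0, :] = a + y[:, 1, :]
        y[:, 1, :] = a - y[:, 1, :]
        h *= 2
    return x / np.sqrt(n)

def quantize_group(w, s1, s2):
    """Encode one 128-weight group -> (3-bit code indices, stored group norm)."""
    rotated = fwht(w * s1) * s2                        # decorrelate -> ~Gaussian shape
    norm = np.linalg.norm(rotated)
    unit = rotated * (np.sqrt(GROUP) / norm)           # unit per-weight variance
    idx = np.abs(unit[:, None] - CODEBOOK).argmin(1)   # nearest Lloyd-Max entry
    return idx.astype(np.uint8), np.float32(norm)      # -> tq_packed / tq_norms
```

In the checkpoint, the uint8 indices are additionally bit-packed (8 codes into 3 bytes) inside the .tq_packed tensor; that packing step is omitted here for clarity.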
## Measured quality and throughput

### On a MacBook Pro (M4 Pro, 48 GB unified memory)
Loaded via turboquant-vllm's MLX loader. Source: mac-mlx-bench.py + mac-gsm8k-mlx.py in the reproduce bundle.
| Metric | Value |
|---|---|
| Cold load | 4.3 s |
| Resident memory (after first prefill) | 17.2 GB (on a 48 GB laptop) |
| Prefill throughput | 80 tok/s (linear scaling up to 2 K prompt tokens) |
| Decode throughput bs=1 | 36.15 tok/s |
| GSM8K 5-shot CoT, 50-example subsample | 43/50 = 86.0 % (≈ 88 % after correcting a scoring edge case) |
### On a single A100-80GB (via vLLM)
Via the feat/turboquant-moe branch of vLLM. Phase-1 Python-level dequant path — the fused-kernel CUDA follow-up PR will replace this.
| Metric | Value |
|---|---|
| Model GPU memory after TQ3 compression | 16.48 GiB (4.25× compression from ~70 GB bf16) |
| Decode throughput bs=1 | 0.14 tok/s |
| GSM8K 5-shot CoT, 50-example subsample | 38/50 = 76.0 % raw; 88.4 % over the 43 completed requests (7 client-side timeouts were scored as incorrect) |
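For intuition on what that Phase-1 path computes, here is the matching decode sketch, continuing the NumPy example from "What TQ3 actually does" above (it reuses fwht, CODEBOOK, GROUP, and quantize_group; again illustrative, not the repo's API):

```python
import numpy as np

def dequantize_group(idx, norm, s1, s2):
    """Decode one group: codebook lookup, shape-gain rescale, undo the rotation."""
    recon = CODEBOOK[idx]                             # grid values per code
    recon = recon * (norm / np.linalg.norm(recon))    # original_norm / reconstruction_norm
    return fwht(recon * s2) * s1                      # orthonormal FWHT is self-inverse

# Roundtrip check on one random group:
rng = np.random.default_rng(42)
w = rng.standard_normal(GROUP)
s1, s2 = rng.choice([-1.0, 1.0], size=(2, GROUP))
idx, norm = quantize_group(w, s1, s2)
w_hat = dequantize_group(idx, norm, s1, s2)
print(np.linalg.norm(w - w_hat) / np.linalg.norm(w))  # ~0.2 relative error at 3 bits
```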
### Coherence smoke (both environments)

- "The capital of Finland is" → Helsinki plus correct factoids ✓
- "2 + 2 equals" → 4 with correct reasoning ✓
- Two-train catch-up word problem → correct CoT setup, correct answer ✓
## Files
Standard HuggingFace format. Each weight that was compressed has two companion tensors:
| Original key | Added keys |
|---|---|
| `model.layers.N.self_attn.q_proj.weight` | `...q_proj.weight.tq_packed` (uint8), `...q_proj.weight.tq_norms` (float32) |
| `model.layers.N.mlp.experts.E.gate_proj.weight` | `...gate_proj.weight.tq_packed`, `...gate_proj.weight.tq_norms` |
| … | … |
Embeddings, layernorms, biases, and the final LM head are stored uncompressed.
tq_config.json captures the quantizer state (bits=3, group_size=128, quantizer_seed=42) so loaders can rebuild the identical Walsh-Hadamard rotation + codebook deterministically.
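A sketch of how a loader might consume that file (the three field names come from the description above; the path matches the MLX example, and the RNG type is an illustrative assumption):

```python
import json
from pathlib import Path

import numpy as np

cfg_path = Path.home() / "models/qwen3.6-35b-a3b-tq3/tq_config.json"
cfg = json.loads(cfg_path.read_text())
assert cfg["bits"] == 3 and cfg["group_size"] == 128

# The seed pins down the per-group random sign vectors of the rotation, so
# decode-time rotations match the ones used at compression time exactly.
rng = np.random.default_rng(cfg["quantizer_seed"])  # 42
```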
## Reproduce
```bash
git clone https://github.com/varjoranta/turboquant-vllm
cd turboquant-vllm
uv venv --python 3.13 && uv pip install -e . accelerate 'transformers>=5.5'
python3 scripts/publish_model.py compress Qwen/Qwen3.6-35B-A3B ./qwen3.6-tq3 --bits 3
python3 scripts/publish_model.py upload ./qwen3.6-tq3 <your-repo>
```
Needs ≥ 100 GB CPU RAM (full bf16 model loaded once during compression). Inference needs none of that — ~20 GB resident is enough.
## Citations
If this checkpoint helps your work, please cite both the base model and the HIGGS paper:
```bibtex
@article{qwen2026qwen36,
  title  = {Qwen3.6-35B-A3B},
  author = {{Qwen Team, Alibaba}},
  year   = {2026},
  url    = {https://huggingface.co/Qwen/Qwen3.6-35B-A3B}
}

@inproceedings{malinovskii2025higgs,
  title     = {HIGGS: Pushing the Limits of Large Language Model Quantization via
               Hadamard Rotations and MSE-Optimal Grids},
  author    = {Malinovskii, Vladimir and Mazur, Andrei and Ilin, Ivan and
               Kuznedelev, Denis and Burlachenko, Konstantin and Yi, Kai and
               Alistarh, Dan and Richtarik, Peter},
  booktitle = {NAACL},
  year      = {2025},
  url       = {https://aclanthology.org/2025.naacl-long.543/}
}
```
## License
Inherits Apache-2.0 from the base model.
## Links

- Compression / loader code: varjoranta/turboquant-vllm
- vLLM upstream PR: vllm-project/vllm#39970