Qwen3.6-35B-A3B-TQ3-native

A native TQ3 checkpoint of Qwen/Qwen3.6-35B-A3B (released 2026-04-16). 35B total parameters, 3B active per token, compressed to 3 bits per weight via TurboQuant — the HIGGS scalar quantization scheme (Walsh-Hadamard rotation + Lloyd-Max codebook).

Why this checkpoint exists

The original Qwen/Qwen3.6-35B-A3B is ~70 GB at bfloat16 and will not load on a 48 GB Apple Silicon Mac. This checkpoint is ~16.5 GB and runs on a 48 GB M4 Pro with comfortable headroom.

Format                            Disk       Runtime resident   Fits on 48 GB Mac?
Qwen/Qwen3.6-35B-A3B (bfloat16)   ~70 GB     >70 GB             No
This checkpoint (TQ3 native)      ~16.5 GB   ~20-22 GB          Yes

TQ3 and MoE compress different axes: MoE sparsifies which experts fire per token (3B active / 35B total), TQ3 compresses the bit width of every stored weight. Stacking them gives you "a 35B-class MoE that fits and runs on a laptop."
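
The size reduction follows directly from the bit budget. A back-of-the-envelope estimate (illustrative arithmetic; the remaining few GB come from the uncompressed embeddings, LM head, and metadata):

# Rough size estimate for 35B parameters (illustrative, overheads ignored)
total_params = 35e9
bf16_gb   = total_params * 2 / 1e9           # 2 bytes per weight          -> ~70 GB
packed_gb = total_params * 3 / 8 / 1e9       # 3 bits per weight, packed   -> ~13.1 GB
norms_gb  = total_params / 128 * 4 / 1e9     # one float32 norm per 128-weight group -> ~1.1 GB
print(f"{bf16_gb:.0f} GB bf16 -> ~{packed_gb + norms_gb:.1f} GB TQ3 + uncompressed embeddings/head")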

How to use

MLX (Apple Silicon)

Install turboquant-vllm — the loader is MLX-compatible and registers with mlx_lm:

pip install git+https://github.com/varjoranta/turboquant-vllm.git
huggingface-cli download varjosoft/Qwen3.6-35B-A3B-TQ3-native --local-dir ~/models/qwen3.6-35b-a3b-tq3

Serve through the standard mlx_lm OpenAI-compatible server with a small shim:

python3 -m turboquant_vllm.mlx_serve --model ~/models/qwen3.6-35b-a3b-tq3 --port 8080
# or a bare generate:
python3 -c "
from turboquant_vllm.mlx_loader import load_tq3
import mlx.core as mx
from mlx_lm.generate import generate
model, tok = load_tq3('$HOME/models/qwen3.6-35b-a3b-tq3')
mx.eval(model.parameters())
print(generate(model, tok, prompt='The capital of Finland is', max_tokens=64))
"

vLLM / CUDA (GPUs)

Use the turboquant-plus-vllm plugin or serve through the upstream --quantization turboquant path once vLLM PR #39970 and its MoE follow-up land.

# Plugin (available today):
pip install turboquant-plus-vllm
vllm serve varjosoft/Qwen3.6-35B-A3B-TQ3-native
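
For offline batch generation the standard vLLM Python API should also work. A hedged sketch, assuming the installed plugin registers its TQ3 loader automatically through vLLM's plugin mechanism:

from vllm import LLM, SamplingParams

# Assumes turboquant-plus-vllm hooks in on import once installed; no extra flags shown here.
llm = LLM(model="varjosoft/Qwen3.6-35B-A3B-TQ3-native")
params = SamplingParams(max_tokens=64, temperature=0.0)
outputs = llm.generate(["The capital of Finland is"], params)
print(outputs[0].outputs[0].text)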

What TQ3 actually does

  • Walsh-Hadamard rotation on each group of 128 weights decorrelates them, so a single shared grid fits every group well. Implemented with two random sign vectors plus an in-place FWHT (O(n log n)).
  • Lloyd-Max codebook with 8 entries (2³ levels, i.e. 3 bits per weight): the MSE-optimal scalar grid for a unit Gaussian, which is approximately what the post-rotation weights follow.
  • Shape-gain norm per group: the stored scale is original_norm / reconstruction_norm (a classical VQ technique, Gray 1984), giving roughly 2× lower reconstruction error than storing the raw L2 norm.

This is the scalar case of HIGGS (Malinovskii et al., NAACL 2025, arXiv:2411.17525). The turboquant name is kept for API and plugin-package compatibility; the implementation converged to HIGGS as it was simplified in practice. The sketch below walks through one group end to end.
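
A minimal numpy sketch of that pipeline for a single 128-weight group. The grid values, the exact composition of the two sign vectors with the FWHT, and the packing are illustrative assumptions, not the shipped implementation:

import numpy as np

GROUP = 128
rng = np.random.default_rng(42)                      # quantizer_seed from tq_config.json
s1 = rng.choice([-1.0, 1.0], GROUP)                  # the two random sign vectors
s2 = rng.choice([-1.0, 1.0], GROUP)

# Approximate 8-level Lloyd-Max grid for a unit Gaussian (2^3 levels = 3 bits).
GRID = np.array([-2.152, -1.344, -0.756, -0.245, 0.245, 0.756, 1.344, 2.152])

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform, O(n log n); it is its own inverse."""
    y, h, n = x.copy(), 1, x.size
    while h < n:
        for i in range(0, n, 2 * h):
            a, b = y[i:i + h].copy(), y[i + h:i + 2 * h].copy()
            y[i:i + h], y[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return y / np.sqrt(n)

def quantize_group(w):
    r = fwht(w * s1) * s2                            # sign flip -> rotate -> sign flip (order assumed)
    scale = np.linalg.norm(r) / np.sqrt(GROUP)       # bring the group to ~unit variance
    codes = np.abs(r[:, None] / scale - GRID).argmin(axis=1).astype(np.uint8)   # 3-bit indices
    # Shape-gain: store original_norm / reconstruction_norm so the group keeps its energy.
    gain = np.linalg.norm(r) / (np.linalg.norm(GRID[codes]) + 1e-12)
    return codes, np.float32(gain)

def dequantize_group(codes, gain):
    r_hat = GRID[codes] * gain
    return fwht(r_hat * s2) * s1                     # FWHT is involutory, so the same transform inverts it

w = rng.standard_normal(GROUP) * 0.02
codes, gain = quantize_group(w)
err = np.linalg.norm(w - dequantize_group(codes, gain)) / np.linalg.norm(w)
print(f"relative reconstruction error: {err:.3f}")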

Measured quality and throughput

On a MacBook Pro (M4 Pro, 48 GB unified memory)

Loaded via turboquant-vllm's MLX loader. Source: mac-mlx-bench.py + mac-gsm8k-mlx.py in the reproduce bundle.

Metric                                     Value
Cold load                                  4.3 s
Resident memory (after first prefill)      17.2 GB (on a 48 GB laptop)
Prefill throughput                         80 tok/s (scales linearly up to 2 K prompt tokens)
Decode throughput (bs=1)                   36.15 tok/s
GSM8K 5-shot CoT, 50-example subsample     43 / 50 = 86.0 % (scoring edge case; true score ≈ 88 %)

On a single A100-80GB (via vLLM)

Via the feat/turboquant-moe branch of vLLM. This is the phase-1 Python-level dequantization path; the fused-kernel CUDA follow-up PR will replace it.

Metric                                    Value
Model GPU memory after TQ3 compression    16.48 GiB (4.25× compression from ~70 GB bf16)
Decode throughput (bs=1)                  0.14 tok/s
GSM8K 5-shot CoT, 50 examples             38 / 50 = 76.0 % raw; 88.4 % on the 43 completed requests (7 client-side timeouts were scored as incorrect)

Coherence smoke (both environments)

  • "The capital of Finland is" → Helsinki + correct factoids ✓
  • "2 + 2 equals"4 with correct reasoning ✓
  • 2-train catch-up word problem → correct CoT set-up, correct answer ✓

Files

Standard HuggingFace format. Each weight that was compressed has two companion tensors:

Original key                                     Added keys
model.layers.N.self_attn.q_proj.weight           ...q_proj.weight.tq_packed (uint8), ...q_proj.weight.tq_norms (float32)
model.layers.N.mlp.experts.E.gate_proj.weight    ...gate_proj.weight.tq_packed, ...gate_proj.weight.tq_norms

Embeddings, layernorms, biases, and the final LM head are stored uncompressed.
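
A quick way to see this layout is to open a shard and filter for the companion suffixes. A sketch; the shard filename is illustrative, check model.safetensors.index.json in the repo for the real file list:

from safetensors import safe_open

with safe_open("model-00001-of-00004.safetensors", framework="np") as f:
    for key in f.keys():
        if key.endswith(".tq_packed") or key.endswith(".tq_norms"):
            print(key, f.get_slice(key).get_shape())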

tq_config.json captures the quantizer state (bits=3, group_size=128, quantizer_seed=42) so loaders can rebuild the identical Walsh-Hadamard rotation + codebook deterministically.
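
Given those fields, a loader only needs the config to re-seed the rotation. A hedged sketch of reading it; only the three documented fields are assumed:

import json
import numpy as np

with open("tq_config.json") as f:
    cfg = json.load(f)                               # e.g. {"bits": 3, "group_size": 128, "quantizer_seed": 42}

rng = np.random.default_rng(cfg["quantizer_seed"])
signs = rng.choice([-1.0, 1.0], cfg["group_size"])   # same seed -> same sign vectors -> identical rotation
levels = 2 ** cfg["bits"]                            # 8-entry Lloyd-Max codebook
print(cfg["bits"], cfg["group_size"], levels)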

Reproduce

git clone https://github.com/varjoranta/turboquant-vllm
cd turboquant-vllm
uv venv --python 3.13 && uv pip install -e . accelerate "transformers>=5.5"
python3 scripts/publish_model.py compress Qwen/Qwen3.6-35B-A3B ./qwen3.6-tq3 --bits 3
python3 scripts/publish_model.py upload ./qwen3.6-tq3 <your-repo>

Compression needs ≥ 100 GB of CPU RAM (the full bf16 model is loaded once). Inference needs none of that; ~20 GB resident is enough.

Citations

If this checkpoint helps your work, please cite both the base model and the HIGGS paper:

@article{qwen2026qwen36,
  title={Qwen3.6-35B-A3B}, author={Qwen Team, Alibaba}, year={2026},
  url={https://huggingface.co/Qwen/Qwen3.6-35B-A3B}
}
@inproceedings{malinovskii2025higgs,
  title={HIGGS: Pushing the Limits of Large Language Model Quantization via
         Hadamard Rotations and MSE-Optimal Grids},
  author={Malinovskii, Vladimir and Mazur, Andrei and Ilin, Ivan and Kuznedelev,
          Denis and Burlachenko, Konstantin and Yi, Kai and Alistarh, Dan and
          Richtarik, Peter},
  booktitle={NAACL}, year={2025},
  url={https://aclanthology.org/2025.naacl-long.543/}
}

License

Inherits Apache-2.0 from the base model.
