Osaurus AI

Qwen 3.6 35B-A3B — MXFP4 (MLX)

Open Compute Project MXFP4 quantization of Alibaba's hybrid linear/full attention agentic MoE, with the vision tower preserved.

Website: osaurus.ai

Model Details

| Property | Value |
| --- | --- |
| Base model | Qwen/Qwen3.6-35B-A3B |
| Parameters (source) | 35 B total, ≈ 3 B active per token |
| Architecture | qwen3_5_moe — 40 decoder layers: 30 Gated DeltaNet (linear) + 10 full-attention; 256 routed experts + 1 always-on shared expert |
| Quantization | OCP MXFP4 (E2M1 + shared E8M0 scale) at block 32, with 8-bit affine overrides on routers |
| Package size on disk | 19.32 GB across 5 shards |
| Shipped in this repo | 1,658 tensors total (1,325 language-model + 333 vision tower) |
| Vocab | 248,320 |
| Context (position embeddings) | 262,144 native; the upstream model card reports up to ~1 M with YaRN scaling |
| Vision tower | 27-layer ViT (hidden 1152, patch 16), preserved in fp16 |
| Chat format | Qwen im_start/im_end, unified thinking toggle |
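
A quick sizing sanity check (simple arithmetic, not a measured figure): MXFP4 at block 32 costs about 4.25 effective bits per weight (a 4-bit E2M1 code plus one shared 8-bit E8M0 scale amortized over 32 elements), which lands near the shipped 19.32 GB once the fp16 vision tower, norms, and 8-bit router overrides are added on top:

```python
params = 35e9                          # total parameters (routed experts dominate)
mxfp4_bits = 4 + 8 / 32                # element bits + shared scale per 32-element block
print(params * mxfp4_bits / 8 / 1e9)   # ≈ 18.6 GB for the 4-bit bulk alone
```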

Quantization details, per tensor category

| Category | Bits | Group | Notes |
| --- | --- | --- | --- |
| Routed-expert MLP (mlp.experts.gate_up_proj, down_proj) | 4 (MXFP4) | 32 | Bulk of parameters |
| Embedding (embed_tokens), lm_head | 4 (MXFP4) | 32 | Default --q-mode mxfp4 applies |
| Linear-attention projections (in_proj_qkv, in_proj_z, in_proj_b, in_proj_a, out_proj) | 4 (MXFP4) | 32 | |
| Full-attention projections (q_proj, k_proj, v_proj, o_proj) | 4 (MXFP4) | 32 | |
| Shared-expert MLP (gate_proj, up_proj, down_proj) | 4 (MXFP4) | 32 | |
| Router (mlp.gate) | 8 (affine) | 64 | 40 per-layer overrides, precision-critical |
| Shared-expert gate (shared_expert_gate) | 8 (affine) | 64 | 40 per-layer overrides |
| Norms (*_layernorm, *_norm), A_log, dt_bias, conv1d | fp16 | — | Passthrough, kept un-quantized |
| Vision tower | fp16 | — | Passthrough; 333 tensors, patch-embed axes pre-transposed to MLX layout |

The quantization map in config.json enumerates all 80 per-layer 8-bit overrides. MXFP4 is the open OCP Microscaling FP4 spec, distinct from NVIDIA's NVFP4 — MLX exposes both as separate --q-mode values; this release uses mxfp4.
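
To make the format concrete, here is a minimal decode sketch for one MXFP4 block per the OCP MX spec (E2M1 element codes, one shared E8M0 power-of-two scale per 32 elements). This is illustrative only, not the MLX kernel:

```python
import numpy as np

# FP4 E2M1 magnitudes for codes 0..7 (bit 3 of each nibble is the sign)
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def decode_mxfp4_block(codes: np.ndarray, scale_byte: int) -> np.ndarray:
    """Decode one 32-element MXFP4 block.

    codes      -- 32 uint8 values, each carrying a 4-bit E2M1 code
    scale_byte -- shared E8M0 scale; its value is 2 ** (scale_byte - 127)
    """
    sign = np.where(codes & 0b1000, -1.0, 1.0)
    return sign * E2M1[codes & 0b0111] * 2.0 ** (int(scale_byte) - 127)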


Architecture notes — what's new vs Qwen 3 / 3.5

- Hybrid attention stack: 30 of 40 layers use Gated DeltaNet, a linear-attention / delta-rule hybrid with a grouped conv1d input path and per-head A_log / dt_bias state — constant memory in sequence length. The other 10 layers (one every 4, given by full_attention_interval: 4) use full softmax attention with attn_output_gate: true (a sigmoid gate on the attention output before o_proj).
- Partial rotary embeddings: only the first 25% of the head dim rotates (partial_rotary_factor: 0.25, head_dim 256 → 64 rotated dims), rope_theta = 1e7. Position metadata for mixed text/image/video (mrope_section: [11, 11, 10], mrope_interleaved: true) is preserved in config.json.
- Mixture of Experts: 256 routed experts, 8 activated per token, moe_intermediate_size: 512, plus one always-active shared expert (same 512 intermediate) gated by sigmoid(shared_expert_gate(x)).
- Router: softmax-topk over expert logits (sketched below) — not DeepSeek-style sigmoid + e_score_correction_bias.
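
A shape-level sketch of how one token flows through a routed-plus-shared MoE layer under this scheme. Illustrative only: expert internals are opaque callables here, and details such as top-k weight renormalization are simplified:

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def moe_layer(x, router_w, experts, shared_expert, shared_gate_w, top_k=8):
    """One token through routed + shared experts (illustrative).

    x             -- (d,) hidden state
    router_w      -- (num_experts, d) mlp.gate weights
    experts       -- list of callables, experts[i](x) -> (d,)
    shared_expert -- callable, shared_expert(x) -> (d,)
    shared_gate_w -- (d,) shared_expert_gate weights
    """
    probs = softmax(x @ router_w.T)        # softmax over expert logits...
    top = np.argsort(probs)[-top_k:]       # ...then top-8 (softmax-topk)
    routed = sum(probs[i] * experts[i](x) for i in top)
    gate = 1.0 / (1.0 + np.exp(-(x @ shared_gate_w)))  # sigmoid shared gate
    return routed + gate * shared_expert(x)
```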

Usage

Requires mlx-lm >= 0.30.7 for the qwen3_5_moe text stack and mlx-vlm >= 0.4.4 for image input. This release was packaged with mlx-lm 0.31.2.

Text-only

```bash
pip install 'mlx-lm>=0.30.7'
```

```python
from mlx_lm import load, generate

model, tokenizer = load("OsaurusAI/Qwen3.6-35B-A3B-mxfp4")
print(generate(model, tokenizer,
               prompt="The capital of France is",
               max_tokens=32))
```

CLI

```bash
mlx_lm.generate --model OsaurusAI/Qwen3.6-35B-A3B-mxfp4 \
                --prompt "Write a haiku about Apple Silicon" \
                --max-tokens 100
```

Reasoning toggle

Qwen 3.6 is a single model with a chat-template-driven thinking switch. Pass enable_thinking as a direct kwarg to apply_chat_template — the chat_template_kwargs dict form only propagates on some tokenizer versions.

```python
# model, tokenizer loaded as in the text-only example above
msgs = [{"role": "user", "content": "What is 17 × 23?"}]

# Reasoning OFF — inserts a pre-closed <think></think> block
prompt = tokenizer.apply_chat_template(
    msgs, add_generation_prompt=True, enable_thinking=False
)

# Reasoning ON — model fills the <think> block then answers
prompt = tokenizer.apply_chat_template(
    msgs, add_generation_prompt=True, enable_thinking=True
)

print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```

Vision (image input)

```bash
pip install 'mlx-vlm>=0.4.4'
```

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

path = "OsaurusAI/Qwen3.6-35B-A3B-mxfp4"
model, processor = load(path)
config = load_config(path)

prompt = apply_chat_template(processor, config,
                             "Describe this image.", num_images=1)
print(generate(model, processor, prompt, "path/to/image.jpg",
               max_tokens=200))
```

Verified: text, reasoning on/off, single-image VL all produce coherent output on this quant.

Video: the base model supports video inference via transformers, and the bundle preserves video_preprocessor_config.json plus video_token_id / vision_start_token_id / vision_end_token_id. However, mlx-vlm 0.4.4 does not expose a video path for qwen3_5_moe — use transformers directly for video.

Audio: the base model is not an Omni variant — no audio path.


Hardware notes

19.32 GB weights on disk; once loaded, expect ~19–22 GB resident plus KV cache. Full-attention KV grows with sequence length; linear-attention layers contribute a bounded per-layer SSM state (independent of context).
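
A rough full-attention KV-cache estimate, as a sketch: the card gives 10 full-attention layers and head_dim 256, but does not state the KV-head count, so num_kv_heads below is a placeholder that should come from config.json (num_key_value_heads):

```python
def full_attn_kv_bytes(seq_len, num_kv_heads, n_full_layers=10,
                       head_dim=256, bytes_per_elem=2):
    """K and V caches across the 10 full-attention layers (fp16 assumed).

    num_kv_heads is not listed on this card -- read it from config.json.
    Linear-attention layers keep a fixed-size state and are excluded.
    """
    return 2 * n_full_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# e.g. with a hypothetical 4 KV heads at 32k context:
print(full_attn_kv_bytes(32_768, num_kv_heads=4) / 2**30)  # ≈ 1.25 GiB
```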

| Mac | Works? | Notes |
| --- | --- | --- |
| 24 GB unified | ✅ text, short context | Leave a few GB for KV cache at ≤ 32 k tokens |
| 32 GB unified | ✅ comfortable | Full-attn KV for ~100 k tokens fits within budget |
| 64 GB+ unified | ✅ headroom for long context | 262 k native viable; YaRN past that is possible but pushes KV into tens of GB |
| 16 GB unified | ⚠️ | Text-only OK with tight context; image inference will be tight |

Upstream benchmarks

These are the upstream base-model numbers (Qwen/Qwen3.6-35B-A3B), not evaluations of this MXFP4 quant:

| Benchmark | Score |
| --- | --- |
| MMLU-Pro | 85.2 |
| AIME 2026 | 92.7 |
| LiveCodeBench v6 | 80.4 |
| GPQA | 86.0 |
| SWE-bench Verified | 73.4 |

Independent MXFP4 evaluation will be added here as it is produced.


Citation

```bibtex
@misc{qwen2026qwen36,
  title  = {Qwen3.6-Plus: Towards Real World Agents},
  author = {Qwen Team},
  year   = {2026},
  url    = {https://qwen.ai/blog?id=qwen3.6}
}
```

License

Apache 2.0 — inherits from the base model.


Packaged on Apple Silicon with mlx-lm 0.31.2.
© 2026 Osaurus AI — osaurus.ai
