Osaurus AI

Holo3 35B-A3B (MXFP4, MLX)

Open Compute Project MXFP4 quantization of H Company's Holo3: a 35 B total / 3 B-active GUI-agent VLM finetuned from Qwen3.5-35B-A3B, with the vision tower preserved.

Website  OsaurusAI  Holo3


Model Details

| Property | Value |
|---|---|
| Base model | Hcompany/Holo3-35B-A3B (finetune of Qwen/Qwen3.5-35B-A3B) |
| Parameters (source) | 35 B total, ≈ 3 B active per token |
| Architecture | qwen3_5_moe: 40 decoder layers, 30 Gated DeltaNet (linear) + 10 full-attention; 256 routed experts + 1 always-on shared expert |
| Quantization | OCP MXFP4 (E2M1 + shared E8M0 scale) at block size 32, with 8-bit affine overrides on routers |
| Package size on disk | 19.32 GB across 5 shards |
| Shipped in this repo | 1,658 tensors total (1,325 language-model MXFP4 + 333 vision-tower bf16) |
| Vocab | 248,320 |
| Context (position embeddings) | 262,144 native |
| Vision tower | 27-layer ViT (hidden 1152, patch 16), preserved in bf16 |
| Chat format | Qwen im_start/im_end with `<think>` reasoning toggle; Holo3 XML tool-call grammar |
| Use case | GUI / computer-use agent (desktop, web, mobile), designed for screenshot → action loops |

Quantization details by tensor category

| Category | Bits | Group | Notes |
|---|---|---|---|
| Routed-expert MLP (mlp.switch_mlp.gate_proj, up_proj, down_proj) | 4 (MXFP4) | 32 | Bulk of parameters |
| Embedding (embed_tokens), lm_head | 4 (MXFP4) | 32 | Default --q-mode mxfp4 applies |
| Linear-attention projections (in_proj_qkv, in_proj_z, in_proj_b, in_proj_a, out_proj) | 4 (MXFP4) | 32 | |
| Full-attention projections (q_proj, k_proj, v_proj, o_proj) | 4 (MXFP4) | 32 | |
| Shared-expert MLP (gate_proj, up_proj, down_proj) | 4 (MXFP4) | 32 | |
| Router (mlp.gate) | 8 (affine) | 64 | 40 per-layer overrides, precision-critical |
| Shared-expert gate (shared_expert_gate) | 8 (affine) | 64 | 40 per-layer overrides |
| Norms (*_layernorm, *_norm), A_log, dt_bias, conv1d | fp16 passthrough | – | Kept unquantized |
| Vision tower | bf16 passthrough | – | 333 tensors |

The quantization map in config.json enumerates all 80 per-layer 8-bit overrides. MXFP4 is the open OCP Microscaling FP4 spec, distinct from NVIDIA's NVFP4; MLX exposes both as separate --q-mode values, and this release uses mxfp4.
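As a minimal sketch of how such a map can be read, assuming the common MLX layout where config["quantization"] holds global settings plus per-tensor override dicts keyed by weight path (the exact field names in this release's config.json are not confirmed here):

```python
import json

# Toy stand-in for a config.json quantization map (assumed layout):
# global mode/bits/group_size, plus per-tensor 8-bit affine overrides.
config_text = """
{
  "quantization": {
    "mode": "mxfp4",
    "bits": 4,
    "group_size": 32,
    "model.layers.0.mlp.gate": {"bits": 8, "group_size": 64},
    "model.layers.0.mlp.shared_expert_gate": {"bits": 8, "group_size": 64}
  }
}
"""
config = json.loads(config_text)

# Collect the per-tensor overrides (dict-valued entries with bits == 8).
overrides = {
    name: spec
    for name, spec in config["quantization"].items()
    if isinstance(spec, dict) and spec.get("bits") == 8
}
print(len(overrides))  # 2 in this toy config; 80 in the real one (40 layers x 2 gates)
```

The same pattern scales to the shipped file: one mlp.gate plus one shared_expert_gate override per decoder layer, 40 layers, 80 entries.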


Architecture notes

  • Hybrid attention stack: 30 of the 40 layers use Gated DeltaNet (a linear-attention / delta-rule hybrid with a grouped conv1d input path and per-head A_log / dt_bias state; memory is constant in sequence length). The other 10 layers (one in every 4, set by full_attention_interval: 4) use full softmax attention with attn_output_gate: true (a sigmoid gate on the attention output before o_proj).
  • Partial rotary embeddings: only the first 25 % of the head dim rotates (partial_rotary_factor: 0.25; head_dim 256 → 64 rotated dims), with rope_theta = 1e7. Position metadata for mixed text/image/video inputs (mrope_section: [11, 11, 10], mrope_interleaved: true) is preserved in config.json.
  • Mixture of Experts: 256 routed experts with 8 activated per token and moe_intermediate_size: 512, plus 1 always-active shared expert (same 512 intermediate size) gated by sigmoid(shared_expert_gate(x)).
  • Router: softmax-topk over the expert logits, not DeepSeek-style sigmoid + e_score_correction_bias.
  • Vision tower: 27-layer ViT with patch_size=16, temporal_patch_size=2, merge_size=2. The preprocessor is Qwen3VLProcessor.
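The routing step above (softmax over expert logits, then top-k) can be sketched as follows; the 256/8 split comes from the card, while the fake logits and the renormalization over the selected experts are illustrative assumptions (whether weights are renormalized depends on the model's norm_topk_prob setting):

```python
import math

NUM_EXPERTS, TOP_K = 256, 8  # from the model card

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

# Fake per-token router logits for illustration.
logits = [((i * 37) % 101) / 101.0 for i in range(NUM_EXPERTS)]
probs = softmax(logits)

# Select the 8 highest-probability experts for this token.
topk = sorted(range(NUM_EXPERTS), key=lambda i: probs[i], reverse=True)[:TOP_K]

# Renormalize the selected weights so they sum to 1 (assumption, see above).
weights = [probs[i] for i in topk]
total = sum(weights)
weights = [w / total for w in weights]
```

The token's output is then the weighted sum of those 8 expert MLPs plus the always-active shared expert, scaled by sigmoid(shared_expert_gate(x)).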

Usage

Requires mlx-lm >= 0.30.7 for the qwen3_5_moe text stack and mlx-vlm >= 0.4.4 for image input. This release was packaged with mlx-lm 0.31.2.

Text-only

from mlx_lm import load, generate

model, tokenizer = load("OsaurusAI/Holo3-35B-A3B-mxfp4")
print(generate(model, tokenizer,
               prompt="The capital of France is",
               max_tokens=32))

Reasoning toggle

Holo3 inherits the Qwen3.5 chat template with a <think> reasoning switch. Pass enable_thinking as a direct kwarg to apply_chat_template.

msgs = [{"role": "user", "content": "What is 17 ร— 23?"}]

# Reasoning OFF โ€” inserts a pre-closed <think></think> block
prompt = tokenizer.apply_chat_template(
    msgs, add_generation_prompt=True, enable_thinking=False
)
# Reasoning ON โ€” model fills the <think> block then answers
prompt = tokenizer.apply_chat_template(
    msgs, add_generation_prompt=True, enable_thinking=True
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))

Vision (image input): the intended use

Holo3 is a GUI agent: give it a screenshot and it localizes UI elements and plans actions.

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

path = "OsaurusAI/Holo3-35B-A3B-mxfp4"
model, processor = load(path)
config = load_config(path)

prompt = apply_chat_template(
    processor, config,
    "Look at this desktop screenshot. Where should I click to open the settings?",
    num_images=1,
)
print(generate(model, processor, prompt, "path/to/screenshot.png",
               max_tokens=256))

Tool calls: Holo3 XML format

Holo3 emits tool calls in a custom XML grammar (not JSON). Pass tools=[...] to the tokenizer's chat template; the model responds in this shape:

<tool_call>
<function=click>
<parameter=x>
512
</parameter>
<parameter=y>
384
</parameter>
</function>
</tool_call>

Parse with a simple XML splitter on <tool_call>. See H Company's quickstart for a full agent harness example.
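A minimal sketch of such a splitter, assuming the shape shown above; note the grammar is XML-like but not well-formed XML (e.g. `<function=click>`), so plain regexes are simpler than an XML parser (parse_tool_calls is a hypothetical helper, not part of any library):

```python
import re

def parse_tool_calls(text):
    """Extract Holo3-style tool calls from generated text."""
    calls = []
    for block in re.findall(r"<tool_call>(.*?)</tool_call>", text, re.S):
        fn = re.search(r"<function=([^>]+)>", block)
        if not fn:
            continue
        # Each parameter is wrapped as <parameter=name> value </parameter>.
        params = {
            name.strip(): value.strip()
            for name, value in re.findall(
                r"<parameter=([^>]+)>(.*?)</parameter>", block, re.S
            )
        }
        calls.append({"name": fn.group(1), "parameters": params})
    return calls

sample = """<tool_call>
<function=click>
<parameter=x>
512
</parameter>
<parameter=y>
384
</parameter>
</function>
</tool_call>"""
print(parse_tool_calls(sample))
# [{'name': 'click', 'parameters': {'x': '512', 'y': '384'}}]
```

Parameter values come back as strings; coerce them (e.g. int(params["x"])) against your own tool schema.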

Video: the base model supports video inference via transformers; this bundle preserves video_preprocessor_config.json plus video_token_id / vision_start_token_id / vision_end_token_id. mlx-vlm 0.4.4 does not yet expose a video path for qwen3_5_moe, so use transformers directly for video.

Audio: not supported (base model has no audio tower).


Hardware notes

19.32 GB of weights on disk; once loaded, expect roughly 19–22 GB resident plus KV cache. Full-attention KV grows with sequence length; linear-attention layers contribute a bounded per-layer SSM state that is independent of context.

| Mac | Works? | Notes |
|---|---|---|
| 24 GB unified | ✅ text, short context | Leave a few GB for KV cache at ≤ 32 k tokens |
| 32 GB unified | ✅ comfortable | Full-attention KV for ~100 k tokens fits within budget |
| 64 GB+ unified | ✅ headroom for long context | 262 k native context viable |
| 16 GB unified | ⚠️ | Text-only OK with tight context; image inference will be tight |
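For a back-of-envelope KV budget over the 10 full-attention layers, a sketch like the following works; head_dim=256 and the layer count come from the card, while num_kv_heads=4 and an fp16 cache are illustrative assumptions, not values confirmed for this release:

```python
def kv_cache_bytes(tokens, full_attn_layers=10, num_kv_heads=4,
                   head_dim=256, bytes_per_elem=2):
    """Rough KV-cache size: K and V tensors (hence the 2x) for each
    full-attention layer; linear-attention layers are excluded because
    their state does not grow with context."""
    return 2 * full_attn_layers * num_kv_heads * head_dim * bytes_per_elem * tokens

gb = kv_cache_bytes(32_000) / 1e9
print(f"{gb:.2f} GB at 32 k tokens")  # 1.31 GB under these assumptions
```

Under these assumptions the full-attention KV stays near a gigabyte at 32 k tokens, which is why the 24 GB row above only asks you to leave "a few GB" of headroom.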

Upstream benchmarks

These are the upstream base-model numbers for Hcompany/Holo3-35B-A3B, not evaluations of this MXFP4 quant:

| Benchmark | Score |
|---|---|
| OSWorld-Verified (computer use) | 77.8 %, SOTA at 3 B active |
| WebArena (web navigation) | State-of-the-art (see upstream card) |
| ScreenSpot-Pro (UI localization) | Top-tier (see upstream card) |
| OSWorld-G (visual grounding) | Top-tier (see upstream card) |
| H Corporate Benchmark (486 enterprise tasks) | Outperforms larger competitors |

Independent MXFP4 evaluation will be added as it is produced.


Citation

@misc{hai2025holo3modelfamily,
      title  = {Holo3 - Open Foundation Models for Navigation and Computer Use Agents},
      author = {H Company},
      year   = {2026},
      url    = {https://huggingface.co/Hcompany/Holo3-35B-A3B}
}

License

Apache 2.0, inherited from the base model.


Packaged on Apple Silicon with mlx-lm 0.31.2 by Jinho Jang (eric@jangq.ai).
© 2026 Osaurus AI (osaurus.ai)
