Holo3 35B-A3B – MXFP4 (MLX)
Open Compute Project MXFP4 quantization of H Company's Holo3 – a 35 B-total / 3 B-active GUI-agent VLM finetuned from Qwen3.5-35B-A3B, with the vision tower preserved.
Model Details
| Property | Value |
|---|---|
| Base model | Hcompany/Holo3-35B-A3B (finetune of Qwen/Qwen3.5-35B-A3B) |
| Parameters (source) | 35 B total, ≈3 B active per token |
| Architecture | qwen3_5_moe – 40 decoder layers: 30 Gated DeltaNet (linear) + 10 full-attention, 256 routed experts + 1 always-on shared expert |
| Quantization | OCP MXFP4 (E2M1 + shared E8M0 scale) at block 32, with 8-bit affine overrides on routers |
| Package size on disk | 19.32 GB across 5 shards |
| Shipped in this repo | 1,658 tensors total (1,325 language-model MXFP4 + 333 vision tower bf16) |
| Vocab | 248,320 |
| Context (position embeddings) | 262,144 native |
| Vision tower | 27-layer ViT (hidden 1152, patch 16), preserved in bf16 |
| Chat format | Qwen `im_start`/`im_end` with `<think>` reasoning toggle; Holo3 XML tool-call grammar |
| Use case | GUI / computer-use agent (desktop, web, mobile) – designed for screenshot → action loops |
Quantization details, per tensor category

| Category | Bits | Group | Notes |
|---|---|---|---|
| Routed-expert MLP (`mlp.switch_mlp.gate_proj`, `up_proj`, `down_proj`) | 4 (MXFP4) | 32 | Bulk of parameters |
| Embedding (`embed_tokens`), `lm_head` | 4 (MXFP4) | 32 | Default `--q-mode mxfp4` applies |
| Linear-attention projections (`in_proj_qkv`, `in_proj_z`, `in_proj_b`, `in_proj_a`, `out_proj`) | 4 (MXFP4) | 32 | |
| Full-attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) | 4 (MXFP4) | 32 | |
| Shared-expert MLP (`gate_proj`, `up_proj`, `down_proj`) | 4 (MXFP4) | 32 | |
| Router (`mlp.gate`) | 8 (affine) | 64 | 40 per-layer overrides, precision-critical |
| Shared-expert gate (`shared_expert_gate`) | 8 (affine) | 64 | 40 per-layer overrides |
| Norms (`*_layernorm`, `*_norm`), `A_log`, `dt_bias`, `conv1d` | fp16 passthrough | – | Kept un-quantized |
| Vision tower | bf16 passthrough | – | 333 tensors |
The quantization map in `config.json` enumerates all 80 per-layer 8-bit overrides. MXFP4 is the open OCP Microscaling FP4 spec, distinct from NVIDIA's NVFP4 – MLX exposes both as separate `--q-mode` values; this release uses `mxfp4`.
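To make the format concrete, here is a minimal decoding sketch of OCP MXFP4: each block of 32 E2M1 codes shares one E8M0 power-of-two scale. This is illustrative only (MLX's kernels implement this internally); the function names are ours.

```python
# E2M1 magnitude table: 1 sign bit, 2 exponent bits, 1 mantissa bit.
# Representable magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_fp4(code: int) -> float:
    """Decode a 4-bit E2M1 code (bit 3 = sign) to a float."""
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * E2M1[code & 0b0111]

def decode_mxfp4_block(codes: list[int], scale_byte: int) -> list[float]:
    """Decode one 32-element MXFP4 block with its shared E8M0 scale.

    E8M0 stores only a biased exponent: scale = 2 ** (scale_byte - 127).
    """
    scale = 2.0 ** (scale_byte - 127)
    return [decode_fp4(c) * scale for c in codes]
```

With only 8 magnitudes per element, the per-block power-of-two scale does all the dynamic-range work, which is why precision-critical tensors like the routers get 8-bit affine overrides instead.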
Architecture notes
- Hybrid attention stack: 30 of 40 layers use Gated DeltaNet (a linear-attention / delta-rule hybrid with a grouped `conv1d` input path and per-head `A_log` / `dt_bias` state – constant memory in sequence length). The other 10 layers (one every 4, given by `full_attention_interval: 4`) use full softmax attention with `attn_output_gate: true` (a sigmoid gate on the attention output before `o_proj`).
- Partial rotary embeddings: only the first 25 % of head dim rotates (`partial_rotary_factor: 0.25`, head_dim 256 → 64 rotated), `rope_theta = 1e7`. Position metadata for mixed text/image/video (`mrope_section: [11, 11, 10]`, `mrope_interleaved: true`) is preserved in `config.json`.
- Mixture of Experts: 256 routed experts, 8 activated per token, `moe_intermediate_size: 512`, plus 1 always-active shared expert (same 512 intermediate) gated by `sigmoid(shared_expert_gate(x))`.
- Router: softmax-topk over expert logits, not DeepSeek-style sigmoid + `e_score_correction_bias`.
- Vision tower: 27-layer ViT, `patch_size=16`, `temporal_patch_size=2`, `merge_size=2`. Preprocessor is `Qwen3VLProcessor`.
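The softmax-topk routing plus sigmoid-gated shared expert can be sketched as follows. This is a hedged illustration of the mechanism, not mlx-lm's API; the function names and the top-k weight renormalization shown here are our assumptions.

```python
import math

def softmax(logits):
    # Numerically stable softmax over the 256 expert logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [v / s for v in exps]

def route(router_logits, shared_gate_logit, top_k=8):
    """Pick top-k experts by router probability; gate the shared expert.

    Returns (expert indices, per-expert mixing weights, shared-expert weight).
    Renormalizing the top-k weights to sum to 1 is an assumption here.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in top)
    weights = [probs[i] / total for i in top]
    # Shared expert is always active, scaled by sigmoid(shared_expert_gate(x)).
    shared_w = 1.0 / (1.0 + math.exp(-shared_gate_logit))
    return top, weights, shared_w
```

At 8 of 256 experts per token this is what keeps the active parameter count near 3 B despite 35 B total.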
Usage
Requires `mlx-lm >= 0.30.7` for the `qwen3_5_moe` text stack and `mlx-vlm >= 0.4.4` for image input. This release was packaged with mlx-lm 0.31.2.
Text-only
```python
from mlx_lm import load, generate

model, tokenizer = load("OsaurusAI/Holo3-35B-A3B-mxfp4")
print(generate(model, tokenizer,
               prompt="The capital of France is",
               max_tokens=32))
```
Reasoning toggle
Holo3 inherits the Qwen3.5 chat template with a `<think>` reasoning switch. Pass `enable_thinking` as a direct kwarg to `apply_chat_template`.
```python
msgs = [{"role": "user", "content": "What is 17 × 23?"}]

# Reasoning OFF – inserts a pre-closed <think></think> block
prompt = tokenizer.apply_chat_template(
    msgs, add_generation_prompt=True, enable_thinking=False
)

# Reasoning ON – model fills the <think> block then answers
prompt = tokenizer.apply_chat_template(
    msgs, add_generation_prompt=True, enable_thinking=True
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```
Vision (image input) – the intended use
Holo3 is a GUI agent: give it a screenshot and it localizes UI elements and plans actions.
```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

path = "OsaurusAI/Holo3-35B-A3B-mxfp4"
model, processor = load(path)
config = load_config(path)

prompt = apply_chat_template(
    processor, config,
    "Look at this desktop screenshot. Where should I click to open the settings?",
    num_images=1,
)
print(generate(model, processor, prompt, "path/to/screenshot.png",
               max_tokens=256))
```
Tool calls – Holo3 XML format
Holo3 emits tool calls in a custom XML grammar (not JSON). Pass `tools=[...]` to the tokenizer's chat template; the model responds in this shape:
```xml
<tool_call>
<function=click>
<parameter=x>
512
</parameter>
<parameter=y>
384
</parameter>
</function>
</tool_call>
```
Parse with a simple XML splitter on `<tool_call>`. See H Company's quickstart for a full agent harness example.
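Such a splitter can be sketched with the standard library's `re` module. The tag names follow the example above; a production harness should also handle malformed or truncated output, which this sketch does not.

```python
import re

def parse_tool_calls(text):
    """Extract Holo3-style XML tool calls into (name, params) pairs."""
    calls = []
    for block in re.findall(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL):
        fn = re.search(r"<function=([^>]+)>", block)
        if not fn:
            continue  # skip blocks without a function tag
        params = {
            name: value.strip()
            for name, value in re.findall(
                r"<parameter=([^>]+)>\s*(.*?)\s*</parameter>", block, re.DOTALL
            )
        }
        calls.append((fn.group(1), params))
    return calls

example = """<tool_call>
<function=click>
<parameter=x>
512
</parameter>
<parameter=y>
384
</parameter>
</function>
</tool_call>"""
# parse_tool_calls(example) -> [("click", {"x": "512", "y": "384"})]
```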
Video: the base model supports video inference via transformers; the bundle preserves `video_preprocessor_config.json` plus `video_token_id` / `vision_start_token_id` / `vision_end_token_id`. mlx-vlm 0.4.4 does not yet expose a video path for `qwen3_5_moe`; use transformers directly for video.
Audio: not supported (base model has no audio tower).
Hardware notes
19.32 GB weights on disk; once loaded, expect ~19–22 GB resident plus KV cache. Full-attention KV grows with sequence length; linear-attention layers contribute a bounded per-layer SSM state (independent of context).
| Mac | Works? | Notes |
|---|---|---|
| 24 GB unified | ✅ text, short context | Leave a few GB for KV cache at ≤ 32 k tokens |
| 32 GB unified | ✅ comfortable | Full-attn KV for ~100 k tokens fits within budget |
| 64 GB+ unified | ✅ headroom for long context | 262 k native viable |
| 16 GB unified | ⚠️ | Text-only OK with tight context; image inference will be tight |
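The full-attention KV growth can be estimated with a back-of-envelope calculation. The layer count (10 full-attention layers) and `head_dim` (256) come from this card; the KV head count and fp16 cache dtype below are illustrative assumptions, not values from the config.

```python
def kv_cache_bytes(seq_len, full_attn_layers=10, num_kv_heads=4,
                   head_dim=256, bytes_per_elem=2):
    """Rough KV-cache size for the full-attention layers only.

    num_kv_heads=4 and fp16 (2-byte) cache are ASSUMPTIONS for
    illustration; linear-attention layers add a bounded state on top.
    """
    # 2x for keys and values
    return 2 * full_attn_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

gib_at_100k = kv_cache_bytes(100_000) / 2**30  # a few GiB under these assumptions
```

Under these assumptions a ~100 k-token context adds roughly 4 GiB of KV on top of the weights, consistent with the 32 GB row above.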
Upstream benchmarks
These are the upstream base-model numbers for Hcompany/Holo3-35B-A3B, not evaluations of this MXFP4 quant:
| Benchmark | Score |
|---|---|
| OSWorld-Verified (computer use) | 77.8 % – SOTA at 3 B active |
| WebArena (web navigation) | State-of-the-art (see upstream card) |
| ScreenSpot-Pro (UI localization) | Top-tier (see upstream card) |
| OSWorld-G (visual grounding) | Top-tier (see upstream card) |
| H Corporate Benchmark (486 enterprise tasks) | Outperforms larger competitors |
Independent MXFP4 evaluation will be added as it is produced.
Citation
```bibtex
@misc{hai2025holo3modelfamily,
  title  = {Holo3 - Open Foundation Models for Navigation and Computer Use Agents},
  author = {H Company},
  year   = {2026},
  url    = {https://huggingface.co/Hcompany/Holo3-35B-A3B}
}
```
License
Apache 2.0 – inherits from the base model.
Packaged on Apple Silicon with mlx-lm 0.31.2 by Jinho Jang (eric@jangq.ai).
© 2026 Osaurus AI – osaurus.ai