arXiv:2602.05711

OmniMoE: An Efficient MoE by Orchestrating Atomic Experts at Scale

Published on Feb 5 · Submitted by Loser Cheems on Feb 9

Abstract

OmniMoE presents a system-algorithm co-designed framework that achieves fine-grained expert specialization in Mixture-of-Experts architectures through vector-level atomic experts and optimized routing and scheduling mechanisms.

AI-generated summary

Mixture-of-Experts (MoE) architectures are evolving towards finer granularity to improve parameter efficiency. However, existing MoE designs face an inherent trade-off between the granularity of expert specialization and hardware execution efficiency. We propose OmniMoE, a system-algorithm co-designed framework that pushes expert granularity to its logical extreme. OmniMoE introduces vector-level Atomic Experts, enabling scalable routing and execution within a single MoE layer, while retaining a shared dense MLP branch for general-purpose processing. Although this atomic design maximizes capacity, it poses severe challenges for routing complexity and memory access. To address these, OmniMoE adopts a system-algorithm co-design: (i) a Cartesian Product Router that decomposes the massive index space, reducing routing complexity from O(N) to O(sqrt(N)); and (ii) Expert-Centric Scheduling that inverts the execution order to turn scattered, memory-bound lookups into efficient dense matrix operations. OmniMoE (with 1.7B active parameters) achieves 50.9% zero-shot accuracy across seven benchmarks, outperforming coarse-grained (e.g., DeepSeekMoE) and fine-grained (e.g., PEER) baselines. Crucially, OmniMoE reduces inference latency from 73 ms to 6.7 ms (a 10.9-fold speedup) compared to PEER, demonstrating that massive-scale fine-grained MoE can be both fast and accurate. Our code is open-sourced at https://github.com/flash-algo/omni-moe.
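
To make the O(N)-to-O(sqrt(N)) routing claim concrete, here is a minimal sketch of one way a Cartesian Product Router could work, assuming a product-key-style decomposition in which N = n x n atomic experts are indexed by pairs drawn from two sub-key tables of size n = sqrt(N). The function names, shapes, and the softmax over combined scores are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a Cartesian-product (product-key) router.
# Assumes N = n * n atomic experts indexed by pairs (i, j); all names,
# shapes, and activation choices are illustrative, not from the paper.
import torch
import torch.nn.functional as F

def cartesian_product_route(x, key_a, key_b, top_k=8):
    """Route each token to top_k of N = n*n atomic experts.

    x:     (batch, d_model)   token representations
    key_a: (n, d_model // 2)  sub-keys for the first index axis
    key_b: (n, d_model // 2)  sub-keys for the second index axis
    """
    d_half = x.shape[-1] // 2
    q_a, q_b = x[..., :d_half], x[..., d_half:]

    # Score each query half against only sqrt(N) sub-keys: O(sqrt(N)).
    s_a = q_a @ key_a.t()                     # (batch, n)
    s_b = q_b @ key_b.t()                     # (batch, n)

    # Keep a small candidate set per axis, then combine them.
    top_a = s_a.topk(top_k, dim=-1)
    top_b = s_b.topk(top_k, dim=-1)

    # Cartesian product of the two candidate sets: (batch, top_k, top_k).
    combined = top_a.values.unsqueeze(-1) + top_b.values.unsqueeze(-2)
    best = combined.flatten(1).topk(top_k, dim=-1)

    # Recover the (i, j) pair, then the flat expert id i * n + j.
    ia = torch.div(best.indices, top_k, rounding_mode="floor")
    ib = best.indices % top_k
    i = top_a.indices.gather(1, ia)
    j = top_b.indices.gather(1, ib)
    expert_ids = i * key_a.shape[0] + j       # (batch, top_k)
    weights = F.softmax(best.values, dim=-1)  # routing weights
    return expert_ids, weights
```

Because each half of the token representation is scored against only sqrt(N) sub-keys, the routing cost grows with the square root of the expert count, which is what the O(sqrt(N)) claim in the summary refers to.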

Community

Paper author · Paper submitter

Hi everyone,

We're excited to share our new paper, OmniMoE: An Efficient MoE by Orchestrating Atomic Experts at Scale!

Mixture-of-Experts (MoE) models often face a tough trade-off between expert granularity and hardware efficiency. In this work, we push expert granularity to its logical extreme with vector-level Atomic Experts. Our system-algorithm co-design, featuring a Cartesian Product Router and Expert-Centric Scheduling, makes it possible to manage this massive expert space efficiently.
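
For readers wondering what "inverting the execution order" could look like in practice, below is a hedged sketch of expert-centric scheduling for vector-level experts, assuming each atomic expert is a pair of vectors (u_i, v_i) in the spirit of PEER-style singleton experts. The variable names, the ReLU activation, and the batching strategy are assumptions for illustration; the kernels in the repository may differ.

```python
# Hypothetical sketch of expert-centric scheduling for vector-level experts.
# Token-centric execution gathers k expert vectors per token (memory bound);
# here we instead gather the union of selected experts once and run two
# dense GEMMs. All names and the ReLU choice are illustrative assumptions.
import torch

def expert_centric_forward(x, expert_ids, weights, u, v):
    """
    x:          (batch, d_model)  token representations
    expert_ids: (batch, top_k)    flat atomic-expert indices from the router
    weights:    (batch, top_k)    routing weights
    u, v:       (N, d_model)      per-expert input/output vectors
    """
    # Union of experts actually selected in this batch.
    uniq, inv = torch.unique(expert_ids, return_inverse=True)  # inv: (batch, top_k)

    # One dense slice per weight table instead of scattered per-token lookups.
    u_sel = u[uniq]                      # (E, d_model)
    v_sel = v[uniq]                      # (E, d_model)
    scores = torch.relu(x @ u_sel.t())   # (batch, E), dense GEMM

    # Per-token mask that places each routing weight at its expert's slot.
    mask = torch.zeros_like(scores)      # (batch, E)
    mask.scatter_add_(1, inv, weights)

    # Second dense GEMM projects the weighted expert activations back.
    return (scores * mask) @ v_sel       # (batch, d_model)
```

The point of the inversion is that the scattered, memory-bound per-token lookups collapse into two dense matrix multiplications over the union of selected experts, which matches the "dense matrix operations" framing in the abstract.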

The result is a model that is both highly specialized and incredibly fast. OmniMoE achieves a 10.9x inference speedup over strong fine-grained MoE baselines like PEER, while also outperforming them in zero-shot accuracy.

We believe this work shows that massive-scale, fine-grained MoE can be both fast and accurate, opening up new possibilities for efficient and powerful models.

We'd love to hear your feedback and answer any questions you might have!
