Marco-Mini-Instruct

Marco-Mini-Instruct is the instruction-tuned variant of Marco-Mini-Base, a highly sparse Mixture-of-Experts (MoE) multilingual language model from the Marco-MoE family, developed by Alibaba International Digital Commerce. It activates only 0.86B out of 17.3B total parameters (5% activation ratio) per token. Marco-Mini-Instruct achieves the best average performance across English, multilingual general, and multilingual cultural benchmarks when compared against instruct models with up to 12B activated parameters, including Qwen3-4B-Instruct, Ministral3-8B-Instruct, Gemma3-12B-Instruct, LFM2-24B-A2B, and Granite4-Small-Instruct.

Model Description

Marco-Mini-Instruct shares the same architecture as Marco-Mini-Base: a decoder-only Transformer with sparse MoE layers replacing standard FFN layers, upcycled from Qwen3-0.6B-Base using fine-grained sub-matrix splitting combined with Drop-Upcycling.

Configuration	Value
Total Parameters	17.3B
Activated Parameters	0.86B
Activation Ratio	5%
Num Layers	28
Model Dimension	1024
FFN Intermediate Dimension	3072
Q-Heads	16
KV-Heads	8
Head Dimension	128
Expert Dimension	768
Total Experts	256
Activated Experts	8
Tie Embeddings	True
Training FLOPs	$1.56 \times 10^{23}$
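
The total and activated parameter counts can be roughly reproduced from the table. The sketch below is a back-of-the-envelope estimate, not the authors' accounting. It assumes SwiGLU-style experts with three weight matrices each, one linear router per layer, no bias terms, negligible normalization weights, and a vocabulary of 151,936 tokens inherited from the Qwen3 tokenizer; none of these details are stated in this card.

```python
# Rough parameter count derived from the configuration table.
NUM_LAYERS = 28
D_MODEL = 1024
Q_HEADS, KV_HEADS, HEAD_DIM = 16, 8, 128
EXPERT_DIM = 768
TOTAL_EXPERTS, ACTIVE_EXPERTS = 256, 8
VOCAB_SIZE = 151_936  # assumption: Qwen3 tokenizer vocabulary size


def attention_params() -> int:
    """Q, K, V and output projections of one grouped-query attention layer."""
    q = D_MODEL * Q_HEADS * HEAD_DIM
    k = D_MODEL * KV_HEADS * HEAD_DIM
    v = D_MODEL * KV_HEADS * HEAD_DIM
    o = Q_HEADS * HEAD_DIM * D_MODEL
    return q + k + v + o


def expert_params() -> int:
    """Gate, up and down projections of one expert (assumed SwiGLU)."""
    return 3 * D_MODEL * EXPERT_DIM


def model_params(experts_per_layer: int) -> int:
    """Parameters touched when `experts_per_layer` experts are counted."""
    router = D_MODEL * TOTAL_EXPERTS  # the router always scores all experts
    per_layer = attention_params() + router + experts_per_layer * expert_params()
    embeddings = VOCAB_SIZE * D_MODEL  # counted once because embeddings are tied
    return NUM_LAYERS * per_layer + embeddings


total = model_params(TOTAL_EXPERTS)
active = model_params(ACTIVE_EXPERTS)
print(f"total  ~ {total / 1e9:.2f}B")
print(f"active ~ {active / 1e9:.2f}B")
print(f"ratio  ~ {active / total:.1%}")
```

Under these assumptions the estimate lands within about 1% of the reported 17.3B total and 0.86B activated parameters. It also shows that the experts account for nearly all of the total, while attention and embeddings make up a sizeable share of the activated count.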

Post-Training Details

Marco-Mini-Instruct is trained from Marco-Mini-Base using a two-stage post-training pipeline implemented with the SLIME framework:

Stage 1: Supervised Fine-Tuning (SFT)

Duration: ~24 hours on 64 GPUs
Steps: ~4,000 (1 epoch)
Learning rate: 1e-5 with cosine decay to 1e-6
Batch size: 512, context length 8,192 tokens
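
As an illustration of the schedule above, the following sketch computes a cosine decay from 1e-5 to 1e-6. It assumes no warmup phase and that the decay spans all ~4,000 steps; the card states neither. The function name `sft_learning_rate` is hypothetical.

```python
import math

PEAK_LR = 1e-5
FINAL_LR = 1e-6
TOTAL_STEPS = 4_000  # approximate step count from the card


def sft_learning_rate(step: int) -> float:
    """Cosine decay from PEAK_LR at step 0 to FINAL_LR at TOTAL_STEPS."""
    progress = min(max(step, 0), TOTAL_STEPS) / TOTAL_STEPS
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine


# Upper bound on tokens per optimizer step if every sequence fills the context.
BATCH_SIZE = 512
CONTEXT_LENGTH = 8_192
max_tokens_per_step = BATCH_SIZE * CONTEXT_LENGTH

for step in (0, 2_000, 4_000):
    print(step, f"{sft_learning_rate(step):.2e}")
print("max tokens per step:", max_tokens_per_step)
```

The per-step token figure is only an upper bound, since real SFT batches rarely fill every sequence to the full 8,192 tokens.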

Data sources:

General instructions — Dolci-Instruct dataset, augmented with Nemotron-Cascade-2 data
Knowledge-intensive data — Scientific prompts from Nemotron-Cascade-2, responses distilled from Gemini3-Flash
Translation data — Web-mined NLLB translation pairs, filtered and scored with Qwen3-Embedding-8B (top 10K per language)
Multilingual & cultural data — Wikidata-sourced content with Gemini3-Flash text synthesis for cultural concepts
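
The "top 10K per language" selection for translation data can be sketched as a per-language top-k filter over scored pairs. The card does not say how Qwen3-Embedding-8B scores are computed, so the `score` field below is a stand-in (for example, a source-target embedding similarity). The record keys and the function name are hypothetical, and the toy scores are made up.

```python
import heapq
from collections import defaultdict


def top_k_per_language(pairs, k=10_000):
    """Keep the k highest-scoring translation pairs for each language.

    Each pair is a dict with hypothetical keys "lang", "src", "tgt", "score".
    """
    by_language = defaultdict(list)
    for pair in pairs:
        by_language[pair["lang"]].append(pair)
    return {
        lang: heapq.nlargest(k, items, key=lambda p: p["score"])
        for lang, items in by_language.items()
    }


# Toy data with made-up scores, only to show the call shape.
toy_pairs = [
    {"lang": "de", "src": "Good morning", "tgt": "Guten Morgen", "score": 0.93},
    {"lang": "de", "src": "Good morning", "tgt": "Gute Nacht", "score": 0.41},
    {"lang": "tr", "src": "Thank you", "tgt": "Teşekkürler", "score": 0.90},
]
selected = top_k_per_language(toy_pairs, k=1)
print({lang: [p["tgt"] for p in items] for lang, items in selected.items()})
```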

Stage 2: On-Policy Distillation (OPD)

Duration: ~110 hours on 64 GPUs
Steps: ~3,800 total (2 responses sampled per prompt)
Learning rate: 1e-6 (constant)

Cascaded distillation:

~1,900 steps with Qwen3-30B-A3B-Instruct as teacher
~1,900 steps with Qwen3-Next-80B-A3B-Instruct as a stronger teacher
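
The cascade can be pictured as a step-indexed teacher schedule. The sketch below also includes a per-token reverse KL term, which is a common objective for on-policy distillation; the card does not state the loss actually used, so treat `reverse_kl` as an assumption. All function names are hypothetical, and the probabilities are toy values.

```python
import math

# (exclusive end step, teacher name), following the two phases listed above.
TEACHER_SCHEDULE = [
    (1_900, "Qwen3-30B-A3B-Instruct"),
    (3_800, "Qwen3-Next-80B-A3B-Instruct"),
]


def teacher_for_step(step: int) -> str:
    """Return the teacher assumed to be active at a given OPD step."""
    for end_step, name in TEACHER_SCHEDULE:
        if step < end_step:
            return name
    return TEACHER_SCHEDULE[-1][1]


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) over the vocabulary at one token position."""
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
        if s > 0
    )


# Toy next-token distributions over a 3-token vocabulary.
student = [0.6, 0.3, 0.1]
teacher = [0.8, 0.15, 0.05]
print(teacher_for_step(100), f"{reverse_kl(student, teacher):.4f}")
```

In an on-policy setup the term would be evaluated on tokens sampled from the student itself (two responses per prompt here), so the student is corrected on its own outputs rather than on teacher-written text.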

OPD data mixture:

Category	Datasets	Ratio
Instruction Following	Nemotron-RL-instruction-following + structured outputs	25%
Knowledge & Reasoning	Nemotron-RL-ReasoningGym-v1 + knowledge-mcqa	25%
Alignment	Nemotron-Cascade-RL-RLHF	10%
Math	DAPO-Math-17k + Skywork-OR1-RL-Data	10%
Multilingual	Translation + Cultural + Nemotron-SFT-Multilingual-v1	30%
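
To make the mixture concrete, the sketch below splits a batch of prompts across the five categories according to the ratios in the table. The OPD batch size is not given in this card, so the 200 prompts used here are purely illustrative, and the function name is hypothetical.

```python
# Mixture ratios from the table, as integer percentages.
OPD_MIXTURE_PERCENT = {
    "Instruction Following": 25,
    "Knowledge & Reasoning": 25,
    "Alignment": 10,
    "Math": 10,
    "Multilingual": 30,
}


def prompts_per_category(num_prompts: int) -> dict:
    """Split num_prompts by ratio, giving leftovers to the largest remainders."""
    counts = {
        name: num_prompts * pct // 100 for name, pct in OPD_MIXTURE_PERCENT.items()
    }
    leftover = num_prompts - sum(counts.values())
    by_remainder = sorted(
        OPD_MIXTURE_PERCENT,
        key=lambda name: (num_prompts * OPD_MIXTURE_PERCENT[name]) % 100,
        reverse=True,
    )
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


allocation = prompts_per_category(200)  # 200 is an illustrative batch size
print(allocation)
```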

Supported Languages

English, Chinese, Arabic, German, Spanish, French, Korean, Japanese, Portuguese, Turkish, Indonesian, Italian, Dutch, Polish, Russian, Vietnamese, Thai, Hebrew, Ukrainian, Malay, Bengali, Czech, Urdu, Kazakh, Greek, Romanian, Hungarian, Nepali, Azerbaijani

Evaluation

We compare Marco-Mini-Instruct against strong instruct baselines: Qwen3-4B-Instruct (4B activated), Ministral3-8B-Instruct (8.8B activated), Gemma3-12B-Instruct (12B activated), Granite4-Small-Instruct (9B activated), and LFM2-24B-A2B (2B activated). Marco-Mini-Instruct uses only 0.86B activated parameters. Avg@8 accuracies are reported, except for GlobalMMLU and MMMLU where Acc@1 is reported.
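
The card does not define Avg@8. The sketch below assumes the common reading: sample 8 responses per problem, score each as correct or incorrect, take the fraction correct per problem, then average over problems. The function name and the toy grades are hypothetical.

```python
def avg_at_k(correct_by_problem):
    """Mean accuracy (in percent) over k sampled responses per problem.

    correct_by_problem: one list per problem, holding a 0/1 correctness
    flag for each sampled response.
    """
    per_problem = [sum(samples) / len(samples) for samples in correct_by_problem]
    return 100.0 * sum(per_problem) / len(per_problem)


# Toy grades for two problems with 8 samples each.
toy_grades = [
    [1, 1, 1, 1, 1, 1, 0, 0],  # 6 of 8 correct
    [1, 0, 0, 0, 0, 0, 0, 0],  # 1 of 8 correct
]
print(avg_at_k(toy_grades))
```

Under this reading, Acc@1 is the same computation with a single response per problem.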

English

Benchmark	Qwen3-4B	Ministral3-8B	Gemma3-12B	Granite4-Small	LFM2-24B-A2B	Marco-Mini
MMLU (Acc)	80.8	79.8	76.2	76.7	74.9	83.4
MMLU-Redux (Acc)	80.9	79.9	76.2	76.7	74.9	83.5
MMLU-Pro (Acc)	66.9	63.9	55.8	57.1	57.6	70.7
AGIEval (Acc)	51.7	52.4	43.6	44.7	49.0	55.4
GPQA-Diamond (Acc)	50.8	44.8	35.2	38.6	39.7	50.3
GSM8K (EM)	88.6	89.5	89.7	83.9	87.2	93.1
MATH (EM)	93.4	86.2	83.8	75.7	83.9	91.8
Average	73.3	70.9	65.8	64.8	66.7	75.5
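
The Average row appears to be the unweighted mean of the benchmark rows above it. This is inferred from the numbers rather than stated in the card. A quick check for two of the columns:

```python
from statistics import mean

# English benchmark scores copied from the table, in row order:
# MMLU, MMLU-Redux, MMLU-Pro, AGIEval, GPQA-Diamond, GSM8K, MATH.
english_scores = {
    "Qwen3-4B": [80.8, 80.9, 66.9, 51.7, 50.8, 88.6, 93.4],
    "Marco-Mini": [83.4, 83.5, 70.7, 55.4, 50.3, 93.1, 91.8],
}

english_averages = {
    model: round(mean(scores), 1) for model, scores in english_scores.items()
}
print(english_averages)
```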

Multilingual — General

Benchmark	Qwen3-4B	Ministral3-8B	Gemma3-12B	Granite4-Small	LFM2-24B-A2B	Marco-Mini
GlobalMMLU (Acc)	70.2	55.4	69.2	67.4	57.0	73.3
MMMLU (Acc)	71.3	56.4	69.4	68.1	62.3	73.7
MMLU-ProX-Lite (Acc)	58.3	43.3	51.3	51.6	43.3	61.2
MGPQA (Acc)	41.0	30.5	32.8	35.0	32.7	41.8
FLORES-200 En→Xx (BLEU)	22.1	17.5	35.6	31.9	19.2	30.6
FLORES-200 Xx→En (BLEU)	33.5	31.0	40.3	32.2	22.7	36.8
WMT24++ En→Xx (BLEU)	20.9	14.4	32.1	26.6	16.0	26.8
WMT24++ Xx→En (BLEU)	29.9	24.2	35.5	27.5	18.8	31.3
MGSM (EM)	84.4	68.7	84.0	75.7	67.8	87.4
PolyMath (EM)	47.2	26.4	35.5	28.9	29.3	44.7
Average	47.9	36.8	48.6	44.5	36.9	50.8

Multilingual — Cultural & Regional

Benchmark	Qwen3-4B	Ministral3-8B	Gemma3-12B	Granite4-Small	LFM2-24B-A2B	Marco-Mini
INCLUDE (Acc)	63.8	50.7	65.0	60.3	49.1	65.6
Global-PIQA (Acc)	79.6	61.3	82.2	80.2	69.0	84.2
CMMLU (Acc)	78.6	67.4	60.8	59.6	56.7	75.3
C-Eval (Acc)	80.4	68.0	59.7	59.4	56.7	75.4
ArabicMMLU (Acc)	66.0	41.4	70.1	66.3	61.3	67.8
TurkishMMLU (Acc)	71.6	48.2	64.4	57.9	33.4	74.7
GreekMMLU (Acc)	68.6	49.5	77.7	71.7	44.7	72.5
KazakhMMLU (Acc)	66.6	59.1	66.8	63.5	47.6	68.8
IndoMMLU (Acc)	64.4	52.4	65.3	59.6	42.7	65.7
IndoCareer (Acc)	62.2	53.4	63.2	56.3	43.7	64.4
IndoCulture (Acc)	58.7	47.8	69.6	59.3	44.2	67.1
Average	69.1	54.5	67.7	63.1	49.9	71.0

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AIDC-AI/Marco-Mini-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [
    {"role": "user", "content": "What is the capital of France?"}
]
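# Request a dict so input_ids and attention_mask are both passed to generate().
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))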
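
Before loading the model, it can help to estimate the memory needed for the weights. Only 0.86B parameters are activated per token, but all 17.3B must be resident (or offloaded), because any expert may be selected. The sketch below covers BF16 weights only; it ignores the KV cache, activations, and framework overhead.

```python
TOTAL_PARAMS = 17.3e9      # total parameters, from the configuration table
BYTES_PER_PARAM_BF16 = 2   # BF16 stores each parameter in 2 bytes

weight_bytes = TOTAL_PARAMS * BYTES_PER_PARAM_BF16
weight_gib = weight_bytes / 2**30
print(f"BF16 weights: ~{weight_gib:.1f} GiB")
```

This comes to a little over 32 GiB for the weights, so `device_map="auto"` may shard the model across several GPUs or offload part of it to CPU memory on smaller setups.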

Citation

@article{marco-moe,
  title={Marco-MoE: Open Multilingual Mixture-of-Expert Language Models with Efficient Upcycling},
  author={Fan Jiang and Yu Zhao and Chenyang Lyu and Tianqi Shi and Yichao Du and Feihu Jiang and Longyue Wang and Weihua Luo},
  year={2026}
}

License

This model is released under the Apache 2.0 License.

Safetensors: model size 17B params, tensor type BF16