DeepSeek-V4-Flash-JANGTQ2

Uniform 2-bit MXTQ TurboQuant baseline of DeepSeek-V4-Flash: 79.6 GB, 70.00% on MMLU 200q (logit mode), 22.3 tok/s decode on an M3 Ultra.

Built with jang_tools for Apple Silicon (MLX). Verified on Mac Studio M3 Ultra.

The canonical baseline tier in the JANG family: a uniform 2-bit MXTQ codec on all routed experts, with no per-importance, per-layer plan. The recipe is simpler than premium JANGTQ, and at near-identical size it matches premium quality under the fair seed (within 0.5pp).

Recipe

Tensor class                                    Bits    Codec           Notes
─────────────────────────────────────────────────────────────────────────────
Routed experts (all 256 × 43 layers, uniform)   2-bit   MXTQ codebook   Lloyd-Max codebook + Hadamard rotation
Attention (wq_a, wq_b, wkv, wo_a, wo_b)         8-bit   affine gs=32    All 43 layers, uniform
Shared experts                                  8-bit   affine gs=32    1 instance/layer
Compressor + Indexer (long-ctx)                 8-bit   affine gs=32    Active when VMLX_DSV4_LONG_CTX=1
embed_tokens, lm_head                           8-bit   affine gs=32    Per-token I/O
Norms / router gate / mHC                       fp16    passthrough     Required for runtime correctness
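
For intuition about the routed-expert row, below is a minimal NumPy sketch of its two named ingredients: a Hadamard rotation followed by a 4-entry (2-bit) Lloyd-Max codebook fit. This is illustrative only, not the jang_tools MXTQ implementation; the 128-wide group size and the fitting loop are assumptions.

import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix via the Sylvester construction (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max_2bit(x: np.ndarray, iters: int = 25):
    """Fit a 4-entry (2-bit) codebook to x with Lloyd's algorithm."""
    codebook = np.quantile(x, [0.125, 0.375, 0.625, 0.875])  # spread the initial centroids
    for _ in range(iters):
        codes = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)  # nearest-centroid assignment
        for k in range(4):
            if np.any(codes == k):
                codebook[k] = x[codes == k].mean()                     # centroid update
    codes = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)      # final assignment
    return codebook, codes

w = np.random.randn(128)                 # one 128-wide weight group (group size assumed)
H = hadamard(128)
codebook, codes = lloyd_max_2bit(H @ w)  # rotate first, then quantize
w_hat = H.T @ codebook[codes]            # decode, then de-rotate
print("2-bit reconstruction MSE:", np.mean((w - w_hat) ** 2))

The rotation matters because a Hadamard transform spreads outlier weights across the whole group, which keeps a 4-level codebook from wasting levels on a few extreme values.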

vs JANGTQ (premium): JANGTQ uses a per-importance plan (hash-routed L0-L2 at 4-bit MXTQ, the rest at 2-bit MXTQ), while JANGTQ2 is uniform 2-bit MXTQ throughout: a simpler recipe with a smaller risk surface, and slightly less aggressive.
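
To make the plan difference concrete, here is a hypothetical per-layer bit map for each tier. The dict layout and key names are illustrative only, not the actual jang_tools plan schema, and "hash-routed L0-L2" is read here as layers 0-2:

# Illustrative bit plans; NOT the jang_tools plan schema.
# Premium JANGTQ: hash-routed layers 0-2 keep 4-bit experts, the rest get 2-bit.
PREMIUM_PLAN = {
    f"layers.{i}.routed_experts": ("4-bit MXTQ" if i < 3 else "2-bit MXTQ")
    for i in range(43)
}
# JANGTQ2 (this bundle): every layer gets the same codec, no importance ranking.
UNIFORM_PLAN = {f"layers.{i}.routed_experts": "2-bit MXTQ" for i in range(43)}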

Benchmarks

MMLU 200q logit-mode (fair seed, PYTHONHASHSEED=42, identical questions across all bundles)

Bundle                                     Size      MMLU 200q   Decode tok/s
─────────────────────────────────────────────────────────────────────────────
DeepSeek-V4-Flash-JANGTQ (premium)         79 GB     69.50%      25.91
DeepSeek-V4-Flash-JANGTQ2 (this)           79.6 GB   70.00%      22.34
DeepSeek-V4-Flash-JANG_2L                  107 GB    71.50%      23.77
mlx-community/DeepSeek-V4-Flash-2bit-DQ    90 GB     50.00%      36.03
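
Read against premium JANGTQ, this bundle scores 0.50pp higher (70.00% vs 69.50%), consistent with the within-0.5pp claim above, at a cost of 0.6 GB and roughly 3.6 tok/s of decode speed; the community 2bit-DQ bundle is faster and larger but trails by 20pp.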

MMLU per-subject (200q stratified, 5 questions per subject)

Subject                                  Score
─────────────────────────────────────────────
high_school_government_and_politics      5/5  (100%)
public_relations                         5/5  (100%)
computer_security                        5/5  (100%)
philosophy                               5/5  (100%)
high_school_us_history                   5/5  (100%)
marketing                                5/5  (100%)
high_school_macroeconomics               5/5  (100%)
high_school_psychology                   5/5  (100%)
high_school_microeconomics               5/5  (100%)
conceptual_physics                       5/5  (100%)
logical_fallacies                        4/5  (80%)
high_school_computer_science             4/5  (80%)
human_sexuality                          4/5  (80%)
college_medicine                         4/5  (80%)
miscellaneous                            4/5  (80%)
clinical_knowledge                       4/5  (80%)
college_physics                          4/5  (80%)
high_school_geography                    4/5  (80%)
professional_medicine                    4/5  (80%)
high_school_biology                      4/5  (80%)
prehistory                               4/5  (80%)
world_religions                          4/5  (80%)
nutrition                                4/5  (80%)
virology                                 3/5  (60%)
high_school_chemistry                    3/5  (60%)
jurisprudence                            3/5  (60%)
professional_law                         3/5  (60%)
management                               3/5  (60%)
moral_disputes                           3/5  (60%)
professional_psychology                  3/5  (60%)
econometrics                             3/5  (60%)
formal_logic                             2/5  (40%)
security_studies                         2/5  (40%)
high_school_european_history             2/5  (40%)
high_school_statistics                   2/5  (40%)
high_school_mathematics                  2/5  (40%)
high_school_world_history                1/5  (20%)
business_ethics                          1/5  (20%)
abstract_algebra                         1/5  (20%)
human_aging                              1/5  (20%)
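
The 200-question set above is exactly 40 subjects × 5 questions. Below is a minimal sketch of how such a fair-seed stratified draw can be made reproducible across bundles; it is an assumption about the harness, not its actual code (PYTHONHASHSEED=42 additionally pins Python's hash randomization):

import random

def stratified_sample(questions_by_subject: dict, per_subject: int = 5, seed: int = 42) -> list:
    """Draw the same per-subject questions on every run for a given seed."""
    rng = random.Random(seed)                     # pinned RNG: every bundle is
    picked = []                                   # evaluated on identical questions
    for subject in sorted(questions_by_subject):  # sorted keys: independent of dict order
        picked += rng.sample(questions_by_subject[subject], per_subject)
    return picked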

HumanEval+ pass@1

Coming soon; a greedy run (T=0.0, max_tokens=4000, seed=42) is in flight.
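
For reference, a greedy mlx-lm decode matching those settings would look like the sketch below. The sampler call is mlx-lm's public API; the problem stub is a placeholder, and this is not the actual eval harness:

import mlx.core as mx
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm.generate import generate
from mlx_lm.sample_utils import make_sampler

mx.random.seed(42)                # seed=42 (greedy decoding is deterministic anyway)
model, tok = load_jangtq_model("JANGQ-AI/DeepSeek-V4-Flash-JANGTQ2")

sampler = make_sampler(temp=0.0)  # T=0.0 collapses sampling to greedy argmax
prompt = "def add(a: int, b: int) -> int:\n    ..."  # placeholder HumanEval+-style stub
completion = generate(model, tok, prompt=prompt, max_tokens=4000, sampler=sampler)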

Use

import os

# Wired-memory ceiling for a Mac Studio M3 Ultra; set before loading the model.
os.environ["JANG_WIRED_LIMIT_GB"] = "160"
# Long context (optional, for >128-token attention recall):
# os.environ["VMLX_DSV4_LONG_CTX"] = "1"

from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm.generate import generate

# Downloads the bundle and assembles the quantized MLX model + tokenizer.
model, tok = load_jangtq_model("JANGQ-AI/DeepSeek-V4-Flash-JANGTQ2")

text = tok.apply_chat_template(
    [{"role": "user", "content": "What is 2+2?"}],
    tokenize=False, add_generation_prompt=True,
)
# verbose=True streams tokens to stdout as they decode; the call also returns the text.
response = generate(model, tok, prompt=text, max_tokens=200, verbose=True)

Related bundles

- DeepSeek-V4-Flash-JANGTQ (premium tier, per-importance plan)
- DeepSeek-V4-Flash-JANG_2L (larger 107 GB tier)

Credits

Created by Jinho Jang — eric@jangq.ai

Built on top of DeepSeek-V4-Flash (deepseek-ai).
