DeepSeek-V4-Flash-JANGTQ2

Uniform 2-bit MXTQ TurboQuant baseline of DeepSeek-V4-Flash: 79.6 GB, 70.00% on MMLU 200q (logit mode), 22.3 tok/s decode on an M3 Ultra.

Built with jang_tools for Apple Silicon (MLX). Verified on Mac Studio M3 Ultra.

The canonical baseline tier in the JANG family: a uniform 2-bit MXTQ codec on all routed experts, with no per-importance, per-layer plan. The recipe is simpler than premium JANGTQ, and at near-identical size it matches premium quality under the fair seed (within 0.5pp).

Recipe

Tensor class                                    Bits    Codec           Notes
─────────────────────────────────────────────────────────────────────────────
Routed experts (all 256 × 43 layers, uniform)   2-bit   MXTQ codebook   Lloyd-Max codebook + Hadamard rotation
Attention (wq_a, wq_b, wkv, wo_a, wo_b)         8-bit   affine gs=32    All 43 layers, uniform
Shared experts                                  8-bit   affine gs=32    1 instance/layer
Compressor + Indexer (long-ctx)                 8-bit   affine gs=32    Active when VMLX_DSV4_LONG_CTX=1
embed_tokens, lm_head                           8-bit   affine gs=32    Per-token I/O
Norms / router gate / mHC                       fp16    passthrough     Required for runtime correctness
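
For intuition about the routed-expert row, below is a minimal NumPy sketch of its two named ingredients: a Hadamard rotation followed by a 4-entry (2-bit) Lloyd-Max codebook fit. This is illustrative only, not the jang_tools MXTQ implementation; the 128-wide group size and the fitting loop are assumptions.

import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix via the Sylvester construction (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max_2bit(x: np.ndarray, iters: int = 25):
    """Fit a 4-entry (2-bit) codebook to x with Lloyd's algorithm."""
    codebook = np.quantile(x, [0.125, 0.375, 0.625, 0.875])  # spread the initial centroids
    for _ in range(iters):
        codes = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)  # nearest-centroid assignment
        for k in range(4):
            if np.any(codes == k):
                codebook[k] = x[codes == k].mean()                     # centroid update
    codes = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)      # final assignment
    return codebook, codes

w = np.random.randn(128)                 # one 128-wide weight group (group size assumed)
H = hadamard(128)
codebook, codes = lloyd_max_2bit(H @ w)  # rotate first, then quantize
w_hat = H.T @ codebook[codes]            # decode, then de-rotate
print("2-bit reconstruction MSE:", np.mean((w - w_hat) ** 2))

The rotation matters because a Hadamard transform spreads outlier weights across the whole group, which keeps a 4-level codebook from wasting levels on a few extreme values.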

vs JANGTQ (premium): JANGTQ uses a per-importance plan (hash-routed L0-L2 at 4-bit MXTQ, the rest at 2-bit MXTQ), while JANGTQ2 is uniform 2-bit MXTQ throughout: a simpler recipe with a smaller risk surface, and slightly less aggressive.
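
To make the plan difference concrete, here is a hypothetical per-layer bit map for each tier. The dict layout and key names are illustrative only, not the actual jang_tools plan schema, and "hash-routed L0-L2" is read here as layers 0-2:

# Illustrative bit plans; NOT the jang_tools plan schema.
# Premium JANGTQ: hash-routed layers 0-2 keep 4-bit experts, the rest get 2-bit.
PREMIUM_PLAN = {
    f"layers.{i}.routed_experts": ("4-bit MXTQ" if i < 3 else "2-bit MXTQ")
    for i in range(43)
}
# JANGTQ2 (this bundle): every layer gets the same codec, no importance ranking.
UNIFORM_PLAN = {f"layers.{i}.routed_experts": "2-bit MXTQ" for i in range(43)}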

Benchmarks

MMLU 200q logit-mode (fair seed, PYTHONHASHSEED=42, identical questions across all bundles)

Bundle                                     Size      MMLU 200q   Decode tok/s
─────────────────────────────────────────────────────────────────────────────
DeepSeek-V4-Flash-JANGTQ (premium)         79 GB     69.50%      25.91
DeepSeek-V4-Flash-JANGTQ2 (this)           79.6 GB   70.00%      22.34
DeepSeek-V4-Flash-JANG_2L                  107 GB    71.50%      23.77
mlx-community/DeepSeek-V4-Flash-2bit-DQ    90 GB     50.00%      36.03
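
Read against premium JANGTQ, this bundle scores 0.50pp higher (70.00% vs 69.50%), consistent with the within-0.5pp claim above, at a cost of 0.6 GB and roughly 3.6 tok/s of decode speed; the community 2bit-DQ bundle is faster and larger but trails by 20pp.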

MMLU per-subject (200q stratified, 5 questions per subject)

Subject                                  Score
─────────────────────────────────────────────
high_school_government_and_politics      5/5  (100%)
public_relations                         5/5  (100%)
computer_security                        5/5  (100%)
philosophy                               5/5  (100%)
high_school_us_history                   5/5  (100%)
marketing                                5/5  (100%)
high_school_macroeconomics               5/5  (100%)
high_school_psychology                   5/5  (100%)
high_school_microeconomics               5/5  (100%)
conceptual_physics                       5/5  (100%)
logical_fallacies                        4/5  (80%)
high_school_computer_science             4/5  (80%)
human_sexuality                          4/5  (80%)
college_medicine                         4/5  (80%)
miscellaneous                            4/5  (80%)
clinical_knowledge                       4/5  (80%)
college_physics                          4/5  (80%)
high_school_geography                    4/5  (80%)
professional_medicine                    4/5  (80%)
high_school_biology                      4/5  (80%)
prehistory                               4/5  (80%)
world_religions                          4/5  (80%)
nutrition                                4/5  (80%)
virology                                 3/5  (60%)
high_school_chemistry                    3/5  (60%)
jurisprudence                            3/5  (60%)
professional_law                         3/5  (60%)
management                               3/5  (60%)
moral_disputes                           3/5  (60%)
professional_psychology                  3/5  (60%)
econometrics                             3/5  (60%)
formal_logic                             2/5  (40%)
security_studies                         2/5  (40%)
high_school_european_history             2/5  (40%)
high_school_statistics                   2/5  (40%)
high_school_mathematics                  2/5  (40%)
high_school_world_history                1/5  (20%)
business_ethics                          1/5  (20%)
abstract_algebra                         1/5  (20%)
human_aging                              1/5  (20%)
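
The 200-question set above is exactly 40 subjects × 5 questions. Below is a minimal sketch of how such a fair-seed stratified draw can be made reproducible across bundles; it is an assumption about the harness, not its actual code (PYTHONHASHSEED=42 additionally pins Python's hash randomization):

import random

def stratified_sample(questions_by_subject: dict, per_subject: int = 5, seed: int = 42) -> list:
    """Draw the same per-subject questions on every run for a given seed."""
    rng = random.Random(seed)                     # pinned RNG: every bundle is
    picked = []                                   # evaluated on identical questions
    for subject in sorted(questions_by_subject):  # sorted keys: independent of dict order
        picked += rng.sample(questions_by_subject[subject], per_subject)
    return picked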

HumanEval+ pass@1

Coming soon; a greedy run (T=0.0, max_tokens=4000, seed=42) is in flight.
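
For reference, a greedy mlx-lm decode matching those settings would look like the sketch below. The sampler call is mlx-lm's public API; the problem stub is a placeholder, and this is not the actual eval harness:

import mlx.core as mx
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm.generate import generate
from mlx_lm.sample_utils import make_sampler

mx.random.seed(42)                # seed=42 (greedy decoding is deterministic anyway)
model, tok = load_jangtq_model("JANGQ-AI/DeepSeek-V4-Flash-JANGTQ2")

sampler = make_sampler(temp=0.0)  # T=0.0 collapses sampling to greedy argmax
prompt = "def add(a: int, b: int) -> int:\n    ..."  # placeholder HumanEval+-style stub
completion = generate(model, tok, prompt=prompt, max_tokens=4000, sampler=sampler)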

Use

import os

# Wired-memory ceiling for a Mac Studio M3 Ultra; set before loading the model.
os.environ["JANG_WIRED_LIMIT_GB"] = "160"
# Long context (optional, for >128-token attention recall):
# os.environ["VMLX_DSV4_LONG_CTX"] = "1"

from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm.generate import generate

# Downloads the bundle and assembles the quantized MLX model + tokenizer.
model, tok = load_jangtq_model("JANGQ-AI/DeepSeek-V4-Flash-JANGTQ2")

text = tok.apply_chat_template(
    [{"role": "user", "content": "What is 2+2?"}],
    tokenize=False, add_generation_prompt=True,
)
# verbose=True streams tokens to stdout as they decode; the call also returns the text.
response = generate(model, tok, prompt=text, max_tokens=200, verbose=True)

Related bundles

- DeepSeek-V4-Flash-JANGTQ (premium tier, per-importance plan)
- DeepSeek-V4-Flash-JANG_2L (larger 107 GB tier)

Credits

Created by Jinho Jang — eric@jangq.ai

Built on top of DeepSeek-V4-Flash (deepseek-ai).
