# DeepSeek-V4-Flash-JANGTQ2
Uniform 2-bit MXTQ TurboQuant baseline of DeepSeek-V4-Flash. 79.6 GB. 70.00% MMLU 200q logit @ 22.3 tok/s on M3 Ultra.
Built with jang_tools for Apple Silicon (MLX). Verified on Mac Studio M3 Ultra.
The canonical baseline tier in the JANG family: a uniform 2-bit MXTQ codec on all routed experts (no per-importance, per-layer plan). A simpler recipe than JANGTQ (premium) that matches its quality under the fair seed (within 0.5 pp) at near-identical size.
## Recipe
| Tensor class | Bits | Codec | Notes |
|---|---|---|---|
| Routed experts (all 256 × 43 layers, uniform) | 2-bit | MXTQ codebook | Lloyd-Max codebook + Hadamard rotation |
| Attention (wq_a, wq_b, wkv, wo_a, wo_b) | 8-bit | affine gs=32 | All 43 layers, uniform |
| Shared experts | 8-bit | affine gs=32 | 1 instance/layer |
| Compressor + Indexer (long-ctx) | 8-bit | affine gs=32 | Active when VMLX_DSV4_LONG_CTX=1 |
| embed_tokens, lm_head | 8-bit | affine gs=32 | Per-token I/O |
| Norms / router gate / mHC | fp16 | passthrough | Required for runtime correctness |
**vs JANGTQ (premium):** JANGTQ uses a per-importance plan (hash-routed L0-L2 at 4-bit MXTQ, the rest at 2-bit MXTQ). JANGTQ2 is uniform 2-bit MXTQ throughout: simpler, with a smaller risk surface, and slightly less aggressive.
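The MXTQ row above names two standard ingredients: a Hadamard rotation that Gaussianizes each weight block, followed by a Lloyd-Max fit of a 4-entry (2-bit) codebook. The following is a toy NumPy sketch of that general recipe, not jang_tools' actual MXTQ codec; the block size, codebook init, and lack of bit-packing are illustrative assumptions.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two. Scaled to be orthonormal.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max_2bit(w, iters=25):
    # Fit a 4-level (2-bit) codebook by Lloyd-Max iteration:
    # assign each weight to its nearest level, then recenter each level.
    levels = np.quantile(w, [0.125, 0.375, 0.625, 0.875])
    for _ in range(iters):
        idx = np.abs(w[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(4):
            if np.any(idx == k):
                levels[k] = w[idx == k].mean()
    return levels, idx

n = 256                                            # toy block size (power of two)
w = np.random.default_rng(0).standard_normal(n)    # stand-in expert-weight block
H = hadamard(n)
levels, idx = lloyd_max_2bit(H @ w)                # quantize in the rotated domain
w_hat = H.T @ levels[idx]                          # look up codes, rotate back
print("2-bit MXTQ-style MSE:", np.mean((w - w_hat) ** 2))
```

The 8-bit `affine gs=32` rows follow the usual group-affine scheme: every group of 32 weights gets its own scale and zero-point. Again a sketch of the scheme, not the bundle's packed on-disk format:

```python
def affine_quant(w, bits=8, gs=32):
    # Per-group affine quantization: one (scale, zero-point) pair per gs weights.
    g = w.reshape(-1, gs)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / (2**bits - 1), 1.0)
    q = np.round((g - lo) / scale).astype(np.uint8)
    return q, scale, lo

q, scale, lo = affine_quant(w)           # reuse the toy block from above
w_hat8 = (q * scale + lo).reshape(-1)    # dequantize
print("8-bit affine max |err|:", np.abs(w - w_hat8).max())
```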
## Benchmarks
### MMLU 200q logit-mode (fair seed: PYTHONHASHSEED=42, identical questions across all bundles)
| Bundle | Size | MMLU 200q | Decode tok/s |
|---|---|---|---|
| DeepSeek-V4-Flash-JANGTQ (premium) | 79 GB | 69.50% | 25.91 |
| DeepSeek-V4-Flash-JANGTQ2 (this) | 79.6 GB | 70.00% | 22.34 |
| DeepSeek-V4-Flash-JANG_2L | 107 GB | 71.50% | 23.77 |
| mlx-community/DeepSeek-V4-Flash-2bit-DQ | 90 GB | 50.00% | 36.03 |
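Logit-mode here means no sampling: each question is a single forward pass, and the prediction is whichever answer letter has the highest next-token logit. Below is a minimal sketch of that scoring rule, assuming an mlx_lm-style `model(tokens) -> logits` call; the actual harness's prompt format and letter tokenization may differ.

```python
import mlx.core as mx

def logit_mode_answer(model, tok, prompt, choices=("A", "B", "C", "D")):
    # One forward pass; compare the logits of the candidate answer letters.
    tokens = mx.array([tok.encode(prompt)])
    next_logits = model(tokens)[0, -1]                 # logits for the next token
    letter_ids = [tok.encode(" " + c)[-1] for c in choices]
    scores = [float(next_logits[i]) for i in letter_ids]
    return choices[max(range(len(choices)), key=lambda i: scores[i])]
```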
### MMLU per-subject (200q stratified, 5 questions per subject)
| Subject | Score |
|---|---|
| high_school_government_and_politics | 5/5 (100%) |
| public_relations | 5/5 (100%) |
| computer_security | 5/5 (100%) |
| philosophy | 5/5 (100%) |
| high_school_us_history | 5/5 (100%) |
| marketing | 5/5 (100%) |
| high_school_macroeconomics | 5/5 (100%) |
| high_school_psychology | 5/5 (100%) |
| high_school_microeconomics | 5/5 (100%) |
| conceptual_physics | 5/5 (100%) |
| logical_fallacies | 4/5 (80%) |
| high_school_computer_science | 4/5 (80%) |
| human_sexuality | 4/5 (80%) |
| college_medicine | 4/5 (80%) |
| miscellaneous | 4/5 (80%) |
| clinical_knowledge | 4/5 (80%) |
| college_physics | 4/5 (80%) |
| high_school_geography | 4/5 (80%) |
| professional_medicine | 4/5 (80%) |
| high_school_biology | 4/5 (80%) |
| prehistory | 4/5 (80%) |
| world_religions | 4/5 (80%) |
| nutrition | 4/5 (80%) |
| virology | 3/5 (60%) |
| high_school_chemistry | 3/5 (60%) |
| jurisprudence | 3/5 (60%) |
| professional_law | 3/5 (60%) |
| management | 3/5 (60%) |
| moral_disputes | 3/5 (60%) |
| professional_psychology | 3/5 (60%) |
| econometrics | 3/5 (60%) |
| formal_logic | 2/5 (40%) |
| security_studies | 2/5 (40%) |
| high_school_european_history | 2/5 (40%) |
| high_school_statistics | 2/5 (40%) |
| high_school_mathematics | 2/5 (40%) |
| high_school_world_history | 1/5 (20%) |
| business_ethics | 1/5 (20%) |
| abstract_algebra | 1/5 (20%) |
| human_aging | 1/5 (20%) |
### HumanEval+ pass@1
Coming soon. A greedy run (T=0.0, max_tokens=4000, seed=42) is in flight.
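Until those numbers land, here is a sketch of the stated generation settings, assuming mlx_lm's `make_sampler` helper (where `temp=0.0` yields argmax decoding); the toy prompt is a placeholder, and the real HumanEval+ harness and prompt handling are not shown.

```python
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm.generate import generate
from mlx_lm.sample_utils import make_sampler

model, tok = load_jangtq_model("JANGQ-AI/DeepSeek-V4-Flash-JANGTQ2")
sampler = make_sampler(temp=0.0)  # greedy: take the argmax token at every step
completion = generate(model, tok, prompt="def add(a, b):\n",
                      max_tokens=4000, sampler=sampler)
```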
## Use
```python
import os

# Set the wired-memory limit before importing MLX / loading the 79.6 GB bundle.
os.environ["JANG_WIRED_LIMIT_GB"] = "160"  # Mac Studio M3 Ultra
# Long context (optional, for >128-token attention recall):
# os.environ["VMLX_DSV4_LONG_CTX"] = "1"

import mlx.core as mx
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm.generate import generate

# Download and assemble the quantized bundle.
model, tok = load_jangtq_model("JANGQ-AI/DeepSeek-V4-Flash-JANGTQ2")

# Build a chat-formatted prompt and decode up to 200 tokens.
text = tok.apply_chat_template(
    [{"role": "user", "content": "What is 2+2?"}],
    tokenize=False, add_generation_prompt=True,
)
print(generate(model, tok, prompt=text, max_tokens=200, verbose=True))
```
## Related bundles
- `JANGQ-AI/DeepSeek-V4-Flash-JANGTQ` — premium per-importance MXTQ plan (hash 4-bit + rest 2-bit, slightly faster decode)
- `JANGQ-AI/DeepSeek-V4-Flash-JANG_2L` — all-affine 2-bit production bundle (no MXTQ codec)
## Credits
Created by Jinho Jang — eric@jangq.ai
Built on top of DeepSeek-V4-Flash (deepseek-ai).