Spaces: studyOverflow/MBenchAnnotation


studyOverflow committed 6 days ago

Commit c8f2a5f (verified) · 1 parent: bffee2e

feat: migrate to MBench-V-new + MBench-A-New (V binary + V pairwise + A pairwise tabs)


Files changed (3)

README.md +17 -14
app.py +517 -634
sampling/new_task_pools.json +0 -0

README.md CHANGED

@@ -10,23 +10,26 @@ app_file: app.py
 pinned: false
 ---
-# MBench-V Human Annotation
-Gradio-based annotation UI for the MBench-V video generation benchmark.
-- **Video source (read-only)**: [studyOverflow/TempMemoryData](https://huggingface.co/datasets/studyOverflow/TempMemoryData), streamed directly from HF CDN — videos are **not** copied into this Space.
-- **Annotation sink (write)**: the same dataset repo, under `annotations/`. Submissions are batched by `CommitScheduler` and pushed every 5 minutes.
-- **Models included (6)**: `causal_forcing`, `self_forcing`, `cosmos`, `helios`, `longlive`, `memflow`. `skyreels` and `longcat` are temporarily excluded because their 0422 generation is still in progress.
-- **Tasks**: 584 task_ids × 6 models = **3504** `(model, task_id)` pairs.
-## How to use
-1. Enter your annotator name (anything unique — used to tag your submissions).
-2. Watch the video on the left; read the prompt and metadata in the middle.
-3. Give a score (1–5) and an optional note on the right.
-4. Click **Submit & Next** to move on. Your submissions are auto-committed every 5 min.
-## Notes
-- This is a minimal template. Multi-annotator deduplication, per-user task-allocation, and per-dimension scoring are **not** implemented yet — all annotators currently get a randomly shuffled pool and see tasks in their own order.
-- The environment variable `HF_TOKEN` must be set in the Space *Settings → Variables and secrets* with **write** access to `studyOverflow/TempMemoryData`.

+# MBench Annotation Platform (NEW)
+Adapted to the new dataset layout (`MBench-V-new` + `MBench-A-New`) on
+[`studyOverflow/TempMemoryData`](https://huggingface.co/datasets/studyOverflow/TempMemoryData).
+## Tabs
+1. **MBench-V Binary** — single video, "is there a memory issue?" (yes/no)
+2. **MBench-V Pairwise** — two T2V videos, 5 dimensions
+3. **MBench-A Pairwise** — two world-model videos, ≤6 dimensions
+## Annotation Sink
+Submissions are pushed to `annotations-new/` on the dataset repo every 5 minutes via
+`CommitScheduler`. Old `annotations/` is left untouched (legacy).
+## Migrated Historical Data
+`annotations-new/` already contains:
+- `migrated_v_binary.jsonl` (642 records from old `ann_bc109d66.jsonl`)
+- `migrated_a_pairwise.jsonl` (821 records from old `ann_mbench_a_*.jsonl`)
+These are read on startup so existing annotators don't see already-completed tasks again.
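
Review note: the README says the migrated JSONL files are read on startup so annotators do not see completed tasks again. The added app.py lines are not visible in this view, so the following is only a minimal sketch of that idea. It assumes records carry `model` and `task_id` keys, as the old binary records did; `load_completed` and `remaining` are hypothetical helper names, not functions from this commit.

```python
import json
from pathlib import Path
from typing import Iterable, List, Set, Tuple

Key = Tuple[str, str]  # (model, task_id); assumed record key


def load_completed(paths: Iterable[Path]) -> Set[Key]:
    """Collect (model, task_id) keys from JSONL files, skipping bad lines."""
    done: Set[Key] = set()
    for path in paths:
        if not path.exists():
            continue  # a missing file just means nothing was migrated
        with path.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    continue  # tolerate a corrupt line instead of failing startup
                if isinstance(rec, dict) and "model" in rec and "task_id" in rec:
                    done.add((rec["model"], rec["task_id"]))
    return done


def remaining(pool: List[Key], done: Set[Key]) -> List[Key]:
    """Return the pool items that have no completed record yet."""
    return [item for item in pool if item not in done]
```

Building the set once at startup keeps the per-request check to a constant-time membership test.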

app.py CHANGED

@@ -1,18 +1,20 @@
 """
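-MBench Annotation Space: single-video annotation + MBench-V Pairwise + MBench-A Pairwise
-Features:
-- Tab 1 (single-video annotation): "Does this video show a memory issue?" (MBench-V)
-- Tab 2 (MBench-V Pairwise): videos from two T2V models for the same prompt, side by side (MBench-V)
-- Tab 3 (MBench-A Pairwise): world-model 401f video comparison, 4 subsets × multiple dimensions (MBench-A)
-Tech stack:
-- Gradio 5.9.1 + FastAPI video proxy
-- HuggingFace CommitScheduler pushes annotation results automatically
-- Data source: studyOverflow/TempMemoryData
-Deployment:
-  Simply replace the app.py of the HuggingFace Space.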
 """
 from __future__ import annotations
@@ -34,184 +36,107 @@ from huggingface_hub import CommitScheduler, HfApi, hf_hub_download, hf_hub_url
 # ---------------------------------------------------------------------------
 DATASET_REPO = "studyOverflow/TempMemoryData"
-MERGED_JSON_PATH = "MBench-V/merged.json"
-MODELS: list[str] = [
-    "causal_forcing",
-    "self_forcing",
-    "cosmos",
-    "helios",
-    "longlive",
-    "memflow",
-    "longcat",
-    "skyreels",
-]
 HF_TOKEN = os.environ.get("HF_TOKEN")
 ANN_DIR = Path("annotations_local")
 ANN_DIR.mkdir(exist_ok=True)
 PROCESS_ID = uuid.uuid4().hex[:8]
-# Separate files for annotation types
-ANN_FILE_BINARY = ANN_DIR / f"ann_binary_{PROCESS_ID}.jsonl"
-ANN_FILE_PAIRWISE = ANN_DIR / f"ann_pairwise_{PROCESS_ID}.jsonl"
-ANN_FILE_MBENCH_A = ANN_DIR / f"ann_mbench_a_{PROCESS_ID}.jsonl"
 COMMIT_INTERVAL_MIN = 5
 PENDING_TIMEOUT_SEC = 30 * 60
-# MBench-V Pairwise config
-PAIRWISE_DIMENSIONS = [
-    ("entity", "实体一致性", "人物/物体离开画面再回来后，哪个视频中实体外观更一致？"),
-    ("physical", "物理合理性", "哪个视频中的物理过程（水流/碰撞/变形等）更合理自然？"),
-    ("prompt", "Prompt 忠实度", "哪个视频的内容更符合下方的文字描述？"),
-]
-PAIRWISE_SAMPLES_PER_PAIR = 30
-# ---------------------------------------------------------------------------
-# MBench-A Config
-# ---------------------------------------------------------------------------
-MBENCH_A_MODELS: list[str] = [
-    "hy_worldplay",
-    "infinite_world",
-    "lingbot_world",
-    "matrix_game_2",
-    "matrix_game_3",
-    "yume",
-]
-MBENCH_A_ANNOTATORS_PER_TASK = 3
-MBENCH_A_CATEGORY_MAP = {
-    "environment": "Spatial_401f",
-    "object": "Spatial_401f",
-    "human": "Human_401f",
-    "causal": "Casual_401f",
-}
-MBENCH_A_GT_CATEGORY_MAP = {
-    "environment": "Spatial",
-    "object": "Spatial",
-    "human": "Human",
-    "causal": "Casual",
-}
 # ---------------------------------------------------------------------------
-# Load MBench-V merged.json
 # ---------------------------------------------------------------------------
-def _load_merged() -> list[dict[str, Any]]:
-    try:
-        local = hf_hub_download(
-            repo_id=DATASET_REPO,
-            filename=MERGED_JSON_PATH,
-            repo_type="dataset",
-            token=HF_TOKEN,
-        )
         with open(local, encoding="utf-8") as f:
             return json.load(f)
-    except Exception as e:
-        print(f"[mbench-ann] WARNING: Failed to load MBench-V data: {e}")
-        return []
-TASKS: list[dict[str, Any]] = _load_merged()
-TASK_BY_ID: dict[str, dict[str, Any]] = {t["task_id"]: t for t in TASKS}
-# ---------------------------------------------------------------------------
-# Load MBench-A task pool
-# ---------------------------------------------------------------------------
-def _load_mbench_a_pool() -> dict[str, Any]:
-    """Load MBench-A task pool from local file or HF."""
-    local_path = Path(__file__).parent / "sampling" / "task_pool.json"
-    if local_path.exists():
-        with open(local_path, encoding="utf-8") as f:
-            return json.load(f)
-    # Fallback: try HF
-    try:
-        local = hf_hub_download(
-            repo_id=DATASET_REPO,
-            filename="MBench-A/task_pool.json",
-            repo_type="dataset",
-            token=HF_TOKEN,
-        )
-        with open(local, encoding="utf-8") as f:
-            return json.load(f)
-    except Exception as e:
-        print(f"[mbench-ann] WARNING: Failed to load MBench-A task pool: {e}")
-        return {"tasks": [], "quality_control_tasks": [], "metadata": {}}
-MBENCH_A_POOL = _load_mbench_a_pool()
-MBENCH_A_TASKS: list[dict] = MBENCH_A_POOL.get("tasks", []) + MBENCH_A_POOL.get("quality_control_tasks", [])
-MBENCH_A_TASK_BY_ID: dict[str, dict] = {t["task_id"]: t for t in MBENCH_A_TASKS}
 # ---------------------------------------------------------------------------
-# MBench-V Pool setup
 # ---------------------------------------------------------------------------
-BINARY_POOL: list[tuple[str, str]] = [(m, t["task_id"]) for m in MODELS for t in TASKS]
-BINARY_POOL_SET: set[tuple[str, str]] = set(BINARY_POOL)
-def _build_pairwise_pool() -> list[tuple[str, str, str, str]]:
-    pool = []
-    task_ids = [t["task_id"] for t in TASKS[:PAIRWISE_SAMPLES_PER_PAIR]]
-    for tid in task_ids:
-        for i, m_a in enumerate(MODELS):
-            for m_b in MODELS[i+1:]:
-                for dim_key, _, _ in PAIRWISE_DIMENSIONS:
-                    pool.append((tid, m_a, m_b, dim_key))
-    return pool
-PAIRWISE_POOL: list[tuple[str, str, str, str]] = _build_pairwise_pool()
-PAIRWISE_POOL_SET: set[tuple[str, str, str, str]] = set(PAIRWISE_POOL)
-print(f"[mbench-ann] MBench-V: {len(TASKS)} tasks × {len(MODELS)} models")
-print(f"[mbench-ann] MBench-V binary pool: {len(BINARY_POOL)}, pairwise pool: {len(PAIRWISE_POOL)}")
-print(f"[mbench-ann] MBench-A: {len(MBENCH_A_TASKS)} tasks, {len(MBENCH_A_POOL.get('metadata', {}))} metadata")
 # ---------------------------------------------------------------------------
-# Video URL helpers
 # ---------------------------------------------------------------------------
-def _video_url(model: str, task_id: str) -> str:
-    return f"/video/{model}/{task_id}.mp4"
-def _hf_video_url(model: str, task_id: str) -> str:
     return hf_hub_url(
         DATASET_REPO,
-        filename=f"MBench-V/{model}/videos/{task_id}.mp4",
         repo_type="dataset",
     )
-def _mbench_a_video_proxy_url(model: str, subset: str, sample_id: str) -> str:
-    """Build local proxy URL for MBench-A video."""
-    category = MBENCH_A_CATEGORY_MAP[subset]
-    return f"/video_a/{model}/{category}/{sample_id}/left_then_right.mp4"
-def _mbench_a_hf_video_url(model: str, category: str, sample_id: str) -> str:
-    """Build HF upstream URL for MBench-A video."""
-    return hf_hub_url(
-        DATASET_REPO,
-        filename=f"MBench-A/{model}/{category}/{sample_id}/left_then_right.mp4",
-        repo_type="dataset",
-    )
-def _mbench_a_asset_hf_url(path: str) -> str:
-    """Build HF URL for MBench-A assets."""
     return hf_hub_url(
         DATASET_REPO,
-        filename=f"MBench-A/assets/{path}",
         repo_type="dataset",
     )
-def _extract_prompt(task: dict[str, Any]) -> str:
-    gp = task.get("generation_prompts") or {}
-    prompts = gp.get("prompts") or {}
-    for level in ("level_3", "level_4", "level_2", "level_1"):
-        val = prompts.get(level)
-        if isinstance(val, list) and val:
-            n = len(val)
-            return "\n\n".join(f"— 第 {i}/{n} 段 —\n{seg}" for i, seg in enumerate(val, 1))
-        if isinstance(val, str) and val:
-            return val
-    return "(no prompt found)"
 def _render_video_html(url: str) -> str:
     return (
@@ -221,94 +146,7 @@ def _render_video_html(url: str) -> str:
     )
 # ---------------------------------------------------------------------------
-# MBench-A: Auxiliary info rendering
-# ---------------------------------------------------------------------------
-def _render_mbench_a_aux(task: dict) -> str:
-    """Render auxiliary HTML info based on task subset."""
-    subset = task["subset"]
-    # Use CSS class for guaranteed visibility (Gradio themes can override inline styles)
-    box = 'class="aux-info-box"'
-    # Camera motion info (shown for ALL subsets)
-    motion = task.get("camera_motion", "left_then_right")
-    motion_desc = task.get("camera_motion_description", motion)
-    gif_url = _mbench_a_asset_hf_url(f"camera_diagrams/{motion}.gif")
-    camera_html = (
-        f'<div style="flex:0 0 200px">'
-        f'<p><b>🎬 预期相机运动</b></p>'
-        f'<p style="margin:0 0 8px">{motion_desc}</p>'
-        f'<img src="{gif_url}" style="width:180px">'
-        f'</div>'
-    )
-    # Caption (shown for ALL subsets now)
-    caption = task.get("caption", "")
-    caption_html = ""
-    if caption:
-        caption_html = (
-            f'<div style="flex:1;min-width:250px">'
-            f'<p><b>📝 场景描述</b></p>'
-            f'<p style="font-size:14px;line-height:1.5">{caption}</p>'
-            f'</div>'
-        )
-    if subset == "object":
-        sample_id = task["sample_id"]
-        mask_url = _mbench_a_asset_hf_url(f"mask_viz/{sample_id}.png")
-        return (
-            f'<div {box}>'
-            f'<p><b>🎯 请关注画面中被标注（高亮）的物体</b></p>'
-            f'<div style="display:flex;gap:16px;flex-wrap:wrap;align-items:flex-start;margin-top:8px">'
-            f'<div style="flex:1;min-width:300px">'
-            f'<img src="{mask_url}" style="max-width:100%;max-height:280px">'
-            f'</div>'
-            f'{camera_html}'
-            f'{caption_html}'
-            f'</div></div>'
-        )
-    elif subset == "causal":
-        return (
-            f'<div {box}>'
-            f'<div style="display:flex;gap:16px;flex-wrap:wrap;align-items:flex-start">'
-            f'{camera_html}'
-            f'{caption_html}'
-            f'</div></div>'
-        )
-    elif subset == "human":
-        return (
-            f'<div {box}>'
-            f'<p><b>👤 请关注视频中的人物</b>：观察人物离开画面再回来后，面部和外观是否保持一致。</p>'
-            f'<div style="display:flex;gap:16px;flex-wrap:wrap;align-items:flex-start;margin-top:8px">'
-            f'{camera_html}'
-            f'{caption_html}'
-            f'</div></div>'
-        )
-    else:  # environment
-        return (
-            f'<div {box}>'
-            f'<p><b>🏞️ 请关注整体场景</b>：观察相机转回来后，场景的布局、风格、光照是否保持一致。</p>'
-            f'<div style="display:flex;gap:16px;flex-wrap:wrap;align-items:flex-start;margin-top:8px">'
-            f'{camera_html}'
-            f'{caption_html}'
-            f'</div></div>'
-        )
-        return (
-            f'<div {box}>'
-            f'<div style="display:flex;gap:16px;flex-wrap:wrap;align-items:flex-start">'
-            f'<div style="flex:1;min-width:250px">'
-            f'<p><b>🏞️ 请关注整体场景</b>：观察相机转回来后，场景的布局、风格、光照是否保持一致。</p>'
-            f'</div>'
-            f'{camera_html}'
-            f'</div></div>'
-        )
-# ---------------------------------------------------------------------------
-# CommitScheduler
 # ---------------------------------------------------------------------------
 scheduler: CommitScheduler | None = None
@@ -317,7 +155,7 @@ if HF_TOKEN:
         repo_id=DATASET_REPO,
         repo_type="dataset",
         folder_path=str(ANN_DIR),
-        path_in_repo="annotations",
         every=COMMIT_INTERVAL_MIN,
         token=HF_TOKEN,
         private=False,
@@ -325,20 +163,21 @@ if HF_TOKEN:
     )
 # ---------------------------------------------------------------------------
-# Historical annotations
 # ---------------------------------------------------------------------------
-def _fetch_remote_annotations() -> list[dict[str, Any]]:
-    records: list[dict[str, Any]] = []
     try:
         api = HfApi(token=HF_TOKEN)
         files = api.list_repo_files(repo_id=DATASET_REPO, repo_type="dataset")
     except Exception:
         return records
-    jsonls = [p for p in files if p.startswith("annotations/") and p.endswith(".jsonl")]
     for path in jsonls:
         try:
-            local = hf_hub_download(repo_id=DATASET_REPO, filename=path, repo_type="dataset", token=HF_TOKEN)
             with open(local, encoding="utf-8") as f:
                 for line in f:
                     line = line.strip()
@@ -351,7 +190,8 @@ def _fetch_remote_annotations() -> list[dict[str, Any]]:
             pass
     return records
-HISTORICAL = _fetch_remote_annotations()
 # ---------------------------------------------------------------------------
 # Shared state
@@ -359,49 +199,43 @@ HISTORICAL = _fetch_remote_annotations()
 STATE_LOCK = threading.Lock()
-# Binary state
-BINARY_SUBMITTED: set[tuple[str, str]] = {
-    (r["model"], r["task_id"]) for r in HISTORICAL
-    if r.get("type", "binary") == "binary" and "model" in r and "task_id" in r
-    and (r["model"], r["task_id"]) in BINARY_POOL_SET
-}
-BINARY_PENDING: dict[tuple[str, str], tuple[str, float]] = {}
-# MBench-V Pairwise state
-PAIRWISE_SUBMITTED: set[tuple[str, str, str, str]] = {
-    (r["task_id"], r["model_a"], r["model_b"], r["dimension"])
-    for r in HISTORICAL
-    if r.get("type") == "pairwise"
-    and all(k in r for k in ("task_id", "model_a", "model_b", "dimension"))
-}
-PAIRWISE_PENDING: dict[tuple[str, str, str, str], tuple[str, float]] = {}
-# MBench-A state: task_id -> list of annotators who completed it
-MBENCH_A_COMPLETED: dict[str, list[str]] = defaultdict(list)
 for r in HISTORICAL:
-    if r.get("type") == "pairwise_mbench_a" and "task_id" in r and "annotator" in r:
-        tid = r["task_id"]
-        # Handle old format where task_id might be stored differently
-        if tid in MBENCH_A_TASK_BY_ID:
-            MBENCH_A_COMPLETED[tid].append(r["annotator"])
-MBENCH_A_PENDING: dict[str, tuple[str, float]] = {}
-print(f"[mbench-ann] binary submitted: {len(BINARY_SUBMITTED)}")
-print(f"[mbench-ann] pairwise submitted: {len(PAIRWISE_SUBMITTED)}")
-print(f"[mbench-ann] MBench-A completed: {sum(len(v) for v in MBENCH_A_COMPLETED.values())} annotations across {len(MBENCH_A_COMPLETED)} tasks")
 # ---------------------------------------------------------------------------
-# Queue helpers
 # ---------------------------------------------------------------------------
-def _reap_expired(pending_dict):
     now = time.time()
-    expired = [k for k, (_, ts) in pending_dict.items() if now - ts > PENDING_TIMEOUT_SEC]
     for k in expired:
-        pending_dict.pop(k, None)
-def _append_annotation(record: dict[str, Any], ann_file: Path) -> None:
     line = json.dumps(record, ensure_ascii=False)
     if scheduler is not None:
         with scheduler.lock:
@@ -411,394 +245,422 @@ def _append_annotation(record: dict[str, Any], ann_file: Path) -> None:
         with ann_file.open("a", encoding="utf-8") as f:
             f.write(line + "\n")
 # ---------------------------------------------------------------------------
-# Binary annotation callbacks (MBench-V)
 # ---------------------------------------------------------------------------
-def binary_start(annotator: str, state: dict):
     annotator = (annotator or "").strip()
     if not annotator:
-        return state, "<p>请先输入名字。</p>", "", "", "⚠️ 请输入名字", ""
-    order = list(range(len(BINARY_POOL)))
     random.shuffle(order)
-    state = {"annotator": annotator, "order": order, "idx": 0, "current": None, "count": 0}
-    return _binary_next(state)
-def _binary_next(state):
     annotator = state["annotator"]
     order = state["order"]
     idx = state.get("idx", 0)
     with STATE_LOCK:
-        _reap_expired(BINARY_PENDING)
         while idx < len(order):
-            mt = BINARY_POOL[order[idx]]
-            if mt in BINARY_SUBMITTED or mt in BINARY_PENDING:
-                idx += 1
-                continue
-            BINARY_PENDING[mt] = (annotator, time.time())
             state["idx"] = idx
-            state["current"] = mt
-            model, task_id = mt
-            task = TASK_BY_ID[task_id]
-            video_html = _render_video_html(_video_url(model, task_id))
-            meta = f"**模型**: `{model}` | **task_id**: `{task_id}` | **已提交**: {state['count']}"
-            prompt = _extract_prompt(task)
-            n_sub = len(BINARY_SUBMITTED)
-            stats = f"全局进度: {n_sub}/{len(BINARY_POOL)} ({100*n_sub/len(BINARY_POOL):.1f}%)"
-            return state, video_html, meta, prompt, f"✅ 已加载", stats
         state["current"] = None
-        return state, "<p>🎉 全部完成！</p>", "全部标注完成", "", "完成", f"已完成 {len(BINARY_SUBMITTED)}/{len(BINARY_POOL)}"
-def binary_submit(state, verdict, note):
     if not state or not state.get("current"):
-        return state, "<p>请先登录</p>", "", "", "否", "", "⚠️", ""
-    mt = state["current"]
-    model, task_id = mt
     record = {
-        "type": "binary",
         "timestamp": time.time(),
         "annotator": state["annotator"],
-        "model": model,
-        "task_id": task_id,
         "memory_issue": verdict == "是",
         "verdict": verdict,
         "note": (note or "").strip(),
     }
-    _append_annotation(record, ANN_FILE_BINARY)
     with STATE_LOCK:
-        BINARY_PENDING.pop(mt, None)
-        BINARY_SUBMITTED.add(mt)
     state["count"] = state.get("count", 0) + 1
     state["idx"] = state["idx"] + 1
     state["current"] = None
-    result = _binary_next(state)
-    return result[0], result[1], result[2], result[3], "否", "", f"✅ 已提交第 {state['count']} 条", result[5]
-def binary_skip(state):
     if not state or not state.get("current"):
-        return state, "<p>请先登录</p>", "", "", "否", "", "⚠️", ""
-    mt = state["current"]
     with STATE_LOCK:
-        BINARY_PENDING.pop(mt, None)
     state["idx"] = state["idx"] + 1
     state["current"] = None
-    result = _binary_next(state)
-    return result[0], result[1], result[2], result[3], "否", "", "⏭️ 已跳过", result[5]
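
Review note: the removed binary flow reserves a task in `BINARY_PENDING` while holding `STATE_LOCK`, and `_reap_expired` releases reservations older than `PENDING_TIMEOUT_SEC`. The sketch below isolates that reservation pattern so it can be checked on its own. `claim_next` and the explicit `now` parameter are hypothetical additions for testability, not part of app.py.

```python
import threading
import time
from typing import Dict, Hashable, Iterable, Optional, Set, Tuple

PENDING_TIMEOUT_SEC = 30 * 60  # same value as in app.py
STATE_LOCK = threading.Lock()

Pending = Dict[Hashable, Tuple[str, float]]  # item -> (annotator, claim time)


def reap_expired(pending: Pending, now: float) -> None:
    """Drop reservations that have been held longer than the timeout."""
    expired = [k for k, (_, ts) in pending.items() if now - ts > PENDING_TIMEOUT_SEC]
    for k in expired:
        pending.pop(k, None)


def claim_next(
    pool: Iterable[Hashable],
    submitted: Set[Hashable],
    pending: Pending,
    annotator: str,
    now: Optional[float] = None,
) -> Optional[Hashable]:
    """Reserve the first item that is neither submitted nor held by anyone."""
    now = time.time() if now is None else now
    with STATE_LOCK:  # reaping and claiming must happen atomically
        reap_expired(pending, now)
        for item in pool:
            if item in submitted or item in pending:
                continue
            pending[item] = (annotator, now)
            return item
    return None  # nothing left to annotate
```

Passing `now` explicitly keeps the timeout behaviour deterministic in tests; production callers would omit it.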
 # ---------------------------------------------------------------------------
-# MBench-V Pairwise annotation callbacks
 # ---------------------------------------------------------------------------
-def pairwise_start(annotator: str, dimension: str, state: dict):
     annotator = (annotator or "").strip()
     if not annotator:
-        return state, "<p>请先输入名字。</p>", "<p></p>", "", "", "⚠️ 请输入名字", ""
-    dim_pool = [(i, item) for i, item in enumerate(PAIRWISE_POOL) if item[3] == dimension]
-    order = list(range(len(dim_pool)))
     random.shuffle(order)
-    state = {
-        "annotator": annotator, "dimension": dimension, "dim_pool": dim_pool,
-        "order": order, "idx": 0, "current": None, "count": 0,
-    }
-    return _pairwise_next(state)
-def _pairwise_next(state):
     annotator = state["annotator"]
-    dim_pool = state["dim_pool"]
     order = state["order"]
     idx = state.get("idx", 0)
-    dimension = state["dimension"]
-    dim_label = dimension
-    dim_question = ""
-    for dk, dl, dq in PAIRWISE_DIMENSIONS:
-        if dk == dimension:
-            dim_label = dl
-            dim_question = dq
-            break
     with STATE_LOCK:
-        _reap_expired(PAIRWISE_PENDING)
         while idx < len(order):
-            pool_idx, item = dim_pool[order[idx]]
-            tid, m_a, m_b = item[0], item[1], item[2]
-            if item in PAIRWISE_SUBMITTED or item in PAIRWISE_PENDING:
-                idx += 1
-                continue
-            PAIRWISE_PENDING[item] = (annotator, time.time())
             state["idx"] = idx
-            state["current"] = item
             if random.random() < 0.5:
-                left_model, right_model = m_a, m_b
-                state["swapped"] = False
             else:
-                left_model, right_model = m_b, m_a
-                state["swapped"] = True
-            task = TASK_BY_ID[tid]
-            video_a_html = _render_video_html(_video_url(left_model, tid))
-            video_b_html = _render_video_html(_video_url(right_model, tid))
-            prompt = _extract_prompt(task)
-            meta = f"**维度**: {dim_label} | **问题**: {dim_question}\n\n**已提交**: {state['count']}"
-            n_sub = sum(1 for x in PAIRWISE_SUBMITTED if x[3] == dimension)
-            n_total = len(dim_pool)
-            stats = f"维度「{dim_label}」进度: {n_sub}/{n_total} ({100*n_sub/n_total:.1f}%)"
-            return state, video_a_html, video_b_html, meta, prompt, "✅ 已加载", stats
         state["current"] = None
-        return state, "<p>🎉 该维度全部完成！</p>", "", "全部完成", "", "完成", ""
-def pairwise_submit(state, verdict, note):
     if not state or not state.get("current"):
-        return state, "", "", "", "", "⚠️ 请先登录", ""
-    item = state["current"]
-    tid, m_a, m_b, dimension = item
-    swapped = state.get("swapped", False)
-    if verdict == "左边更好":
-        winner = m_b if swapped else m_a
-    elif verdict == "右边更好":
-        winner = m_a if swapped else m_b
-    else:
-        winner = "tie"
     record = {
-        "type": "pairwise",
         "timestamp": time.time(),
         "annotator": state["annotator"],
         "task_id": tid,
-        "model_a": m_a,
-        "model_b": m_b,
-        "dimension": dimension,
-        "winner": winner,
-        "verdict_raw": verdict,
         "swapped": swapped,
         "note": (note or "").strip(),
     }
-    _append_annotation(record, ANN_FILE_PAIRWISE)
     with STATE_LOCK:
-        PAIRWISE_PENDING.pop(item, None)
-        PAIRWISE_SUBMITTED.add(item)
     state["count"] = state.get("count", 0) + 1
     state["idx"] = state["idx"] + 1
     state["current"] = None
-    result = _pairwise_next(state)
-    return result[0], result[1], result[2], result[3], result[4], f"✅ 已提交第 {state['count']} 条", result[6]
-def pairwise_skip(state):
     if not state or not state.get("current"):
-        return state, "", "", "", "", "⚠️ 请先登录", ""
-    item = state["current"]
     with STATE_LOCK:
-        PAIRWISE_PENDING.pop(item, None)
     state["idx"] = state["idx"] + 1
     state["current"] = None
-    result = _pairwise_next(state)
-    return result[0], result[1], result[2], result[3], result[4], "⏭️ 已跳过", result[6]
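
Review note: `pairwise_submit` has to undo the random left/right swap before recording a winner, which is easy to get backwards. The following sketch restates that mapping as a pure function so it can be unit-tested. `resolve_winner` is a hypothetical name; the two verdict strings are the ones compared in the removed code.

```python
LEFT_BETTER = "左边更好"   # UI label meaning "left is better"
RIGHT_BETTER = "右边更好"  # UI label meaning "right is better"


def resolve_winner(verdict: str, model_a: str, model_b: str, swapped: bool) -> str:
    """Map a left/right verdict back to a model name.

    When swapped is False, model_a was shown on the left; when True,
    model_b was shown on the left.
    """
    if verdict == LEFT_BETTER:
        return model_b if swapped else model_a
    if verdict == RIGHT_BETTER:
        return model_a if swapped else model_b
    return "tie"  # any other verdict is treated as a tie, as in app.py
```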
 # ---------------------------------------------------------------------------
-# MBench-A Pairwise annotation callbacks
 # ---------------------------------------------------------------------------
-def mbench_a_start(annotator: str, state: dict):
-    """Login for MBench-A annotation."""
     annotator = (annotator or "").strip()
     if not annotator:
         return (state, "⚠️ 请输入名字", "", "", "", "",
-                gr.update(visible=False), gr.update(visible=False),
-                gr.update(visible=False), gr.update(visible=False),
-                gr.update(visible=False),
-                "", "")
-    # Count how many tasks this annotator has already completed.
-    # Check both:
-    # 1. MBENCH_A_COMPLETED (loaded from HF at startup + updated in-memory during this session)
-    # 2. The local annotation file (captures annotations made this session before any push)
-    historical_count = sum(
-        1 for anns in MBENCH_A_COMPLETED.values()
-        if annotator in anns
-    )
-    # Also scan the local file in case this session's annotations haven't been pushed yet
-    if ANN_FILE_MBENCH_A.exists():
-        with ANN_FILE_MBENCH_A.open() as f:
-            for line in f:
-                line = line.strip()
-                if not line:
-                    continue
-                try:
-                    r = json.loads(line)
-                    if r.get("annotator") == annotator and r.get("type") == "pairwise_mbench_a":
-                        tid = r.get("task_id", "")
-                        # Only count if not already counted in MBENCH_A_COMPLETED
-                        if tid in MBENCH_A_TASK_BY_ID and annotator not in MBENCH_A_COMPLETED.get(tid, []):
-                            historical_count += 1
-                except Exception:
-                    pass
-    # Shuffle task order for this annotator
-    order = list(range(len(MBENCH_A_TASKS)))
     random.shuffle(order)
-    state = {
-        "annotator": annotator,
-        "order": order,
-        "idx": 0,
-        "current_task_id": None,
-        "swapped": False,
-        "left_model": None,
-        "right_model": None,
-        "count": historical_count,
-    }
-    return _mbench_a_next(state)
-def _mbench_a_next(state: dict):
-    """Find and load the next available MBench-A task."""
     annotator = state["annotator"]
     order = state["order"]
     idx = state.get("idx", 0)
     with STATE_LOCK:
-        _reap_expired(MBENCH_A_PENDING)
         while idx < len(order):
-            task = MBENCH_A_TASKS[order[idx]]
             tid = task["task_id"]
-            # Skip if already fully annotated
-            if len(MBENCH_A_COMPLETED.get(tid, [])) >= MBENCH_A_ANNOTATORS_PER_TASK:
-                idx += 1
-                continue
-            # Skip if this annotator already did it
-            if annotator in MBENCH_A_COMPLETED.get(tid, []):
-                idx += 1
-                continue
-            # Skip if currently pending by someone else
-            if tid in MBENCH_A_PENDING and MBENCH_A_PENDING[tid][0] != annotator:
-                idx += 1
-                continue
-            # Assign this task
-            MBENCH_A_PENDING[tid] = (annotator, time.time())
             state["idx"] = idx
-            state["current_task_id"] = tid
-            # Randomly swap A/B
-            m_a, m_b = task["model_a"], task["model_b"]
             if random.random() < 0.5:
-                state["left_model"], state["right_model"] = m_a, m_b
-                state["swapped"] = False
             else:
-                state["left_model"], state["right_model"] = m_b, m_a
-                state["swapped"] = True
-            # Build UI outputs
-            subset = task["subset"]
-            video_left = _render_video_html(
-                _mbench_a_video_proxy_url(state["left_model"], subset, task["sample_id"]))
-            video_right = _render_video_html(
-                _mbench_a_video_proxy_url(state["right_model"], subset, task["sample_id"]))
-            aux_html = _render_mbench_a_aux(task)
-            # Dimension questions
             dimensions = task["dimensions"]
-            dim_questions = task.get("dimension_questions", {})
-            # Build question radio updates (max 6)
             q_updates = []
             for i in range(6):
                 if i < len(dimensions):
-                    dim_key = dimensions[i]
-                    question_text = dim_questions.get(dim_key, dim_key)
-                    q_updates.append(gr.update(
-                        visible=True,
-                        label=question_text,
-                        value="差不多",
-                    ))
                 else:
                     q_updates.append(gr.update(visible=False, value="差不多"))
-            # Meta info
-            subset_names = {"environment": "🏞️ Environment", "object": "🎯 Object",
-                           "human": "👤 Human", "causal": "⚡ Causal"}
-            n_done = sum(1 for t in MBENCH_A_TASKS
-                        if len(MBENCH_A_COMPLETED.get(t["task_id"], [])) >= MBENCH_A_ANNOTATORS_PER_TASK)
-            meta = (f"**子集**: {subset_names.get(subset, subset)} | "
-                    f"**已提交**: {state['count']}")
-            stats = (f"全局进度: {n_done}/{len(MBENCH_A_TASKS)} tasks 完成 | "
-                     f"你已标注: {state['count']}")
-            return (state, "✅ 已加载", aux_html, video_left, video_right, meta,
                     *q_updates, "", stats)
-        # All done
-        state["current_task_id"] = None
-        empty_q = gr.update(visible=False, value="差不多")
-        return (state, "🎉 全部完成！", "", "<p>所有任务已完成</p>", "", "全部完成",
-                empty_q, empty_q, empty_q, empty_q, empty_q, empty_q, "", "")
-def mbench_a_submit(state, q1_val, q2_val, q3_val, q4_val, q5_val, q6_val, note):
-    """Submit MBench-A multi-dimension annotation."""
-    if not state or not state.get("current_task_id"):
-        empty_q = gr.update(visible=False, value="差不多")
         return (state, "⚠️ 请先登录", "", "", "", "",
-                empty_q, empty_q, empty_q, empty_q, empty_q, empty_q, "", "")
-    tid = state["current_task_id"]
-    task = MBENCH_A_TASK_BY_ID[tid]
-    dimensions = task["dimensions"]
     swapped = state["swapped"]
-    m_a, m_b = task["model_a"], task["model_b"]
-    # Map verdicts to winners
-    verdicts = [q1_val, q2_val, q3_val, q4_val, q5_val, q6_val]
     dim_results = {}
-    for i, dim_key in enumerate(dimensions):
         v = verdicts[i]
         if v == "A更好":
-            # A is left; if swapped, left is model_b
-            winner = m_b if swapped else m_a
         elif v == "B更好":
-            winner = m_a if swapped else m_b
         else:
             winner = "tie"
-        dim_results[dim_key] = winner
     record = {
-        "type": "pairwise_mbench_a",
         "timestamp": time.time(),
         "annotator": state["annotator"],
         "task_id": tid,
         "subset": task["subset"],
         "sample_id": task["sample_id"],
-        "camera_motion": task.get("camera_motion", "left_then_right"),
-        "model_a": m_a,
-        "model_b": m_b,
         "dimensions": dim_results,
         "swapped": swapped,
         "note": (note or "").strip(),
     }
-    _append_annotation(record, ANN_FILE_MBENCH_A)
     with STATE_LOCK:
-        MBENCH_A_PENDING.pop(tid, None)
-        MBENCH_A_COMPLETED[tid].append(state["annotator"])
     state["count"] = state.get("count", 0) + 1
     state["idx"] = state["idx"] + 1
-    state["current_task_id"] = None
-    return _mbench_a_next(state)
-def mbench_a_skip(state):
-    """Skip current MBench-A task."""
-    if not state or not state.get("current_task_id"):
-        empty_q = gr.update(visible=False, value="差不多")
         return (state, "⚠️ 请先登录", "", "", "", "",
-                empty_q, empty_q, empty_q, empty_q, empty_q, empty_q, "", "")
-    tid = state["current_task_id"]
     with STATE_LOCK:
-        MBENCH_A_PENDING.pop(tid, None)
     state["idx"] = state["idx"] + 1
-    state["current_task_id"] = None
-    return _mbench_a_next(state)
 # ---------------------------------------------------------------------------
 # UI
@@ -806,62 +668,101 @@ def mbench_a_skip(state):
 CUSTOM_CSS = """
 #prompt_box textarea { height: 300px !important; overflow-y: auto !important; }
-.video-pair { display: flex; gap: 12px; }
-.video-pair > div { flex: 1; }
-/* Force aux info box to be visible regardless of Gradio theme */
 .aux-info-box {
-    background: #e3e8ef !important;
-    color: #111 !important;
-    padding: 14px !important;
-    border-radius: 8px !important;
-    margin-bottom: 12px !important;
-    border: 1px solid #b0b8c4 !important;
-}
-.aux-info-box * {
-    color: #111 !important;
-}
-.aux-info-box img {
-    border: 1px solid #999;
-    border-radius: 4px;
 }
 """
-with gr.Blocks(title="MBench 标注", theme=gr.themes.Soft(), css=CUSTOM_CSS) as demo:
-    gr.Markdown("# 🎬 MBench 视频标注平台")
     with gr.Tabs():
-        # ═══════════════ MBench-A Pairwise ═══════════════
-        with gr.Tab("MBench-A 对比 (World Models)"):
-            gr.Markdown(
-                "## 🌍 MBench-A — 世界模型记忆能力评测\n\n"
-                "比较两个世界模型生成的长视频（~25 秒），评估相机转走再转回来后的记忆一致性。\n\n"
-                "**视频 A/B 的模型身份已匿名随机分配。请对每个维度独立判断。**"
-            )
             a_stats = gr.Markdown("")
             a_state = gr.State({})
             with gr.Row():
-                a_name = gr.Textbox(label="标注员名字", placeholder="例如: charlie", scale=4)
                 a_login = gr.Button("开始标注", variant="primary", scale=1)
             a_status = gr.Markdown("")
-            # Auxiliary info (mask image / camera GIF + caption / instructions)
             a_aux = gr.HTML("")
-            # Video pair
             with gr.Row(equal_height=True):
                 with gr.Column(scale=1, min_width=360):
                     gr.Markdown("### 视频 A")
-                    a_video_left = gr.HTML("<p>请先登录。</p>")
                 with gr.Column(scale=1, min_width=360):
                     gr.Markdown("### 视频 B")
-                    a_video_right = gr.HTML("<p>请先登录。</p>")
-            # Task info
-            a_meta = gr.Markdown("")
-            # Multi-dimension questions (max 6, dynamically shown/hidden)
             gr.Markdown("---\n### 请对以下每个维度分别判断：")
             a_q1 = gr.Radio(["A更好", "差不多", "B更好"], value="差不多", label="维度 1", visible=False)
             a_q2 = gr.Radio(["A更好", "差不多", "B更好"], value="差不多", label="维度 2", visible=False)
@@ -869,22 +770,17 @@ with gr.Blocks(title="MBench 标注", theme=gr.themes.Soft(), css=CUSTOM_CSS) as
             a_q4 = gr.Radio(["A更好", "差不多", "B更好"], value="差不多", label="维度 4", visible=False)
             a_q5 = gr.Radio(["A更好", "差不多", "B更好"], value="差不多", label="维度 5", visible=False)
             a_q6 = gr.Radio(["A更好", "差不多", "B更好"], value="差不多", label="维度 6", visible=False)
             a_note = gr.Textbox(label="备注（可选）", lines=1)
             with gr.Row():
                 a_submit = gr.Button("✅ 提交并下一组", variant="primary")
                 a_skip = gr.Button("⏭️ 跳过")
-            # Wiring
-            a_all_outs = [a_state, a_status, a_aux, a_video_left, a_video_right, a_meta,
-                          a_q1, a_q2, a_q3, a_q4, a_q5, a_q6, a_note, a_stats]
-            a_login.click(mbench_a_start, [a_name, a_state], a_all_outs)
-            a_name.submit(mbench_a_start, [a_name, a_state], a_all_outs)
-            a_submit.click(mbench_a_submit,
-                           [a_state, a_q1, a_q2, a_q3, a_q4, a_q5, a_q6, a_note], a_all_outs)
-            a_skip.click(mbench_a_skip, [a_state], a_all_outs)
 # ---------------------------------------------------------------------------
 # Video proxy
@@ -899,7 +795,6 @@ if __name__ == "__main__":
     _video_client = httpx.AsyncClient(timeout=30.0, follow_redirects=True)
     async def _do_proxy(upstream: str, request: Request):
-        """Generic proxy for HF video/asset URLs."""
         req_headers = {}
         if (rng := request.headers.get("range")):
             req_headers["range"] = rng
@@ -910,13 +805,13 @@ if __name__ == "__main__":
             )
         except Exception as e:
             raise HTTPException(502, f"upstream fetch failed: {e}")
-        passthrough_headers = {}
         for h in ("content-type", "content-length", "accept-ranges",
                   "content-range", "etag", "last-modified"):
             if h in upstream_resp.headers:
-                passthrough_headers[h] = upstream_resp.headers[h]
-        passthrough_headers.setdefault("content-type", "video/mp4")
-        passthrough_headers["cache-control"] = "public, max-age=300"
         async def _body():
             try:
@@ -924,43 +819,31 @@ if __name__ == "__main__":
                     yield chunk
             finally:
                 await upstream_resp.aclose()
-        return StreamingResponse(_body(), status_code=upstream_resp.status_code, headers=passthrough_headers)
-    async def _proxy_video(model: str, task_id: str, request: Request):
-        """Proxy MBench-V videos."""
-        if model not in MODELS or task_id not in TASK_BY_ID:
-            raise HTTPException(404, "unknown (model, task_id)")
-        upstream = _hf_video_url(model, task_id)
         return await _do_proxy(upstream, request)
-    async def _proxy_mbench_a_video(model: str, category: str, sample_id: str, request: Request):
-        """Proxy MBench-A videos."""
-        if model not in MBENCH_A_MODELS:
-            raise HTTPException(404, f"unknown model: {model}")
-        upstream = _mbench_a_hf_video_url(model, category, sample_id)
         return await _do_proxy(upstream, request)
-    _orig_create_app = _GradioApp.create_app
-    def _patched_create_app(*args, **kwargs):
-        app = _orig_create_app(*args, **kwargs)
-        # MBench-V video proxy
-        app.add_api_route(
-            "/video/{model}/{task_id}.mp4",
-            _proxy_video,
-            methods=["GET", "HEAD"],
-            include_in_schema=False,
-        )
-        # MBench-A video proxy
-        app.add_api_route(
-            "/video_a/{model}/{category}/{sample_id}/left_then_right.mp4",
-            _proxy_mbench_a_video,
-            methods=["GET", "HEAD"],
-            include_in_schema=False,
-        )
-        print("[mbench-ann] video proxy routes registered (MBench-V + MBench-A)")
         return app
-    _GradioApp.create_app = staticmethod(_patched_create_app)
     demo.queue(default_concurrency_limit=16).launch(ssr_mode=False)

 """
+MBench Annotation Space (NEW) — adapted for MBench-V-new + MBench-A-New.
+Tabs:
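+  1. MBench-V Binary  ─ "Does this video show a memory problem?" (single video, 1 annotator per task)
+  2. MBench-V Pairwise ─ two videos, compared on 5 dimensions (3 annotators per task)
+  3. MBench-A Pairwise ─ two videos, compared on up to 6 dimensions (3 annotators per task)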
+Data sources:
+  - Videos: streamed from studyOverflow/TempMemoryData (MBench-V-new + MBench-A-New).
+  - Task pools: sampling/new_task_pools.json
+  - Sample metadata: sample.json under MBench-{V,A}-New/samples/{subset}/{sid}/
+  - Annotation sink: annotations-new/ on the dataset repo (CommitScheduler, 5 min cadence).
+Notes:
+  - All paths use the new structure (subset names: environment/object/human/causal).
+  - Old annotations in annotations/ are preserved; this app writes only to annotations-new/.
 """
 from __future__ import annotations
 # ---------------------------------------------------------------------------
 DATASET_REPO = "studyOverflow/TempMemoryData"
 HF_TOKEN = os.environ.get("HF_TOKEN")
+V_MODELS = ["causal_forcing", "self_forcing", "cosmos", "helios",
+            "longlive", "memflow", "skyreels", "longcat"]
+A_MODELS = ["hy_worldplay", "infinite_world", "lingbot_world",
+            "matrix_game_2", "matrix_game_3", "yume"]
 ANN_DIR = Path("annotations_local")
 ANN_DIR.mkdir(exist_ok=True)
 PROCESS_ID = uuid.uuid4().hex[:8]
+ANN_FILE_V_BINARY = ANN_DIR / f"v_binary_{PROCESS_ID}.jsonl"
+ANN_FILE_V_PAIRWISE = ANN_DIR / f"v_pairwise_{PROCESS_ID}.jsonl"
+ANN_FILE_A_PAIRWISE = ANN_DIR / f"a_pairwise_{PROCESS_ID}.jsonl"
 COMMIT_INTERVAL_MIN = 5
 PENDING_TIMEOUT_SEC = 30 * 60
+V_BINARY_ANNOTATORS_PER_TASK = 1
+V_PAIRWISE_ANNOTATORS_PER_TASK = 3
+A_PAIRWISE_ANNOTATORS_PER_TASK = 3
 # ---------------------------------------------------------------------------
+# Load task pools
 # ---------------------------------------------------------------------------
+def _load_pools() -> dict:
+    local = Path(__file__).parent / "sampling" / "new_task_pools.json"
+    if local.exists():
         with open(local, encoding="utf-8") as f:
             return json.load(f)
+    raise RuntimeError(f"Task pool not found at {local}")
+POOLS = _load_pools()
+V_BINARY_TASKS: list[dict] = POOLS["v_binary"]["tasks"]
+V_PAIRWISE_TASKS: list[dict] = POOLS["v_pairwise"]["tasks"]
+A_PAIRWISE_TASKS: list[dict] = (POOLS["a_pairwise"]["tasks"]
+                                 + POOLS["a_pairwise"]["quality_control_tasks"])
+V_BINARY_BY_ID = {t["task_id"]: t for t in V_BINARY_TASKS}
+V_PAIRWISE_BY_ID = {t["task_id"]: t for t in V_PAIRWISE_TASKS}
+A_PAIRWISE_BY_ID = {t["task_id"]: t for t in A_PAIRWISE_TASKS}
+print(f"[ann-new] V binary tasks: {len(V_BINARY_TASKS)}")
+print(f"[ann-new] V pairwise tasks: {len(V_PAIRWISE_TASKS)}")
+print(f"[ann-new] A pairwise tasks: {len(A_PAIRWISE_TASKS)}")
 # ---------------------------------------------------------------------------
+# Sample metadata cache (sample.json)
 # ---------------------------------------------------------------------------
+_sample_cache: dict[tuple[str, str, str], dict] = {}
+_sample_cache_lock = threading.Lock()
+def _load_sample_meta(dataset: str, subset: str, sample_id: str) -> dict:
+    key = (dataset, subset, sample_id)
+    with _sample_cache_lock:
+        if key in _sample_cache:
+            return _sample_cache[key]
+    if dataset == "mbenchv":
+        path = f"MBench-V-new/samples/{subset}/{sample_id}/sample.json"
+    else:
+        path = f"MBench-A-New/samples/{subset}/{sample_id}/sample.json"
+    try:
+        local = hf_hub_download(DATASET_REPO, path, repo_type="dataset", token=HF_TOKEN)
+        with open(local, encoding="utf-8") as f:
+            data = json.load(f)
+    except Exception as e:
+        print(f"[ann-new] sample.json load failed for {key}: {e}")
+        data = {}
+    with _sample_cache_lock:
+        _sample_cache[key] = data
+    return data
 # ---------------------------------------------------------------------------
+# Video URL helpers (proxy)
 # ---------------------------------------------------------------------------
+def _v_video_proxy_url(model: str, subset: str, sample_id: str) -> str:
+    return f"/video_v/{model}/{subset}/{sample_id}.mp4"
+def _v_video_hf_url(model: str, subset: str, sample_id: str) -> str:
     return hf_hub_url(
         DATASET_REPO,
+        filename=f"MBench-V-new/models/{model}/outputs/{subset}/{sample_id}/text/video.mp4",
         repo_type="dataset",
     )
+def _a_video_proxy_url(model: str, subset: str, sample_id: str, condition_id: str) -> str:
+    return f"/video_a/{model}/{subset}/{sample_id}/{condition_id}.mp4"
+def _a_video_hf_url(model: str, subset: str, sample_id: str, condition_id: str) -> str:
     return hf_hub_url(
         DATASET_REPO,
+        filename=f"MBench-A-New/models/{model}/outputs/{subset}/{sample_id}/{condition_id}/video.mp4",
         repo_type="dataset",
     )
+def _a_asset_hf_url(path: str) -> str:
+    """Reuse old MBench-A asset directory (camera diagrams + mask viz)."""
+    return hf_hub_url(DATASET_REPO, filename=f"MBench-A/assets/{path}", repo_type="dataset")
 def _render_video_html(url: str) -> str:
     return (
     )
 # ---------------------------------------------------------------------------
+# CommitScheduler → annotations-new/
 # ---------------------------------------------------------------------------
 scheduler: CommitScheduler | None = None
         repo_id=DATASET_REPO,
         repo_type="dataset",
         folder_path=str(ANN_DIR),
+        path_in_repo="annotations-new",
         every=COMMIT_INTERVAL_MIN,
         token=HF_TOKEN,
         private=False,
     )
 # ---------------------------------------------------------------------------
+# Load historical annotations (from annotations-new/)
 # ---------------------------------------------------------------------------
+def _fetch_annotations_new() -> list[dict]:
+    records = []
     try:
         api = HfApi(token=HF_TOKEN)
         files = api.list_repo_files(repo_id=DATASET_REPO, repo_type="dataset")
     except Exception:
         return records
+    jsonls = [p for p in files if p.startswith("annotations-new/") and p.endswith(".jsonl")]
     for path in jsonls:
         try:
+            local = hf_hub_download(repo_id=DATASET_REPO, filename=path,
+                                    repo_type="dataset", token=HF_TOKEN)
             with open(local, encoding="utf-8") as f:
                 for line in f:
                     line = line.strip()
             pass
     return records
+HISTORICAL = _fetch_annotations_new()
+print(f"[ann-new] historical records loaded: {len(HISTORICAL)}")
 # ---------------------------------------------------------------------------
 # Shared state
 STATE_LOCK = threading.Lock()
+# Each: task_id -> set of annotators who completed it
+V_BINARY_COMPLETED: dict[str, set[str]] = defaultdict(set)
+V_PAIRWISE_COMPLETED: dict[str, set[str]] = defaultdict(set)
+A_PAIRWISE_COMPLETED: dict[str, set[str]] = defaultdict(set)
 for r in HISTORICAL:
+    t = r.get("type")
+    tid = r.get("task_id")
+    ann = r.get("annotator")
+    if not (tid and ann):
+        continue
+    if t == "v_binary" and tid in V_BINARY_BY_ID:
+        V_BINARY_COMPLETED[tid].add(ann)
+    elif t == "v_pairwise" and tid in V_PAIRWISE_BY_ID:
+        V_PAIRWISE_COMPLETED[tid].add(ann)
+    elif t == "a_pairwise" and tid in A_PAIRWISE_BY_ID:
+        A_PAIRWISE_COMPLETED[tid].add(ann)
+V_BINARY_PENDING: dict[str, tuple[str, float]] = {}
+V_PAIRWISE_PENDING: dict[str, tuple[str, float]] = {}
+A_PAIRWISE_PENDING: dict[str, tuple[str, float]] = {}
+print(f"[ann-new] V binary: {sum(len(v) for v in V_BINARY_COMPLETED.values())} annotations on {len(V_BINARY_COMPLETED)} tasks")
+print(f"[ann-new] V pairwise: {sum(len(v) for v in V_PAIRWISE_COMPLETED.values())} on {len(V_PAIRWISE_COMPLETED)} tasks")
+print(f"[ann-new] A pairwise: {sum(len(v) for v in A_PAIRWISE_COMPLETED.values())} on {len(A_PAIRWISE_COMPLETED)} tasks")
 # ---------------------------------------------------------------------------
+# Helpers
 # ---------------------------------------------------------------------------
+def _reap_expired(pending):
     now = time.time()
+    expired = [k for k, (_, ts) in pending.items() if now - ts > PENDING_TIMEOUT_SEC]
     for k in expired:
+        pending.pop(k, None)
+def _append(record: dict, ann_file: Path):
     line = json.dumps(record, ensure_ascii=False)
     if scheduler is not None:
         with scheduler.lock:
         with ann_file.open("a", encoding="utf-8") as f:
             f.write(line + "\n")
+def _format_caption(meta: dict) -> str:
+    """Render caption(_segments) as readable text."""
+    if not meta:
+        return ""
+    if meta.get("caption"):
+        return meta["caption"]
+    segs = meta.get("caption_segments")
+    if segs:
+        return "\n\n".join(f"— 第 {i}/{len(segs)} 段 —\n{s}" for i, s in enumerate(segs, 1))
+    return ""
 # ---------------------------------------------------------------------------
+# V Binary
 # ---------------------------------------------------------------------------
+def v_binary_start(annotator: str, state: dict):
     annotator = (annotator or "").strip()
     if not annotator:
+        return state, "<p>请输入名字</p>", "", "", "⚠️", ""
+    order = list(range(len(V_BINARY_TASKS)))
     random.shuffle(order)
+    n_done = sum(1 for v in V_BINARY_COMPLETED.values()
+                 if annotator in v)
+    state = {"annotator": annotator, "order": order, "idx": 0,
+             "current": None, "count": n_done}
+    return _v_binary_next(state)
+def _v_binary_next(state):
     annotator = state["annotator"]
     order = state["order"]
     idx = state.get("idx", 0)
     with STATE_LOCK:
+        _reap_expired(V_BINARY_PENDING)
         while idx < len(order):
+            task = V_BINARY_TASKS[order[idx]]
+            tid = task["task_id"]
+            if len(V_BINARY_COMPLETED.get(tid, set())) >= V_BINARY_ANNOTATORS_PER_TASK:
+                idx += 1; continue
+            if annotator in V_BINARY_COMPLETED.get(tid, set()):
+                idx += 1; continue
+            if tid in V_BINARY_PENDING and V_BINARY_PENDING[tid][0] != annotator:
+                idx += 1; continue
+            V_BINARY_PENDING[tid] = (annotator, time.time())
             state["idx"] = idx
+            state["current"] = tid
+            model = task["model_id"]
+            subset = task["subset"]
+            sid = task["sample_id"]
+            video_html = _render_video_html(_v_video_proxy_url(model, subset, sid))
+            meta = _load_sample_meta("mbenchv", subset, sid)
+            prompt = _format_caption(meta)
+            info = (f"**模型**: `{model}` | **子集**: `{subset}` | "
+                    f"**sample**: `{sid[:24]}...` | **已提交**: {state['count']}")
+            n_done = sum(1 for v in V_BINARY_COMPLETED.values() if v)
+            stats = f"全局进度: {n_done}/{len(V_BINARY_TASKS)} ({100*n_done/len(V_BINARY_TASKS):.1f}%)"
+            return state, video_html, info, prompt, "✅ 已加载", stats
         state["current"] = None
+        return state, "<p>🎉 全部完成！</p>", "", "", "完成", ""
+def v_binary_submit(state, verdict, note):
     if not state or not state.get("current"):
+        return state, "<p>请先登录</p>", "", "", "⚠️", "", "否", ""
+    tid = state["current"]
+    task = V_BINARY_BY_ID[tid]
     record = {
+        "type": "v_binary",
         "timestamp": time.time(),
         "annotator": state["annotator"],
+        "task_id": tid,
+        "dataset_id": "mbenchv",
+        "model_id": task["model_id"],
+        "subset": task["subset"],
+        "sample_id": task["sample_id"],
+        "condition_id": "text",
+        "item_id": f'{task["subset"]}:{task["sample_id"]}:text',
         "memory_issue": verdict == "是",
         "verdict": verdict,
         "note": (note or "").strip(),
     }
+    _append(record, ANN_FILE_V_BINARY)
     with STATE_LOCK:
+        V_BINARY_PENDING.pop(tid, None)
+        V_BINARY_COMPLETED[tid].add(state["annotator"])
     state["count"] = state.get("count", 0) + 1
     state["idx"] = state["idx"] + 1
     state["current"] = None
+    res = _v_binary_next(state)
+    return res[0], res[1], res[2], res[3], f"✅ 已提交 {state['count']}", res[5], "否", ""
+def v_binary_skip(state):
     if not state or not state.get("current"):
+        return state, "", "", "", "⚠️", "", "否", ""
+    tid = state["current"]
     with STATE_LOCK:
+        V_BINARY_PENDING.pop(tid, None)
     state["idx"] = state["idx"] + 1
     state["current"] = None
+    res = _v_binary_next(state)
+    return res[0], res[1], res[2], res[3], "⏭️ 已跳过", res[5], "否", ""
 # ---------------------------------------------------------------------------
+# V Pairwise
 # ---------------------------------------------------------------------------
+def v_pairwise_start(annotator: str, state: dict):
     annotator = (annotator or "").strip()
     if not annotator:
+        empty = gr.update(visible=False, value="差不多")
+        return (state, "⚠️ 请输入名字", "", "", "", "",
+                empty, empty, empty, empty, empty, "", "")
+    n_done = sum(1 for v in V_PAIRWISE_COMPLETED.values() if annotator in v)
+    order = list(range(len(V_PAIRWISE_TASKS)))
     random.shuffle(order)
+    state = {"annotator": annotator, "order": order, "idx": 0,
+             "current": None, "swapped": False, "count": n_done}
+    return _v_pairwise_next(state)
+def _v_pairwise_next(state):
     annotator = state["annotator"]
     order = state["order"]
     idx = state.get("idx", 0)
     with STATE_LOCK:
+        _reap_expired(V_PAIRWISE_PENDING)
         while idx < len(order):
+            task = V_PAIRWISE_TASKS[order[idx]]
+            tid = task["task_id"]
+            if len(V_PAIRWISE_COMPLETED.get(tid, set())) >= V_PAIRWISE_ANNOTATORS_PER_TASK:
+                idx += 1; continue
+            if annotator in V_PAIRWISE_COMPLETED.get(tid, set()):
+                idx += 1; continue
+            if tid in V_PAIRWISE_PENDING and V_PAIRWISE_PENDING[tid][0] != annotator:
+                idx += 1; continue
+            V_PAIRWISE_PENDING[tid] = (annotator, time.time())
             state["idx"] = idx
+            state["current"] = tid
+            ma, mb = task["model_a"], task["model_b"]
             if random.random() < 0.5:
+                left, right = ma, mb; state["swapped"] = False
             else:
+                left, right = mb, ma; state["swapped"] = True
+            subset = task["subset"]; sid = task["sample_id"]
+            video_l = _render_video_html(_v_video_proxy_url(left, subset, sid))
+            video_r = _render_video_html(_v_video_proxy_url(right, subset, sid))
+            meta = _load_sample_meta("mbenchv", subset, sid)
+            prompt = _format_caption(meta)
+            dim_questions = task["dimension_questions"]
+            dimensions = task["dimensions"]
+            q_updates = []
+            for i in range(5):
+                if i < len(dimensions):
+                    qtext = dim_questions.get(dimensions[i], dimensions[i])
+                    q_updates.append(gr.update(visible=True, label=qtext, value="差不多"))
+                else:
+                    q_updates.append(gr.update(visible=False, value="差不多"))
+            subset_emoji = {"environment": "🏞️", "object": "🎯", "human": "👤", "causal": "⚡"}
+            info = (f"**子集**: {subset_emoji.get(subset, '')} {subset} | "
+                    f"**已提交**: {state['count']}")
+            n_done = sum(1 for v in V_PAIRWISE_COMPLETED.values()
+                         if len(v) >= V_PAIRWISE_ANNOTATORS_PER_TASK)
+            stats = f"全局进度: {n_done}/{len(V_PAIRWISE_TASKS)} 任务完成"
+            return (state, "✅ 已加载", video_l, video_r, info, prompt,
+                    *q_updates, "", stats)
         state["current"] = None
+        empty = gr.update(visible=False, value="差不多")
+        return (state, "🎉 全部完成", "", "", "全部完成", "",
+                empty, empty, empty, empty, empty, "", "")
+def v_pairwise_submit(state, q1, q2, q3, q4, q5, note):
     if not state or not state.get("current"):
+        empty = gr.update(visible=False, value="差不多")
+        return (state, "⚠️ 请先登录", "", "", "", "",
+                empty, empty, empty, empty, empty, "", "")
+    tid = state["current"]
+    task = V_PAIRWISE_BY_ID[tid]
+    swapped = state["swapped"]
+    ma, mb = task["model_a"], task["model_b"]
+    verdicts = [q1, q2, q3, q4, q5]
+    dim_results = {}
+    for i, dim in enumerate(task["dimensions"]):
+        v = verdicts[i]
+        if v == "A更好":
+            winner = mb if swapped else ma
+        elif v == "B更好":
+            winner = ma if swapped else mb
+        else:
+            winner = "tie"
+        dim_results[dim] = winner
     record = {
+        "type": "v_pairwise",
         "timestamp": time.time(),
         "annotator": state["annotator"],
         "task_id": tid,
+        "dataset_id": "mbenchv",
+        "subset": task["subset"],
+        "sample_id": task["sample_id"],
+        "condition_id": "text",
+        "model_a": ma,
+        "model_b": mb,
+        "item_a": f'{task["subset"]}:{task["sample_id"]}:text|{ma}',
+        "item_b": f'{task["subset"]}:{task["sample_id"]}:text|{mb}',
+        "dimensions": dim_results,
         "swapped": swapped,
         "note": (note or "").strip(),
     }
+    _append(record, ANN_FILE_V_PAIRWISE)
     with STATE_LOCK:
+        V_PAIRWISE_PENDING.pop(tid, None)
+        V_PAIRWISE_COMPLETED[tid].add(state["annotator"])
     state["count"] = state.get("count", 0) + 1
     state["idx"] = state["idx"] + 1
     state["current"] = None
+    return _v_pairwise_next(state)
+def v_pairwise_skip(state):
     if not state or not state.get("current"):
+        empty = gr.update(visible=False, value="差不多")
+        return (state, "⚠️ 请先登录", "", "", "", "",
+                empty, empty, empty, empty, empty, "", "")
+    tid = state["current"]
     with STATE_LOCK:
+        V_PAIRWISE_PENDING.pop(tid, None)
     state["idx"] = state["idx"] + 1
     state["current"] = None
+    return _v_pairwise_next(state)
 # ---------------------------------------------------------------------------
+# A Pairwise (adapted from old app, with new paths)
 # ---------------------------------------------------------------------------
+def _render_a_aux(task: dict) -> str:
+    subset = task["subset"]
+    box = 'class="aux-info-box"'
+    motion = task.get("camera_motion", "left_then_right")
+    motion_desc = task.get("camera_motion_description", motion)
+    gif_url = _a_asset_hf_url(f"camera_diagrams/{motion}.gif")
+    camera_html = (
+        f'<div style="flex:0 0 200px">'
+        f'<p><b>🎬 预期相机运动</b></p>'
+        f'<p style="margin:0 0 8px">{motion_desc}</p>'
+        f'<img src="{gif_url}" style="width:180px">'
+        f'</div>'
+    )
+    caption = task.get("caption", "")
+    caption_html = (
+        f'<div style="flex:1;min-width:250px">'
+        f'<p><b>📝 场景描述</b></p>'
+        f'<p style="font-size:14px;line-height:1.5">{caption}</p>'
+        f'</div>'
+    ) if caption else ""
+    if subset == "object":
+        sample_id = task["sample_id"]
+        # Mask visualizations are still served from the old MBench-A/assets/mask_viz directory.
+        mask_url = _a_asset_hf_url(f"mask_viz/{sample_id}.png")
+        return (
+            f'<div {box}>'
+            f'<p><b>🎯 请关注画面中被标注（高亮）的物体</b></p>'
+            f'<div style="display:flex;gap:16px;flex-wrap:wrap;align-items:flex-start;margin-top:8px">'
+            f'<div style="flex:1;min-width:300px">'
+            f'<img src="{mask_url}" style="max-width:100%;max-height:280px"></div>'
+            f'{camera_html}{caption_html}</div></div>'
+        )
+    elif subset == "human":
+        return (
+            f'<div {box}>'
+            f'<p><b>👤 请关注视频中的人物</b>：观察人物离开画面再回来后，面部和外观是否保持一致。</p>'
+            f'<div style="display:flex;gap:16px;flex-wrap:wrap;align-items:flex-start;margin-top:8px">'
+            f'{camera_html}{caption_html}</div></div>'
+        )
+    elif subset == "causal":
+        return (
+            f'<div {box}>'
+            f'<div style="display:flex;gap:16px;flex-wrap:wrap;align-items:flex-start">'
+            f'{camera_html}{caption_html}</div></div>'
+        )
+    else:  # environment
+        return (
+            f'<div {box}>'
+            f'<p><b>🏞️ 请关注整体场景</b>：观察相机转回来后，场景的布局/风格/光照是否保持一致。</p>'
+            f'<div style="display:flex;gap:16px;flex-wrap:wrap;align-items:flex-start;margin-top:8px">'
+            f'{camera_html}{caption_html}</div></div>'
+        )
+def a_start(annotator: str, state: dict):
     annotator = (annotator or "").strip()
     if not annotator:
+        empty = gr.update(visible=False, value="差不多")
         return (state, "⚠️ 请输入名字", "", "", "", "",
+                empty, empty, empty, empty, empty, empty, "", "")
+    n_done = sum(1 for v in A_PAIRWISE_COMPLETED.values() if annotator in v)
+    order = list(range(len(A_PAIRWISE_TASKS)))
     random.shuffle(order)
+    state = {"annotator": annotator, "order": order, "idx": 0,
+             "current": None, "swapped": False, "count": n_done}
+    return _a_next(state)
+def _a_next(state):
     annotator = state["annotator"]
     order = state["order"]
     idx = state.get("idx", 0)
     with STATE_LOCK:
+        _reap_expired(A_PAIRWISE_PENDING)
         while idx < len(order):
+            task = A_PAIRWISE_TASKS[order[idx]]
             tid = task["task_id"]
+            if len(A_PAIRWISE_COMPLETED.get(tid, set())) >= A_PAIRWISE_ANNOTATORS_PER_TASK:
+                idx += 1; continue
+            if annotator in A_PAIRWISE_COMPLETED.get(tid, set()):
+                idx += 1; continue
+            if tid in A_PAIRWISE_PENDING and A_PAIRWISE_PENDING[tid][0] != annotator:
+                idx += 1; continue
+            A_PAIRWISE_PENDING[tid] = (annotator, time.time())
             state["idx"] = idx
+            state["current"] = tid
+            ma, mb = task["model_a"], task["model_b"]
             if random.random() < 0.5:
+                left, right = ma, mb; state["swapped"] = False
             else:
+                left, right = mb, ma; state["swapped"] = True
+            subset = task["subset"]; sid = task["sample_id"]
+            motion = task.get("camera_motion", "left_then_right")
+            cond = f"{motion}_25s"
+            video_l = _render_video_html(_a_video_proxy_url(left, subset, sid, cond))
+            video_r = _render_video_html(_a_video_proxy_url(right, subset, sid, cond))
+            aux = _render_a_aux(task)
             dimensions = task["dimensions"]
+            dim_q = task.get("dimension_questions", {})
             q_updates = []
             for i in range(6):
                 if i < len(dimensions):
+                    qtext = dim_q.get(dimensions[i], dimensions[i])
+                    q_updates.append(gr.update(visible=True, label=qtext, value="差不多"))
                 else:
                     q_updates.append(gr.update(visible=False, value="差不多"))
+            subset_emoji = {"environment": "🏞️", "object": "🎯", "human": "👤", "causal": "⚡"}
+            info = f"**子集**: {subset_emoji.get(subset, '')} {subset} | **已提交**: {state['count']}"
+            n_done = sum(1 for v in A_PAIRWISE_COMPLETED.values()
+                         if len(v) >= A_PAIRWISE_ANNOTATORS_PER_TASK)
+            stats = f"全局进度: {n_done}/{len(A_PAIRWISE_TASKS)} 任务完成"
+            return (state, "✅ 已加载", aux, video_l, video_r, info,
                     *q_updates, "", stats)
+        state["current"] = None
+        empty = gr.update(visible=False, value="差不多")
+        return (state, "🎉 全部完成", "", "", "", "全部完成",
+                empty, empty, empty, empty, empty, empty, "", "")
+def a_submit(state, q1, q2, q3, q4, q5, q6, note):
+    if not state or not state.get("current"):
+        empty = gr.update(visible=False, value="差不多")
         return (state, "⚠️ 请先登录", "", "", "", "",
+                empty, empty, empty, empty, empty, empty, "", "")
+    tid = state["current"]
+    task = A_PAIRWISE_BY_ID[tid]
     swapped = state["swapped"]
+    ma, mb = task["model_a"], task["model_b"]
+    verdicts = [q1, q2, q3, q4, q5, q6]
     dim_results = {}
+    for i, dim in enumerate(task["dimensions"]):
         v = verdicts[i]
         if v == "A更好":
+            winner = mb if swapped else ma
         elif v == "B更好":
+            winner = ma if swapped else mb
         else:
             winner = "tie"
+        dim_results[dim] = winner
+    motion = task.get("camera_motion", "left_then_right")
+    cond = f"{motion}_25s"
     record = {
+        "type": "a_pairwise",
         "timestamp": time.time(),
         "annotator": state["annotator"],
         "task_id": tid,
+        "dataset_id": "mbencha",
         "subset": task["subset"],
         "sample_id": task["sample_id"],
+        "condition_id": cond,
+        "model_a": ma,
+        "model_b": mb,
+        "item_a": f'{task["subset"]}:{task["sample_id"]}:{cond}|{ma}',
+        "item_b": f'{task["subset"]}:{task["sample_id"]}:{cond}|{mb}',
+        "camera_motion": motion,
         "dimensions": dim_results,
         "swapped": swapped,
         "note": (note or "").strip(),
     }
+    _append(record, ANN_FILE_A_PAIRWISE)
     with STATE_LOCK:
+        A_PAIRWISE_PENDING.pop(tid, None)
+        A_PAIRWISE_COMPLETED[tid].add(state["annotator"])
     state["count"] = state.get("count", 0) + 1
     state["idx"] = state["idx"] + 1
+    state["current"] = None
+    return _a_next(state)
+def a_skip(state):
+    if not state or not state.get("current"):
+        empty = gr.update(visible=False, value="差不多")
         return (state, "⚠️ 请先登录", "", "", "", "",
+                empty, empty, empty, empty, empty, empty, "", "")
+    tid = state["current"]
     with STATE_LOCK:
+        A_PAIRWISE_PENDING.pop(tid, None)
     state["idx"] = state["idx"] + 1
+    state["current"] = None
+    return _a_next(state)
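
Reviewer's sketch: the winner-resolution loop in `a_submit` could be pulled out into a pure function so it can be unit tested without Gradio. The helper name `resolve_winners` is hypothetical and not part of this commit. The verdict strings come from the diff above, and the sketch assumes that `swapped=True` means the video shown in the "视频 A" slot came from `model_b`.

```python
def resolve_winners(dimensions, verdicts, model_a, model_b, swapped):
    """Map per-dimension UI verdicts back to model names.

    Hypothetical helper (not in the commit). Assumes that when `swapped`
    is True, the video displayed as "A" was generated by model_b.
    """
    shown_a, shown_b = (model_b, model_a) if swapped else (model_a, model_b)
    results = {}
    # zip stops at the shorter sequence, so values from hidden radios are ignored.
    for dim, verdict in zip(dimensions, verdicts):
        if verdict == "A更好":      # "A is better"
            results[dim] = shown_a
        elif verdict == "B更好":    # "B is better"
            results[dim] = shown_b
        else:                       # "差不多" ("about the same") or unset
            results[dim] = "tie"
    return results
```

Using `zip` instead of indexing `verdicts[i]` also avoids an `IndexError` if a task ever lists more dimensions than there are radio inputs.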
 # ---------------------------------------------------------------------------
 # UI
 CUSTOM_CSS = """
 #prompt_box textarea { height: 300px !important; overflow-y: auto !important; }
 .aux-info-box {
+    background: #e3e8ef !important; color: #111 !important;
+    padding: 14px !important; border-radius: 8px !important;
+    margin-bottom: 12px !important; border: 1px solid #b0b8c4 !important;
 }
+.aux-info-box * { color: #111 !important; }
+.aux-info-box img { border: 1px solid #999; border-radius: 4px; }
 """
+with gr.Blocks(title="MBench 标注 (NEW)", theme=gr.themes.Soft(), css=CUSTOM_CSS) as demo:
+    gr.Markdown("# 🎬 MBench 视频标注平台 (新结构)")
     with gr.Tabs():
+        # ───── V Binary ─────
+        with gr.Tab("MBench-V Binary"):
+            gr.Markdown("## 📺 MBench-V — 单视频记忆问题判断\n\n"
+                        "请观看视频并阅读 prompt，判断是否出现了**记忆问题**（场景/物体/人物前后不一致）。")
+            vb_stats = gr.Markdown("")
+            vb_state = gr.State({})
+            with gr.Row():
+                vb_name = gr.Textbox(label="标注员名字", placeholder="例如: charlie", scale=4)
+                vb_login = gr.Button("开始标注", variant="primary", scale=1)
+            vb_status = gr.Markdown("")
+            vb_video = gr.HTML("<p>请先登录</p>")
+            vb_info = gr.Markdown("")
+            vb_prompt = gr.Textbox(label="Prompt / 文本描述", lines=10, elem_id="prompt_box")
+            vb_verdict = gr.Radio(["是", "否"], value="否", label="是否出现了记忆问题？")
+            vb_note = gr.Textbox(label="备注（可选）", lines=1)
+            with gr.Row():
+                vb_submit = gr.Button("✅ 提交并下一组", variant="primary")
+                vb_skip = gr.Button("⏭️ 跳过")
+            vb_outs = [vb_state, vb_video, vb_info, vb_prompt, vb_status, vb_stats, vb_verdict, vb_note]
+            vb_login.click(v_binary_start, [vb_name, vb_state],
+                           [vb_state, vb_video, vb_info, vb_prompt, vb_status, vb_stats])
+            vb_name.submit(v_binary_start, [vb_name, vb_state],
+                           [vb_state, vb_video, vb_info, vb_prompt, vb_status, vb_stats])
+            vb_submit.click(v_binary_submit, [vb_state, vb_verdict, vb_note], vb_outs)
+            vb_skip.click(v_binary_skip, [vb_state], vb_outs)
+        # ───── V Pairwise ─────
+        with gr.Tab("MBench-V Pairwise"):
+            gr.Markdown("## 🎬 MBench-V — 双视频对比 (5 维度)\n\n"
+                        "比较两个 T2V 模型生成的视频，从 5 个维度独立判断哪个更好。")
+            vp_stats = gr.Markdown("")
+            vp_state = gr.State({})
+            with gr.Row():
+                vp_name = gr.Textbox(label="标注员名字", scale=4)
+                vp_login = gr.Button("开始标注", variant="primary", scale=1)
+            vp_status = gr.Markdown("")
+            with gr.Row(equal_height=True):
+                with gr.Column(scale=1, min_width=360):
+                    gr.Markdown("### 视频 A")
+                    vp_video_l = gr.HTML("<p>请先登录</p>")
+                with gr.Column(scale=1, min_width=360):
+                    gr.Markdown("### 视频 B")
+                    vp_video_r = gr.HTML("<p>请先登录</p>")
+            vp_info = gr.Markdown("")
+            vp_prompt = gr.Textbox(label="Prompt / 文本描述", lines=8, elem_id="prompt_box")
+            gr.Markdown("---\n### 请对以下每个维度分别判断：")
+            vp_q1 = gr.Radio(["A更好", "差不多", "B更好"], value="差不多", label="维度 1", visible=False)
+            vp_q2 = gr.Radio(["A更好", "差不多", "B更好"], value="差不多", label="维度 2", visible=False)
+            vp_q3 = gr.Radio(["A更好", "差不多", "B更好"], value="差不多", label="维度 3", visible=False)
+            vp_q4 = gr.Radio(["A更好", "差不多", "B更好"], value="差不多", label="维度 4", visible=False)
+            vp_q5 = gr.Radio(["A更好", "差不多", "B更好"], value="差不多", label="维度 5", visible=False)
+            vp_note = gr.Textbox(label="备注（可选）", lines=1)
+            with gr.Row():
+                vp_submit = gr.Button("✅ 提交并下一组", variant="primary")
+                vp_skip = gr.Button("⏭️ 跳过")
+            vp_outs = [vp_state, vp_status, vp_video_l, vp_video_r, vp_info, vp_prompt,
+                       vp_q1, vp_q2, vp_q3, vp_q4, vp_q5, vp_note, vp_stats]
+            vp_login.click(v_pairwise_start, [vp_name, vp_state], vp_outs)
+            vp_name.submit(v_pairwise_start, [vp_name, vp_state], vp_outs)
+            vp_submit.click(v_pairwise_submit,
+                            [vp_state, vp_q1, vp_q2, vp_q3, vp_q4, vp_q5, vp_note], vp_outs)
+            vp_skip.click(v_pairwise_skip, [vp_state], vp_outs)
+        # ───── A Pairwise ─────
+        with gr.Tab("MBench-A Pairwise"):
+            gr.Markdown("## 🌍 MBench-A — 世界模型双视频对比 (≤6 维度)\n\n"
+                        "比较两个世界模型的长视频（25 秒），评估相机运动结束后的记忆一致性。")
             a_stats = gr.Markdown("")
             a_state = gr.State({})
             with gr.Row():
+                a_name = gr.Textbox(label="标注员名字", scale=4)
                 a_login = gr.Button("开始标注", variant="primary", scale=1)
             a_status = gr.Markdown("")
             a_aux = gr.HTML("")
             with gr.Row(equal_height=True):
                 with gr.Column(scale=1, min_width=360):
                     gr.Markdown("### 视频 A")
+                    a_video_l = gr.HTML("<p>请先登录</p>")
                 with gr.Column(scale=1, min_width=360):
                     gr.Markdown("### 视频 B")
+                    a_video_r = gr.HTML("<p>请先登录</p>")
+            a_info = gr.Markdown("")
             gr.Markdown("---\n### 请对以下每个维度分别判断：")
             a_q1 = gr.Radio(["A更好", "差不多", "B更好"], value="差不多", label="维度 1", visible=False)
             a_q2 = gr.Radio(["A更好", "差不多", "B更好"], value="差不多", label="维度 2", visible=False)
             a_q3 = gr.Radio(["A更好", "差不多", "B更好"], value="差不多", label="维度 3", visible=False)
             a_q4 = gr.Radio(["A更好", "差不多", "B更好"], value="差不多", label="维度 4", visible=False)
             a_q5 = gr.Radio(["A更好", "差不多", "B更好"], value="差不多", label="维度 5", visible=False)
             a_q6 = gr.Radio(["A更好", "差不多", "B更好"], value="差不多", label="维度 6", visible=False)
             a_note = gr.Textbox(label="备注（可选）", lines=1)
             with gr.Row():
+                # The _btn suffix keeps these buttons from shadowing the a_submit / a_skip handlers defined above.
+                a_submit_btn = gr.Button("✅ 提交并下一组", variant="primary")
+                a_skip_btn = gr.Button("⏭️ 跳过")
+            a_outs = [a_state, a_status, a_aux, a_video_l, a_video_r, a_info,
+                      a_q1, a_q2, a_q3, a_q4, a_q5, a_q6, a_note, a_stats]
+            a_login.click(a_start, [a_name, a_state], a_outs)
+            a_name.submit(a_start, [a_name, a_state], a_outs)
+            a_submit_btn.click(a_submit,
+                               [a_state, a_q1, a_q2, a_q3, a_q4, a_q5, a_q6, a_note], a_outs)
+            a_skip_btn.click(a_skip, [a_state], a_outs)
 # ---------------------------------------------------------------------------
 # Video proxy
     _video_client = httpx.AsyncClient(timeout=30.0, follow_redirects=True)
     async def _do_proxy(upstream: str, request: Request):
         req_headers = {}
         if (rng := request.headers.get("range")):
             req_headers["range"] = rng
             )
         except Exception as e:
             raise HTTPException(502, f"upstream fetch failed: {e}")
+        passthrough = {}
         for h in ("content-type", "content-length", "accept-ranges",
                   "content-range", "etag", "last-modified"):
             if h in upstream_resp.headers:
+                passthrough[h] = upstream_resp.headers[h]
+        passthrough.setdefault("content-type", "video/mp4")
+        passthrough["cache-control"] = "public, max-age=300"
         async def _body():
             try:
                     yield chunk
             finally:
                 await upstream_resp.aclose()
+        return StreamingResponse(_body(), status_code=upstream_resp.status_code, headers=passthrough)
+    async def _proxy_v_video(model: str, subset: str, sample_id: str, request: Request):
+        sid = sample_id.replace(".mp4", "")
+        if model not in V_MODELS:
+            raise HTTPException(404, f"unknown V model: {model}")
+        upstream = _v_video_hf_url(model, subset, sid)
         return await _do_proxy(upstream, request)
+    async def _proxy_a_video(model: str, subset: str, sample_id: str, condition_id: str, request: Request):
+        cond = condition_id.replace(".mp4", "")
+        if model not in A_MODELS:
+            raise HTTPException(404, f"unknown A model: {model}")
+        upstream = _a_video_hf_url(model, subset, sample_id, cond)
         return await _do_proxy(upstream, request)
+    _orig = _GradioApp.create_app
+    def _patched(*args, **kwargs):
+        app = _orig(*args, **kwargs)
+        app.add_api_route("/video_v/{model}/{subset}/{sample_id}",
+                          _proxy_v_video, methods=["GET", "HEAD"], include_in_schema=False)
+        app.add_api_route("/video_a/{model}/{subset}/{sample_id}/{condition_id}",
+                          _proxy_a_video, methods=["GET", "HEAD"], include_in_schema=False)
+        print("[ann-new] video proxy routes registered")
         return app
+    _GradioApp.create_app = staticmethod(_patched)
     demo.queue(default_concurrency_limit=16).launch(ssr_mode=False)
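
Reviewer's sketch: the header filtering inside `_do_proxy` can be expressed as a small standalone function, which makes the Range-related passthrough easy to test. The names `PASSTHROUGH_HEADERS` and `build_passthrough` are hypothetical and not part of this commit. The header list, the `video/mp4` fallback, and the `cache-control` value are copied from the diff above. The sketch takes a plain `dict`, so it lowercases keys itself.

```python
PASSTHROUGH_HEADERS = ("content-type", "content-length", "accept-ranges",
                       "content-range", "etag", "last-modified")


def build_passthrough(upstream_headers):
    """Select the response headers a video proxy forwards to the browser.

    Hypothetical helper mirroring the filtering in `_do_proxy`.
    `upstream_headers` is assumed to be a plain mapping of str to str.
    """
    # Normalise key case so "Content-Range" and "content-range" both match.
    lowered = {k.lower(): v for k, v in upstream_headers.items()}
    out = {h: lowered[h] for h in PASSTHROUGH_HEADERS if h in lowered}
    # Fall back to MP4 when upstream omits a content type.
    out.setdefault("content-type", "video/mp4")
    # Always override caching, as the proxy in the diff does.
    out["cache-control"] = "public, max-age=300"
    return out
```

Forwarding `content-range` and `accept-ranges` together with the upstream status code (206 for partial content) is what lets the browser's video element seek through the proxied file.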

sampling/new_task_pools.json ADDED

The diff for this file is too large to render.