Upload folder using huggingface_hub
- .gitattributes +27 -0
- docs/assets/att.png +3 -0
- docs/assets/benchmarks/generation.webp +3 -0
- docs/assets/benchmarks/interleaved.webp +0 -0
- docs/assets/benchmarks/understanding.webp +3 -0
- docs/assets/lightllm_x2v.png +3 -0
- docs/assets/showcases/t2i_general/16_9_dense_face_hd_07.webp +3 -0
- docs/assets/showcases/t2i_general/16_9_dense_text_rendering_12.webp +3 -0
- docs/assets/showcases/t2i_general/16_9_dense_text_rendering_18.webp +0 -0
- docs/assets/showcases/t2i_general/1_1_artistic_02.webp +3 -0
- docs/assets/showcases/t2i_general/1_1_dense_artistic_09.webp +0 -0
- docs/assets/showcases/t2i_general/1_1_dense_artistic_10.webp +3 -0
- docs/assets/showcases/t2i_general/1_1_dense_artistic_19.webp +3 -0
- docs/assets/showcases/t2i_general/1_1_dense_landscape_12.webp +3 -0
- docs/assets/showcases/t2i_general/1_1_face_hd_13.webp +3 -0
- docs/assets/showcases/t2i_general/1_1_face_hd_17.webp +3 -0
- docs/assets/showcases/t2i_general/1_1_landscape_06.webp +3 -0
- docs/assets/showcases/t2i_general/1_1_landscape_07.webp +3 -0
- docs/assets/showcases/t2i_general/9_16_artistic_07.webp +3 -0
- docs/assets/showcases/t2i_general/9_16_dense_artistic_11.webp +3 -0
- docs/assets/showcases/t2i_general/9_16_dense_face_hd_10.webp +3 -0
- docs/assets/showcases/t2i_general/9_16_dense_landscape_05.webp +3 -0
- docs/assets/showcases/t2i_general/9_16_human_pose_11.webp +3 -0
- docs/assets/showcases/t2i_general/9_16_sensenova_u1_31.webp +0 -0
- docs/assets/showcases/t2i_general/9_16_text_rendering_02.webp +3 -0
- docs/assets/showcases/t2i_reasoning/1_reasoning.png +3 -0
- docs/assets/showcases/t2i_reasoning/2_reasoning.png +3 -0
- docs/assets/showcases/t2i_reasoning/3_reasoning.png +3 -0
- docs/assets/showcases/t2i_reasoning/4_reasoning.png +3 -0
- docs/assets/showcases/t2i_reasoning/5_reasoning.png +3 -0
- docs/assets/showcases/t2i_reasoning/6_reasoning.png +3 -0
- docs/assets/showcases/t2i_reasoning/7_reasoning.png +3 -0
- docs/assets/teaser.png +2 -2
- docs/deployment.md +143 -0
- docs/inference_infrastructure.md +81 -2
- docs/showcases.md +142 -8
.gitattributes
CHANGED

@@ -70,3 +70,30 @@ docs/assets/showcases/t2i_infographic/0009_1536x2720.webp filter=lfs diff=lfs merge=lfs -text
 docs/assets/showcases/vqa/agentic_case.webp filter=lfs diff=lfs merge=lfs -text
 docs/assets/showcases/vqa/general_case.webp filter=lfs diff=lfs merge=lfs -text
 docs/assets/teaser.png filter=lfs diff=lfs merge=lfs -text
+docs/assets/att.png filter=lfs diff=lfs merge=lfs -text
+docs/assets/benchmarks/generation.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/benchmarks/understanding.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/lightllm_x2v.png filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/16_9_dense_face_hd_07.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/16_9_dense_text_rendering_12.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/1_1_artistic_02.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/1_1_dense_artistic_10.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/1_1_dense_artistic_19.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/1_1_dense_landscape_12.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/1_1_face_hd_13.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/1_1_face_hd_17.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/1_1_landscape_06.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/1_1_landscape_07.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/9_16_artistic_07.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/9_16_dense_artistic_11.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/9_16_dense_face_hd_10.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/9_16_dense_landscape_05.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/9_16_human_pose_11.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/9_16_text_rendering_02.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_reasoning/1_reasoning.png filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_reasoning/2_reasoning.png filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_reasoning/3_reasoning.png filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_reasoning/4_reasoning.png filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_reasoning/5_reasoning.png filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_reasoning/6_reasoning.png filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_reasoning/7_reasoning.png filter=lfs diff=lfs merge=lfs -text
docs/deployment.md
ADDED
|
@@ -0,0 +1,143 @@
# LightLLM + LightX2V Deployment

<p align="center">
<a href="../README.md">← Back to main README</a>
</p>

This guide provides a practical deployment flow for serving SenseNova-U1 with
LightLLM + LightX2V using the Docker image
`lightx2v/lightllm_lightx2v:20260407`.

## 1) Pull and enter the Docker image

```bash
docker pull lightx2v/lightllm_lightx2v:20260407
docker run --gpus all --ipc=host --network host -it lightx2v/lightllm_lightx2v:20260407 /bin/bash
```

## 2) Clone runtime dependencies inside the container

The image may not include the latest source trees. Clone both repositories and
pin LightLLM to the validated branch:

```bash
git clone https://github.com/ModelTC/LightX2V.git
git clone https://github.com/ModelTC/LightLLM.git
cd LightLLM
git checkout neo_plus_clean
```

## 3) X2I-related arguments

When enabling image generation in the same API server, use the following flags:

- `--enable_multimodal_x2i`
  Enable image generation capability.
- `--x2i_server_used_gpus`
  Number of GPUs reserved for the X2I generation server.
- `--x2i_server_deploy_mode {colocate,separate}`
  - `colocate`: understanding and generation share the same visible GPU pool.
  - `separate`: understanding and generation are deployed as separate services, and
    can use different GPU sets.
- `--x2i_use_naive_impl`
  Use the native/naive PyTorch backend for X2I (debugging/testing only, not for
  production throughput).
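The two deploy modes imply different total GPU budgets. The sketch below is a hypothetical helper illustrating that arithmetic, not LightLLM's actual allocator; the `colocate` pool-sharing rule is an assumption inferred from the Mode A example in this guide (2 GPUs total with `--tp 2` and `--x2i_server_used_gpus 2`).

```python
def required_gpus(deploy_mode: str, tp: int, x2i_gpus: int) -> int:
    # Hypothetical resource math (an assumption, not LightLLM's allocator):
    # "colocate" shares one visible GPU pool between understanding (tp)
    # and generation (x2i), while "separate" uses disjoint GPU sets.
    if deploy_mode == "colocate":
        return max(tp, x2i_gpus)
    if deploy_mode == "separate":
        return tp + x2i_gpus
    raise ValueError(f"unknown deploy mode: {deploy_mode}")

print(required_gpus("colocate", tp=2, x2i_gpus=2))  # shared pool of 2 GPUs
print(required_gpus("separate", tp=2, x2i_gpus=1))  # 3 GPUs in total
```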
+
## 4) Deployment modes
|
| 47 |
+
|
| 48 |
+
### Mode A: `colocate` (single service, shared GPU pool)
|
| 49 |
+
|
| 50 |
+
Use this mode for quick validation and simpler operations. The LLM understanding
|
| 51 |
+
path (`--tp`) and X2I generation path (`--x2i_server_used_gpus`) consume resources
|
| 52 |
+
from the same visible GPUs.
|
| 53 |
+
|
| 54 |
+
Example (2 GPUs total):
|
| 55 |
+
- understanding path: `tp=2`
|
| 56 |
+
- generation path: `cfg=2` (configured in `neopp_dense_parallel_cfg.json`)
|
| 57 |
+
|
| 58 |
+
```bash
|
| 59 |
+
PYTHONPATH=/workspace/LightX2V/ \
|
| 60 |
+
python -m lightllm.server.api_server \
|
| 61 |
+
--model_dir $MODEL_DIR \
|
| 62 |
+
--enable_multimodal_x2i \
|
| 63 |
+
--x2i_server_deploy_mode colocate \
|
| 64 |
+
--x2i_server_used_gpus 2 \
|
| 65 |
+
--x2v_gen_model_config /workspace/LightX2V/configs/neopp/neopp_dense_parallel_cfg.json \
|
| 66 |
+
--host 0.0.0.0 \
|
| 67 |
+
--port 8000 \
|
| 68 |
+
--max_req_total_len 65536 \
|
| 69 |
+
--mem_fraction 0.75 \
|
| 70 |
+
--tp 2
|
| 71 |
+
```
|
| 72 |
+
|
| 73 |
+
### Mode B: `separate` (understanding and generation decoupled)
|
| 74 |
+
|
| 75 |
+
`separate` is conceptually similar to PD-style decoupling in LLM serving: split
|
| 76 |
+
different stages onto different GPU groups so a long stage does not block the
|
| 77 |
+
short stage.
|
| 78 |
+
|
| 79 |
+
For multimodal serving, image generation is usually the long stage, while
|
| 80 |
+
understanding is short and lightweight. Separating them allows understanding
|
| 81 |
+
requests to keep flowing even when generation workers are busy.
|
| 82 |
+
|
| 83 |
+
Recommended deployment profiles:
|
| 84 |
+
|
| 85 |
+
1. **Default profile (continuity-first): Understanding `tp=1` + Generation 1 GPU**
|
| 86 |
+
- Understanding: `--tp 1`
|
| 87 |
+
- Generation: `--x2i_server_used_gpus 1`
|
| 88 |
+
- Use as the baseline profile for mixed workloads. It keeps the pipeline simple
|
| 89 |
+
while avoiding head-of-line blocking between understanding and generation.
|
| 90 |
+
|
| 91 |
+
2. **Understanding-expanded profile: Understanding `tp=2` + Generation 1 GPU**
|
| 92 |
+
- Understanding: `--tp 2`
|
| 93 |
+
- Generation: `--x2i_server_used_gpus 1`
|
| 94 |
+
- Use when complex prompts or high understanding QPS become the bottleneck.
|
| 95 |
+
|
| 96 |
+
3. **Generation-expanded profile: Understanding `tp=1/2` + Generation parallel**
|
| 97 |
+
- Understanding: `--tp 1` or `--tp 2`
|
| 98 |
+
- Generation option A (2 GPUs): `--x2i_server_used_gpus 2` +
|
| 99 |
+
`/workspace/LightX2V/configs/neopp/neopp_dense_parallel_cfg.json`
|
| 100 |
+
- Generation option B (4 GPUs): `--x2i_server_used_gpus 4` +
|
| 101 |
+
`/workspace/LightX2V/configs/neopp/neopp_dense_parallel_cfg_seq.json`
|
| 102 |
+
- Use when generation latency/throughput dominates (most common scaling path).
|
| 103 |
+
|
| 104 |
+
Example launch (separate mode in API server):
|
| 105 |
+
|
| 106 |
+
```bash
|
| 107 |
+
PYTHONPATH=/workspace/LightX2V/ \
|
| 108 |
+
python -m lightllm.server.api_server \
|
| 109 |
+
--model_dir $MODEL_DIR \
|
| 110 |
+
--enable_multimodal_x2i \
|
| 111 |
+
--x2i_server_deploy_mode separate \
|
| 112 |
+
--x2i_server_used_gpus 1 \
|
| 113 |
+
--x2v_gen_model_config /workspace/LightX2V/configs/neopp/neopp_dense.json \
|
| 114 |
+
--host 0.0.0.0 \
|
| 115 |
+
--port 8000 \
|
| 116 |
+
--max_req_total_len 65536 \
|
| 117 |
+
--mem_fraction 0.75 \
|
| 118 |
+
--tp 2
|
| 119 |
+
```
|
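Once the server is up, a quick smoke test can be sent from Python. This is an illustrative sketch only: the `/generate` endpoint and the `inputs`/`parameters` payload fields follow LightLLM's commonly documented HTTP schema, but they are assumptions here and may differ across LightLLM versions.

```python
import json
import urllib.request

def build_generate_request(prompt: str, max_new_tokens: int = 512) -> dict:
    # Assumed payload shape for LightLLM's /generate endpoint; verify the
    # exact field names against the LightLLM version in your image.
    return {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "do_sample": False},
    }

def send(payload: dict, host: str = "127.0.0.1", port: int = 8000) -> str:
    # Requires the API server launched above to be running and reachable.
    req = urllib.request.Request(
        f"http://{host}:{port}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

payload = build_generate_request("Describe this deployment in one sentence.", 64)
print(json.dumps(payload))
```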
## 5) Quantization

`separate` mode also enables independent quantization strategies for
understanding and generation.

Because understanding and generation are decoupled, you can tune quality/latency
for each path independently:

1. **Understanding FP16/BF16 + generation FP8**
   - Understanding: no quantization flag (keep default precision)
   - Generation: use an FP8 generation config, for example
     `/workspace/LightX2V/configs/neopp/neopp_dense_fp8.json`
   - Recommended as the default quantized profile for production.

2. **Understanding FP8 + generation FP8**
   - Understanding: add `--quant_type fp8w8a8`
   - Generation: use the FP8 generation config
     `/workspace/LightX2V/configs/neopp/neopp_dense_fp8.json`
   - Use when GPU memory/throughput is the primary constraint.

Notes:
- `--quant_type fp8w8a8` controls quantization on the understanding path.
- Generation-side precision is controlled by `--x2v_gen_model_config`.
docs/inference_infrastructure.md
CHANGED
|
@@ -6,7 +6,86 @@
This document describes the inference infrastructure behind **SenseNova-U1**, built on top of **[LightLLM](https://github.com/ModelTC/lightllm)** and **[LightX2V](https://github.com/ModelTC/lightx2v)**.

## Overview

SenseNova-U1 is exposed as one unified multimodal model, but the understanding and generation paths exhibit different execution shapes in production. They tend to prefer different scheduling policies, parallelization strategies, and resource ratios, rather than a single shared serving configuration. When both are coupled inside one monolithic runtime, these choices become unnecessarily tied together, which can leave both paths operating away from their respective optimal points.

To avoid this coupling, SenseNova-U1 adopts a **disaggregated** architecture:

- **LightLLM** for understanding, text streaming, and control flow
- **LightX2V** for image generation

These two engines exchange generation state through pinned shared memory and high-performance transfer kernels. The handoff is lightweight, while each side can still run with its own optimal execution policy.

![](./assets/lightllm_x2v.png)

This design provides practical benefits in production:

- Independent parallelism (for example, understanding with `TP=2`, generation with `CFG=2` or `SP=2`).
- Independent resource allocation (different GPU counts and memory budgets).
- Independent scaling for text-heavy vs. image-heavy traffic.
- Better operational isolation and simpler performance tuning.

The same architecture can be deployed in two modes, depending on your hardware budget and traffic pattern:

- **Separate**: LightLLM and LightX2V run on different GPU groups.
- **Colocate**: LightLLM and LightX2V run as separate processes on the same GPU.

In most production setups, `Separate` is the default choice because it gives clearer bottleneck control and independent scaling. `Colocate` is useful for quick validation, generation-heavy workloads, or smaller GPU setups.

### Attention for Multimodal Prefill of Neo

Neo's prefill attention is not standard causal attention. Text tokens remain causal, while image tokens attend to the full text prefix together with the entire image span. To support this hybrid masking pattern, we modified both attention implementations in our stack: the Triton kernel and the official FA3 codebase. Our FA3 branch is available at [WANDY666/flash-attention](https://github.com/WANDY666/flash-attention).

Concretely, we introduced an optional `image_token_tag` argument that adjusts the mask row by row. Text rows keep the standard causal mask. Image rows, instead of using plain causal truncation, are allowed to attend to all preceding text tokens and all image tokens within the image span.

To preserve the causal-triangle speedup whenever possible, the kernel makes the decision per M-block. It OR-reduces the `image_token_tag` values inside the current block: if the block contains no image token, it keeps the standard causal K-range; if the block contains image tokens, it extends the K-range to cover the required image span. As a result, pure-text blocks still follow the normal causal path, while only the relevant blocks pay the extra work needed by the hybrid mask.

![](./assets/att.png)

The overhead therefore does not depend on a fixed ratio, but on how image tokens are distributed across the sequence and across M-block boundaries. When image rows are concentrated in only part of the sequence, the extra work is correspondingly localized. For text-only requests, `image_token_tag` is empty, and the kernel falls back to vanilla FA3 with no additional overhead.
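The hybrid mask can be sketched in plain Python as a reference construction of the mask semantics only; this is not the Triton/FA3 kernel, and the per-M-block OR-reduction described above is an optimization that produces this same mask more cheaply.

```python
def hybrid_prefill_mask(image_token_tag):
    """Boolean S x S mask: mask[i][j] is True when query i may attend to key j.

    Text rows keep the standard causal mask. Rows inside a contiguous image
    span may additionally attend to every token of that span, including
    positions ahead of them, matching the hybrid pattern described above.
    """
    S = len(image_token_tag)
    mask = [[j <= i for j in range(S)] for i in range(S)]  # causal baseline
    i = 0
    while i < S:
        if image_token_tag[i]:
            j = i
            while j < S and image_token_tag[j]:
                j += 1                      # [i, j) is one contiguous image span
            for r in range(i, j):           # every image row in the span...
                for c in range(i, j):       # ...sees the whole image span
                    mask[r][c] = True
            i = j
        else:
            i += 1
    return mask

# Example: 3 text tokens, a 4-token image span, then 2 more text tokens.
tag = [0, 0, 0, 1, 1, 1, 1, 0, 0]
m = hybrid_prefill_mask(tag)
```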
The benchmark below compares two implementations for Neo-style multimodal prefill:

- **Triton implementation**: easier to migrate into existing codebases, with lower integration cost and faster iteration.
- **FA3 implementation**: higher absolute performance on supported hardware.

| batch | max_seq_len | image_token_num | triton (ms) | fa3 (ms) | speedup (×) |
| ----: | ----------: | --------------: | ----------: | -------: | ----------: |
| 8 | 4096 | 88 | 1.95 | 0.81 | **2.41×** |
| 8 | 8192 | 171 | 6.55 | 2.68 | **2.45×** |
| 8 | 65536 | 150 | 43.30 | 14.95 | **2.90×** |
| 16 | 4096 | 379 | 4.12 | 1.68 | **2.46×** |
| 16 | 8192 | 246 | 17.76 | 7.40 | **2.40×** |
| 16 | 65536 | 206 | 107.74 | 33.66 | **3.20×** |
| 32 | 4096 | 726 | 8.46 | 3.46 | **2.44×** |
| 32 | 8192 | 536 | 31.74 | 13.24 | **2.40×** |
| 32 | 65536 | 417 | 171.00 | 58.26 | **2.94×** |
| 64 | 4096 | 1170 | 16.08 | 6.88 | **2.34×** |
| 64 | 8192 | 1177 | 55.48 | 22.91 | **2.42×** |
| 64 | 65536 | 1291 | 348.89 | 124.82 | **2.80×** |
| 128 | 4096 | 2057 | 30.89 | 12.53 | **2.47×** |
| 128 | 8192 | 2196 | 104.73 | 43.22 | **2.42×** |
| 128 | 65536 | 2205 | 706.60 | 241.67 | **2.92×** |
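The speedup column is simply the ratio of the two measured latencies; a minimal check, using values taken from the table above:

```python
def speedup(triton_ms: float, fa3_ms: float) -> float:
    # Ratio of Triton latency to FA3 latency, rounded like the table's last column.
    return round(triton_ms / fa3_ms, 2)

print(speedup(1.95, 0.81))    # 2.41 (batch 8, max_seq_len 4096)
print(speedup(43.30, 14.95))  # 2.9 (batch 8, max_seq_len 65536)
```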
### Deployment

For a concise deployment runbook (Docker image, startup command, and API tests),
see [`deployment.md`](./deployment.md).

### Generation Performance

The table below reports measured latency for **2048x2048** image generation
across machine types and deployment profiles.

| Machine Type | Deployment Config | Per-step Latency (s/step) | End-to-end Latency (s) |
| ------------ | ----------------- | ------------------------: | ---------------------: |
| H100 | TP2+CFG2 / colocate | 0.158 | 9.23 |
| H200 | TP2+CFG2 / colocate | 0.152 | 9.54 |
| 5090 | TP2+CFG2 / separate | 0.415 | 23.04 |
| L40S | TP2+CFG2 / separate | 0.443 | 25.62 |

In Neo, the KV cache for the generation stage is provided by the understanding module, so T2I (generation) and I2I (editing) have very similar runtime characteristics. For brevity, we report only T2I latency here.
docs/showcases.md
CHANGED
|
@@ -11,11 +11,73 @@ open the full-resolution render.
|
|
| 11 |
|
| 12 |
## Text-to-Image
|
| 13 |
|
| 14 |
-
The
|
| 15 |
-
|
| 16 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 17 |
[`examples/t2i/data/samples_infographic.jsonl`](../examples/t2i/data/samples_infographic.jsonl).
|
| 18 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 19 |
---
|
| 20 |
|
| 21 |
## Image Editing
|
|
@@ -25,13 +87,82 @@ edit instruction rendered along the bottom of each compare tile. The same
|
|
| 25 |
unified model handles single-image attribute / style / relighting edits
|
| 26 |
and multi-reference (subject + accessory + pose) composition.
|
| 27 |
|
|
|
|
|
|
|
| 28 |
Reproducible prompts are in
|
| 29 |
[`examples/editing/data/samples.jsonl`](../examples/editing/data/samples.jsonl).
|
| 30 |
|
| 31 |
-
| | |
|
| 32 |
-
| :---: | :---: |
|
| 33 |
-
|
|
| 34 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 35 |
|
| 36 |
---
|
| 37 |
|
|
@@ -40,7 +171,10 @@ Reproducible prompts are in
|
|
| 40 |
Each case below is a single rendered response from `model.interleave_gen`:
|
| 41 |
the model first runs a `<think>...</think>` reasoning block that produces
|
| 42 |
intermediate images, then emits the final interleaved text-and-image
|
| 43 |
-
answer
|
|
|
|
|
|
|
|
|
|
| 44 |
|
| 45 |
|
| 46 |
| |
|
|
|
|
| 11 |
|
| 12 |
## Text-to-Image
|
| 13 |
|
| 14 |
+
The main table presents the complete n × 3 grid layouts, covering landscape, square, and portrait formats at different resolutions.
|
| 15 |
+
|
| 16 |
+
|
| 17 |
+
#### 🖼️ *Text-to-Image (General)*
|
| 18 |
+
|
| 19 |
+
Reproducible prompts are in
|
| 20 |
+
[`examples/t2i/data/samples.jsonl`](../examples/t2i/data/samples.jsonl).
|
| 21 |
+
|
| 22 |
+
|
| 23 |
+
| | | |
|
| 24 |
+
| :---: | :---: | :---: |
|
| 25 |
+
| [<img width="300" alt="t2i general dense face hd 07" src="./assets/showcases/t2i_general/16_9_dense_face_hd_07.webp">](./assets/showcases/t2i_general/16_9_dense_face_hd_07.webp) | [<img width="300" alt="t2i general dense text rendering 18" src="./assets/showcases/t2i_general/16_9_dense_text_rendering_18.webp">](./assets/showcases/t2i_general/16_9_dense_text_rendering_18.webp) | [<img width="300" alt="t2i general dense text rendering 12" src="./assets/showcases/t2i_general/16_9_dense_text_rendering_12.webp">](./assets/showcases/t2i_general/16_9_dense_text_rendering_12.webp) |
|
| 26 |
+
| [<img width="260" alt="t2i general face hd 13" src="./assets/showcases/t2i_general/1_1_face_hd_13.webp">](./assets/showcases/t2i_general/1_1_face_hd_13.webp) | [<img width="260" alt="t2i general face hd 17" src="./assets/showcases/t2i_general/1_1_face_hd_17.webp">](./assets/showcases/t2i_general/1_1_face_hd_17.webp) | [<img width="260" alt="t2i general face hd 07" src="./assets/showcases/t2i_general/1_1_dense_artistic_10.webp">](./assets/showcases/t2i_general/1_1_dense_artistic_10.webp) |
|
| 27 |
+
| [<img width="260" alt="t2i general landscape 06" src="./assets/showcases/t2i_general/1_1_landscape_06.webp">](./assets/showcases/t2i_general/1_1_landscape_06.webp) | [<img width="260" alt="t2i general dense landscape 12" src="./assets/showcases/t2i_general/1_1_dense_landscape_12.webp">](./assets/showcases/t2i_general/1_1_dense_landscape_12.webp) | [<img width="260" alt="t2i general landscape 07" src="./assets/showcases/t2i_general/1_1_landscape_07.webp">](./assets/showcases/t2i_general/1_1_landscape_07.webp) |
|
| 28 |
+
| [<img width="200" alt="t2i general portrait artistic 02 a" src="./assets/showcases/t2i_general/9_16_dense_face_hd_10.webp">](./assets/showcases/t2i_general/9_16_dense_face_hd_10.webp) | [<img width="200" alt="t2i general portrait artistic 02 b" src="./assets/showcases/t2i_general/9_16_human_pose_11.webp">](./assets/showcases/t2i_general/9_16_human_pose_11.webp) | [<img width="200" alt="t2i general portrait artistic 07" src="./assets/showcases/t2i_general/9_16_artistic_07.webp">](./assets/showcases/t2i_general/9_16_artistic_07.webp) |
|
| 29 |
+
| [<img width="200" alt="t2i general portrait text rendering 02" src="./assets/showcases/t2i_general/9_16_sensenova_u1_31.webp">](./assets/showcases/t2i_general/9_16_sensenova_u1_31.webp) | [<img width="200" alt="t2i general portrait dense landscape 05" src="./assets/showcases/t2i_general/9_16_dense_landscape_05.webp">](./assets/showcases/t2i_general/9_16_dense_landscape_05.webp) | [<img width="200" alt="t2i general portrait dense artistic 11" src="./assets/showcases/t2i_general/9_16_dense_artistic_11.webp">](./assets/showcases/t2i_general/9_16_dense_artistic_11.webp) |
|
| 30 |
+
|
| 31 |
+
|
| 32 |
+
#### 🖼️ *Text-to-Image (Reasoning)*
|
| 33 |
+
|
| 34 |
+
Reproducible prompts are in
|
| 35 |
+
[`examples/t2i/data/sample_reasoning.jsonl`](../examples/t2i/data/sample_reasoning.jsonl).
|
| 36 |
+
|
| 37 |
+
<table>
|
| 38 |
+
<tr>
|
| 39 |
+
<th style="width: 20%">Original Text</th>
|
| 40 |
+
<th style="width: 50%">Reasoning Process</th>
|
| 41 |
+
<th style="width: 30%">Resulting Image</th>
|
| 42 |
+
</tr>
|
| 43 |
+
<tr>
|
| 44 |
+
<td style="vertical-align: top;">The playful craft that embodies Russian cultural charm</td>
|
| 45 |
+
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is the matryoshka, identified as a Russian craft. Essential modifiers include playful and Russian cultural charm. The relation is that the craft embodies the charm. 2. <b>Reasoning Process:</b> The prompt identifies the matryoshka, the iconic Russian wooden doll set. 3. <b>Establish the frame:</b> The frame captures the matryoshka set in the foreground. The composition focuses on the Russian craft to show the playful nature of the doll set. 4. <b>Set the lighting and color:</b> Lighting illuminates the matryoshka to reveal the Russian cultural charm. The color palette supports the playful craft aesthetic. 5. <b>Lock the style:</b> The style emphasizes the wooden nature of the Russian craft. The finish reflects the charm of the matryoshka. 6. <b>Explicit Prompt:</b> A set of three colorful, hand-painted wooden matryoshka dolls arranged by size on a rustic wooden table, bright floral patterns, soft natural light.</div></td>
|
| 46 |
+
<td style="vertical-align: top;"><img src="./assets/showcases/t2i_reasoning/1_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
|
| 47 |
+
</tr>
|
| 48 |
+
<tr>
|
| 49 |
+
<td style="vertical-align: top;">A typical dish from the country where Naples is located</td>
|
| 50 |
+
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a Neapolitan pizza presented as a typical dish. The context specifies Naples, Italy, as the country of origin for this food item. 2. <b>Reasoning Process:</b> Naples is in Italy, and a classic dish is a Neapolitan pizza. 3. <b>Establish the frame:</b> The Neapolitan pizza is captured in a close-up shot that fills the central frame. The angle is slightly elevated to show the round form of the dish clearly. 4. <b>Set the lighting and color:</b> Soft lighting illuminates the surface of the Neapolitan pizza to reveal texture. Warm tones dominate the color palette, emphasizing the baked nature of the dish. 5. <b>Lock the style:</b> The image utilizes a realistic photographic style with sharp focus on the main subject. The finish appears natural and appetizing, suitable for food documentation. 6. <b>Explicit Prompt:</b> A delicious Neapolitan pizza with a soft, charred crust, tomato sauce, and fresh mozzarella, served on a rustic wooden table, realistic food photography.</div></td>
|
| 51 |
+
<td style="vertical-align: top;"><img src="./assets/showcases/t2i_reasoning/2_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
|
| 52 |
+
</tr>
|
| 53 |
+
<tr>
|
| 54 |
+
<td style="vertical-align: top;">A gigantic bubble in the immediate foreground with a small town barely visible inside</td>
|
| 55 |
+
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The prompt requires a gigantic bubble positioned in the immediate foreground. A small town must be visible inside the bubble. 2. <b>Reasoning Process:</b> The bubble acts as a transparent lens, potentially distorting the town's appearance due to its curvature. 3. <b>Establish the frame:</b> The shot scale is an extreme close-up, placing the bubble so close it fills the majority of the frame. The focus plane is set on the bubble's surface, rendering the background outside the town soft and out of focus. 4. <b>Set the lighting and color:</b> Natural daylight creates bright specular highlights on the upper surface of the bubble. The color palette is dominated by the cool transparency of the bubble and the muted tones of the town within. 5. <b>Lock the style:</b> The overall image style is photorealistic with a high level of detail on the bubble's texture. The rendering captures the glossy finish and physical properties of the bubble accurately. 6. <b>Explicit Prompt:</b> An extreme close-up of a massive, transparent soap bubble floating in the immediate foreground, with a tiny, detailed miniature town visible inside the curvature, refracted and magnified by the surface tension, soft natural daylight, dreamlike, high detail, photorealistic texture.</div></td>
<td style="vertical-align: top;"><img src="./assets/showcases/t2i_reasoning/3_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">A chocolate bar left in direct sunlight, highlighting the state of the chocolate</td>
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a chocolate bar positioned in direct sunlight. The focus is on the state of the chocolate, specifically how the heat affects it. 2. <b>Reasoning Process:</b> Heat causes chocolate to melt, losing its structured shape and becoming a viscous, glossy liquid. 3. <b>Establish the frame:</b> The composition is a close-up shot that fills the frame with the chocolate bar to emphasize detail. The angle is slightly elevated to show the top surface and the pooling liquid clearly. 4. <b>Build the environment:</b> The chocolate bar rests on a generic surface that supports the object without distracting from the main subject. The background is blurred to keep attention on the foreground elements and the chocolate. 5. <b>Set the lighting and color:</b> Direct sunlight creates bright highlights on the melting chocolate, emphasizing its glossy texture. The lighting is warm and intense, casting distinct shadows and illuminating the rich brown colors of the liquid. 6. <b>Explicit Prompt:</b> A close-up of a melting chocolate bar on a surface, with the edges losing their defined shape and pooling into a glossy, viscous puddle under the heat of the sun.</div></td>
<td style="vertical-align: top;"><img src="./assets/showcases/t2i_reasoning/6_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">A solution of calcium carbonate reacting with acetic acid</td>
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a solution of calcium carbonate and acetic acid. The prompt specifies the reacting state of the chemical mixture. 2. <b>Reasoning Process:</b> The reaction produces carbon dioxide gas, which would be visible as a steady stream of bubbles rising through the liquid. 3. <b>Establish the frame:</b> The camera frames the solution closely to capture the details of the reaction. The composition centers on the liquid where the gas is visible. 4. <b>Set the lighting and color:</b> The liquid appears clear, allowing the white bubbles to stand out distinctly. The lighting is bright and even to illuminate the stream of gas. 5. <b>Lock the style:</b> The image maintains a realistic photographic style suitable for scientific observation. The focus is sharp on the reacting solution and bubbles. 6. <b>Explicit Prompt:</b> A test tube filled with a clear liquid and a rapid, effervescent stream of carbon dioxide bubbles rising to the surface, laboratory experiment.</div></td>
<td style="vertical-align: top;"><img src="./assets/showcases/t2i_reasoning/7_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
</table>
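
The carbon-dioxide bubbles in the last example follow from the standard acid-carbonate reaction; for reference, the balanced equation is:

```latex
\mathrm{CaCO_3} + 2\,\mathrm{CH_3COOH} \longrightarrow \mathrm{Ca(CH_3COO)_2} + \mathrm{H_2O} + \mathrm{CO_2}\uparrow
```

The evolved CO₂ gas is what the model renders as the effervescent stream rising through the clear liquid.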
#### 🖼️ *Text-to-Image (Infographics)*
Reproducible prompts are in
[`examples/t2i/data/samples_infographic.jsonl`](../examples/t2i/data/samples_infographic.jsonl).
| | | |
| :---: | :---: | :---: |
| [<img width="300" alt="t2i landscape 0001" src="./assets/showcases/t2i_infographic/0001_2720x1536.webp">](./assets/showcases/t2i_infographic/0001_2720x1536.webp) | [<img width="300" alt="t2i landscape 0002" src="./assets/showcases/t2i_infographic/0002_2720x1536.webp">](./assets/showcases/t2i_infographic/0002_2720x1536.webp) | [<img width="300" alt="t2i landscape 0003" src="./assets/showcases/t2i_infographic/0003_2720x1536.webp">](./assets/showcases/t2i_infographic/0003_2720x1536.webp) |
| [<img width="300" alt="t2i square 0004" src="./assets/showcases/t2i_infographic/0004_2048x2048.webp">](./assets/showcases/t2i_infographic/0004_2048x2048.webp) | [<img width="300" alt="t2i square 0005" src="./assets/showcases/t2i_infographic/0005_2048x2048.webp">](./assets/showcases/t2i_infographic/0005_2048x2048.webp) | [<img width="300" alt="t2i square 0006" src="./assets/showcases/t2i_infographic/0006_2048x2048.webp">](./assets/showcases/t2i_infographic/0006_2048x2048.webp) |
| [<img width="200" alt="t2i portrait 0007" src="./assets/showcases/t2i_infographic/0007_1536x2720.webp">](./assets/showcases/t2i_infographic/0007_1536x2720.webp) | [<img width="200" alt="t2i portrait 0008" src="./assets/showcases/t2i_infographic/0008_1536x2720.webp">](./assets/showcases/t2i_infographic/0008_1536x2720.webp) | [<img width="200" alt="t2i portrait 0009" src="./assets/showcases/t2i_infographic/0009_1536x2720.webp">](./assets/showcases/t2i_infographic/0009_1536x2720.webp) |
---
## Image Editing
The unified model handles single-image attribute / style / relighting edits
and multi-reference (subject + accessory + pose) composition.
#### ✏️ *Image Editing (General)*
Reproducible prompts are in
[`examples/editing/data/samples.jsonl`](../examples/editing/data/samples.jsonl).
| | |
| :---: | :---: |
| <div align="center"><a href="../examples/editing/data/images/1.webp"><img width="180" alt="editing input 1" src="../examples/editing/data/images/1.webp"></a> <a href="../examples/editing/data/images/1_out.webp"><img width="180" alt="editing output 1" src="../examples/editing/data/images/1_out.webp"></a><br><sub>Change the jacket of the person on the left to bright yellow.</sub></div> | <div align="center"><a href="../examples/editing/data/images/3.webp"><img width="180" alt="editing input 3" src="../examples/editing/data/images/3.webp"></a> <a href="../examples/editing/data/images/3_out.webp"><img width="180" alt="editing output 3" src="../examples/editing/data/images/3_out.webp"></a><br><sub>在小狗头上放一个花环,并且把图片变为吉卜力风格。</sub></div> |
| <div align="center"><a href="../examples/editing/data/images/2.webp"><img width="180" alt="editing input 2" src="../examples/editing/data/images/2.webp"></a> <a href="../examples/editing/data/images/2_out.webp"><img width="180" alt="editing output 2" src="../examples/editing/data/images/2_out.webp"></a><br><sub>Make the person in the image smile.</sub></div> | <div align="center"><a href="../examples/editing/data/images/4.webp"><img width="180" alt="editing input 4" src="../examples/editing/data/images/4.webp"></a> <a href="../examples/editing/data/images/4_out.webp"><img width="180" alt="editing output 4" src="../examples/editing/data/images/4_out.webp"></a><br><sub>Add a bouquet of flowers.</sub></div> |
| <div align="center"><a href="../examples/editing/data/images/5.webp"><img width="180" alt="editing input 5" src="../examples/editing/data/images/5.webp"></a> <a href="../examples/editing/data/images/5_out.webp"><img width="180" alt="editing output 5" src="../examples/editing/data/images/5_out.webp"></a><br><sub>Turn the image into an American comic style.</sub></div> | <div align="center"><a href="../examples/editing/data/images/8.webp"><img width="180" alt="editing input 8" src="../examples/editing/data/images/8.webp"></a> <a href="../examples/editing/data/images/8_out.webp"><img width="180" alt="editing output 8" src="../examples/editing/data/images/8_out.webp"></a><br><sub>Replace the man with a woman.</sub></div> |
| <div align="center"><a href="../examples/editing/data/images/6.webp"><img width="180" alt="editing input 6" src="../examples/editing/data/images/6.webp"></a> <a href="../examples/editing/data/images/6_out.webp"><img width="180" alt="editing output 6" src="../examples/editing/data/images/6_out.webp"></a><br><sub>Replace the text "WARFIGHTER" to "BATTLEFIELD" in the bold orange-red font.</sub></div> | <div align="center"><a href="../examples/editing/data/images/7.webp"><img width="180" alt="editing input 7" src="../examples/editing/data/images/7.webp"></a> <a href="../examples/editing/data/images/7_out.webp"><img width="180" alt="editing output 7" src="../examples/editing/data/images/7_out.webp"></a><br><sub>Remove the person on the far right wearing a green skirt and a green top.</sub></div> |
#### ✏️ *Image Editing (Reasoning)*
Reproducible prompts are in
[`examples/editing/data/samples_reasoning.jsonl`](../examples/editing/data/samples_reasoning.jsonl).
<table>
<tr>
<th style="width: 20%">Original Text</th>
<th style="width: 30%">Original Image</th>
<th style="width: 20%">Reasoning Process</th>
<th style="width: 30%">Resulting Image</th>
</tr>
<tr>
<td style="vertical-align: top;">Draw what it will look like one hour later.</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/034_temporal_reasoning_draw_what_it_will_look_like.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a glass cup of hot tea with steeping tea leaves, and the water appears relatively clear. 2. <b>Instruction Understanding:</b> The instruction asks for the appearance one hour later. 3. <b>Reasoning Process:</b> Over time, tannins and pigments leach out, making the tea noticeably darker and more uniformly colored, and the leaves may look more swollen and darker from soaking. 4. <b>Expected Visual Changes:</b> The expected visible result is a deeper amber-to-brown tea color and more fully saturated liquid. 5. <b>Elements to Preserve:</b> The glass cup, scattered leaves around it, background, and camera angle should remain unchanged. 6. <b>Explicit Edit Prompt:</b> Edit the tea liquid to a much darker, more saturated amber-brown color as if fully steeped, and make the tea leaves look slightly darker and more swollen, while keeping the glass cup, surrounding leaves, background, and framing unchanged.</div></td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/034_temporal_reasoning_draw_what_it_will_look_like_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Draw what it will look like immediately after someone stands up from sitting on it for a long time.</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/036_causal_reasoning_draw_what_it_will_look_like.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a fluffy lime-green beanbag chair that looks evenly plump and undisturbed on a white background. 2. <b>Instruction Understanding:</b> The edit instruction asks for its appearance immediately after someone stood up from sitting there for a long time. 3. <b>Reasoning Process:</b> Prolonged weight compresses the fabric and internal fill, leaving a depressed seat area, wrinkles radiating outward, and a slowly recovering shape. 4. <b>Expected Visual Changes:</b> The visible result should be a noticeable dip and creasing where a person was seated. 5. <b>Elements to Preserve:</b> The background, beanbag color, lighting, and camera angle should remain unchanged while only the beanbag’s shape shows the compression. 6. <b>Explicit Edit Prompt:</b> Edit the beanbag chair to show a clear seated depression in the center with surrounding wrinkles and slightly compressed fabric, while keeping the white background, lighting, and camera angle unchanged.</div></td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/036_causal_reasoning_draw_what_it_will_look_like_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Draw an image showing the side view of the provided traffic cone.</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/039_spatial_reasoning_draw_an_image_showing_the_si.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a 3D perspective view of a traffic cone. 2. <b>Instruction Understanding:</b> The instruction asks for a side view. 3. <b>Reasoning Process:</b> A side view of a standard traffic cone results in a triangular silhouette with a flat rectangular base. 4. <b>Expected Visual Changes:</b> The perspective is flattened into this 2D-like geometric profile. 5. <b>Elements to Preserve:</b> The cone's height and color should remain consistent with the original object. 6. <b>Explicit Edit Prompt:</b> Edit the perspective view into a flat side-profile silhouette of a triangle with a rectangular base, keeping the red color and proportions unchanged.</div></td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/039_spatial_reasoning_draw_an_image_showing_the_si_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Change the water to high-concentration saltwater</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/042_physics_change_the_water_to_high-con.jpg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows an egg resting at the bottom of a glass of water. 2. <b>Instruction Understanding:</b> The instruction asks to change the medium to high-concentration saltwater. 3. <b>Reasoning Process:</b> Saltwater is denser than fresh water, which increases the buoyant force on the egg. 4. <b>Expected Visual Changes:</b> As density increases, the egg will overcome gravity and float higher or suspend in the middle of the liquid. 5. <b>Elements to Preserve:</b> The glass and the egg's appearance should remain consistent, focusing on the shift in the egg's vertical position. 6. <b>Explicit Edit Prompt:</b> Edit the position of the egg so it is floating in the middle of the liquid instead of resting on the bottom, while keeping the glass and the egg's appearance unchanged.</div></td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/042_physics_change_the_water_to_high-con_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">What the fruit looks like when ripe in the picture</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/044_biology_what_the_fruit_looks_like_wh.jpg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows green, unripe bananas. 2. <b>Instruction Understanding:</b> The instruction asks for the appearance of the fruit when ripe. 3. <b>Reasoning Process:</b> Ripening involves a breakdown of chlorophyll and the production of sugars, which turns the skin from green to yellow and often causes small brown sugar spots to appear. 4. <b>Expected Visual Changes:</b> The color and texture of the peel should transition to a ripe state. 5. <b>Elements to Preserve:</b> The shape of the bananas and the white background should remain constant. 6. <b>Explicit Edit Prompt:</b> Edit the green bananas to be bright yellow with small brown spots, while keeping the original shape and white background unchanged.</div></td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/044_biology_what_the_fruit_looks_like_wh_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Correct the unreasonable part in the image.</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/046_anomaly_correction_correct_the_unreasonable_par.jpg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a kettle pouring water onto a mug, but the stream is misaligned and missing the cup. 2. <b>Instruction Understanding:</b> The instruction asks to fix the physical inconsistency. 3. <b>Reasoning Process:</b> The water stream must be redirected to connect the spout to the mug, maintaining the trajectory of gravity. 4. <b>Expected Visual Changes:</b> The water stream will be redirected to connect the spout to the mug. 5. <b>Elements to Preserve:</b> The kettle, mug, and background must remain unchanged while the water path is corrected. 6. <b>Explicit Edit Prompt:</b> Draw a continuous water stream connecting the kettle spout to the mug, keeping the kettle, mug, and background unchanged.</div></td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/046_anomaly_correction_correct_the_unreasonable_par_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Modify the matrix in the image to an upper triangular matrix</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/047_mathematics_modify_the_matrix_in_the_ima.jpg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a 2x2 matrix with values 1, 2, 3, and 4. 2. <b>Instruction Understanding:</b> The instruction asks to convert this to an upper triangular matrix. 3. <b>Reasoning Process:</b> By definition, an upper triangular matrix has zeros below the main diagonal, so the entry '3' must be changed to '0' while keeping '1', '2', and '4' as they are, and this modification satisfies the mathematical property requested. 4. <b>Expected Visual Changes:</b> The entry '3' in the lower-left position will be changed to '0'. 5. <b>Elements to Preserve:</b> The grid lines, the matrix structure, and the other entries must remain unchanged. 6. <b>Explicit Edit Prompt:</b> Change the '3' in the lower-left position to '0', while keeping the matrix structure and other entries unchanged.</div></td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/047_mathematics_modify_the_matrix_in_the_ima_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
</table>
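
As a sanity check on the matrix edit in the last row, the requested upper-triangular form can be reproduced with NumPy (an illustrative sketch only, not part of the model pipeline):

```python
import numpy as np

# The 2x2 matrix shown in the source image.
m = np.array([[1, 2],
              [3, 4]])

# An upper triangular matrix has zeros below the main diagonal;
# np.triu zeroes exactly those entries, so the '3' becomes '0'.
upper = np.triu(m)
print(upper)
# [[1 2]
#  [0 4]]
```

This matches the edit the model performs: only the below-diagonal entry changes, and every other entry is preserved.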
---
Each case below is a single rendered response from `model.interleave_gen`:
the model first runs a `<think>...</think>` reasoning block that produces
intermediate images, then emits the final interleaved text-and-image
answer.
Reproducible prompts are in
[`examples/interleave/data/samples.jsonl`](../examples/interleave/data/samples.jsonl).
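
The `<think>...</think>` convention described above lends itself to simple post-processing. Below is a minimal sketch: the tag format comes from this section, but `split_response` itself is a hypothetical helper, not part of the project's API:

```python
import re

# Hypothetical helper (not the project's API): split a raw interleaved
# response into its <think> reasoning block and the final answer text.
def split_response(text: str) -> tuple[str, str]:
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        # No reasoning block: the whole response is the answer.
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()

thinking, answer = split_response(
    "<think>draft layout <image_0></think>Here is the result: <image_1>"
)
print(thinking)  # draft layout <image_0>
print(answer)    # Here is the result: <image_1>
```

Image placeholders such as `<image_0>` here stand in for the intermediate and final images embedded in the stream; how they are actually tokenized is model-specific.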
|