TokenWang committed
Commit 6fb95ae · verified · 1 Parent(s): cc1a2eb

Upload folder using huggingface_hub

Files changed (36)
  1. .gitattributes +27 -0
  2. docs/assets/att.png +3 -0
  3. docs/assets/benchmarks/generation.webp +3 -0
  4. docs/assets/benchmarks/interleaved.webp +0 -0
  5. docs/assets/benchmarks/understanding.webp +3 -0
  6. docs/assets/lightllm_x2v.png +3 -0
  7. docs/assets/showcases/t2i_general/16_9_dense_face_hd_07.webp +3 -0
  8. docs/assets/showcases/t2i_general/16_9_dense_text_rendering_12.webp +3 -0
  9. docs/assets/showcases/t2i_general/16_9_dense_text_rendering_18.webp +0 -0
  10. docs/assets/showcases/t2i_general/1_1_artistic_02.webp +3 -0
  11. docs/assets/showcases/t2i_general/1_1_dense_artistic_09.webp +0 -0
  12. docs/assets/showcases/t2i_general/1_1_dense_artistic_10.webp +3 -0
  13. docs/assets/showcases/t2i_general/1_1_dense_artistic_19.webp +3 -0
  14. docs/assets/showcases/t2i_general/1_1_dense_landscape_12.webp +3 -0
  15. docs/assets/showcases/t2i_general/1_1_face_hd_13.webp +3 -0
  16. docs/assets/showcases/t2i_general/1_1_face_hd_17.webp +3 -0
  17. docs/assets/showcases/t2i_general/1_1_landscape_06.webp +3 -0
  18. docs/assets/showcases/t2i_general/1_1_landscape_07.webp +3 -0
  19. docs/assets/showcases/t2i_general/9_16_artistic_07.webp +3 -0
  20. docs/assets/showcases/t2i_general/9_16_dense_artistic_11.webp +3 -0
  21. docs/assets/showcases/t2i_general/9_16_dense_face_hd_10.webp +3 -0
  22. docs/assets/showcases/t2i_general/9_16_dense_landscape_05.webp +3 -0
  23. docs/assets/showcases/t2i_general/9_16_human_pose_11.webp +3 -0
  24. docs/assets/showcases/t2i_general/9_16_sensenova_u1_31.webp +0 -0
  25. docs/assets/showcases/t2i_general/9_16_text_rendering_02.webp +3 -0
  26. docs/assets/showcases/t2i_reasoning/1_reasoning.png +3 -0
  27. docs/assets/showcases/t2i_reasoning/2_reasoning.png +3 -0
  28. docs/assets/showcases/t2i_reasoning/3_reasoning.png +3 -0
  29. docs/assets/showcases/t2i_reasoning/4_reasoning.png +3 -0
  30. docs/assets/showcases/t2i_reasoning/5_reasoning.png +3 -0
  31. docs/assets/showcases/t2i_reasoning/6_reasoning.png +3 -0
  32. docs/assets/showcases/t2i_reasoning/7_reasoning.png +3 -0
  33. docs/assets/teaser.png +2 -2
  34. docs/deployment.md +143 -0
  35. docs/inference_infrastructure.md +81 -2
  36. docs/showcases.md +142 -8
.gitattributes CHANGED
@@ -70,3 +70,30 @@ docs/assets/showcases/t2i_infographic/0009_1536x2720.webp filter=lfs diff=lfs me
70
  docs/assets/showcases/vqa/agentic_case.webp filter=lfs diff=lfs merge=lfs -text
71
  docs/assets/showcases/vqa/general_case.webp filter=lfs diff=lfs merge=lfs -text
72
  docs/assets/teaser.png filter=lfs diff=lfs merge=lfs -text
73
+ docs/assets/att.png filter=lfs diff=lfs merge=lfs -text
74
+ docs/assets/benchmarks/generation.webp filter=lfs diff=lfs merge=lfs -text
75
+ docs/assets/benchmarks/understanding.webp filter=lfs diff=lfs merge=lfs -text
76
+ docs/assets/lightllm_x2v.png filter=lfs diff=lfs merge=lfs -text
77
+ docs/assets/showcases/t2i_general/16_9_dense_face_hd_07.webp filter=lfs diff=lfs merge=lfs -text
78
+ docs/assets/showcases/t2i_general/16_9_dense_text_rendering_12.webp filter=lfs diff=lfs merge=lfs -text
79
+ docs/assets/showcases/t2i_general/1_1_artistic_02.webp filter=lfs diff=lfs merge=lfs -text
80
+ docs/assets/showcases/t2i_general/1_1_dense_artistic_10.webp filter=lfs diff=lfs merge=lfs -text
81
+ docs/assets/showcases/t2i_general/1_1_dense_artistic_19.webp filter=lfs diff=lfs merge=lfs -text
82
+ docs/assets/showcases/t2i_general/1_1_dense_landscape_12.webp filter=lfs diff=lfs merge=lfs -text
83
+ docs/assets/showcases/t2i_general/1_1_face_hd_13.webp filter=lfs diff=lfs merge=lfs -text
84
+ docs/assets/showcases/t2i_general/1_1_face_hd_17.webp filter=lfs diff=lfs merge=lfs -text
85
+ docs/assets/showcases/t2i_general/1_1_landscape_06.webp filter=lfs diff=lfs merge=lfs -text
86
+ docs/assets/showcases/t2i_general/1_1_landscape_07.webp filter=lfs diff=lfs merge=lfs -text
87
+ docs/assets/showcases/t2i_general/9_16_artistic_07.webp filter=lfs diff=lfs merge=lfs -text
88
+ docs/assets/showcases/t2i_general/9_16_dense_artistic_11.webp filter=lfs diff=lfs merge=lfs -text
89
+ docs/assets/showcases/t2i_general/9_16_dense_face_hd_10.webp filter=lfs diff=lfs merge=lfs -text
90
+ docs/assets/showcases/t2i_general/9_16_dense_landscape_05.webp filter=lfs diff=lfs merge=lfs -text
91
+ docs/assets/showcases/t2i_general/9_16_human_pose_11.webp filter=lfs diff=lfs merge=lfs -text
92
+ docs/assets/showcases/t2i_general/9_16_text_rendering_02.webp filter=lfs diff=lfs merge=lfs -text
93
+ docs/assets/showcases/t2i_reasoning/1_reasoning.png filter=lfs diff=lfs merge=lfs -text
94
+ docs/assets/showcases/t2i_reasoning/2_reasoning.png filter=lfs diff=lfs merge=lfs -text
95
+ docs/assets/showcases/t2i_reasoning/3_reasoning.png filter=lfs diff=lfs merge=lfs -text
96
+ docs/assets/showcases/t2i_reasoning/4_reasoning.png filter=lfs diff=lfs merge=lfs -text
97
+ docs/assets/showcases/t2i_reasoning/5_reasoning.png filter=lfs diff=lfs merge=lfs -text
98
+ docs/assets/showcases/t2i_reasoning/6_reasoning.png filter=lfs diff=lfs merge=lfs -text
99
+ docs/assets/showcases/t2i_reasoning/7_reasoning.png filter=lfs diff=lfs merge=lfs -text
docs/assets/att.png ADDED

Git LFS Details

  • SHA256: 49dbc80d1d9c1246165f7fd46d0755e4d90f25eaa225c365c530adf834549745
  • Pointer size: 131 Bytes
  • Size of remote file: 596 kB
docs/assets/benchmarks/generation.webp ADDED

Git LFS Details

  • SHA256: 93138201821f7204ccba4aa1f43ed575ae4ae7b2845c08e8367310abfc7686f1
  • Pointer size: 131 Bytes
  • Size of remote file: 304 kB
docs/assets/benchmarks/interleaved.webp ADDED
docs/assets/benchmarks/understanding.webp ADDED

Git LFS Details

  • SHA256: 90a3c3e959be798bf196a7f78bef07e5e11e67cfa41afd2bf61a69d552e760ac
  • Pointer size: 131 Bytes
  • Size of remote file: 331 kB
docs/assets/lightllm_x2v.png ADDED

Git LFS Details

  • SHA256: dbd353eabdd421ac405b055b7bd8b87caedd841d06a50bf988254e5e7b7121fc
  • Pointer size: 131 Bytes
  • Size of remote file: 471 kB
docs/assets/showcases/t2i_general/16_9_dense_face_hd_07.webp ADDED

Git LFS Details

  • SHA256: 536fe6c50ef38930f9ead01e818cebaf54643dd783bc9a61b5656d3537def34c
  • Pointer size: 131 Bytes
  • Size of remote file: 143 kB
docs/assets/showcases/t2i_general/16_9_dense_text_rendering_12.webp ADDED

Git LFS Details

  • SHA256: 3464dae226b4b33ed883464153d31af9fd3eafe1ca2a9a00494a3d3379123745
  • Pointer size: 131 Bytes
  • Size of remote file: 123 kB
docs/assets/showcases/t2i_general/16_9_dense_text_rendering_18.webp ADDED
docs/assets/showcases/t2i_general/1_1_artistic_02.webp ADDED

Git LFS Details

  • SHA256: 7b142318b19257d67e072de13e589ce2aa0c51ef43077d31057b4c53813b4570
  • Pointer size: 131 Bytes
  • Size of remote file: 129 kB
docs/assets/showcases/t2i_general/1_1_dense_artistic_09.webp ADDED
docs/assets/showcases/t2i_general/1_1_dense_artistic_10.webp ADDED

Git LFS Details

  • SHA256: 39f7ec6c871586517bc98cb769118c05ae9f0b7734594f74cf0788fbafb192b4
  • Pointer size: 131 Bytes
  • Size of remote file: 183 kB
docs/assets/showcases/t2i_general/1_1_dense_artistic_19.webp ADDED

Git LFS Details

  • SHA256: 03d3316935c7a1f32dbb4d4f53242ae3b85a02f67dee3d7f30a8420afb69ca2f
  • Pointer size: 131 Bytes
  • Size of remote file: 218 kB
docs/assets/showcases/t2i_general/1_1_dense_landscape_12.webp ADDED

Git LFS Details

  • SHA256: 48b2d12e739b38d2eed3de6d8c616b85fb30559876b7cf2546d9b40a8943e71c
  • Pointer size: 131 Bytes
  • Size of remote file: 234 kB
docs/assets/showcases/t2i_general/1_1_face_hd_13.webp ADDED

Git LFS Details

  • SHA256: abf821c069b6903457eb8323a3b019163a1997e9635d79c63b55a770b49f262e
  • Pointer size: 131 Bytes
  • Size of remote file: 201 kB
docs/assets/showcases/t2i_general/1_1_face_hd_17.webp ADDED

Git LFS Details

  • SHA256: 53c830617404b27eb4f5730279b32f6717d84600b2d58f58e36ed668bea7675e
  • Pointer size: 131 Bytes
  • Size of remote file: 116 kB
docs/assets/showcases/t2i_general/1_1_landscape_06.webp ADDED

Git LFS Details

  • SHA256: b5995e12a5fb572614c64c1bf42b3cb790772e7c5d4b1a87605ef9df7232b57a
  • Pointer size: 131 Bytes
  • Size of remote file: 272 kB
docs/assets/showcases/t2i_general/1_1_landscape_07.webp ADDED

Git LFS Details

  • SHA256: 79f90dcf90b8e3d3e4779776ef30cb5a7d7d000914db65da4c2b257ad6b80d58
  • Pointer size: 131 Bytes
  • Size of remote file: 286 kB
docs/assets/showcases/t2i_general/9_16_artistic_07.webp ADDED

Git LFS Details

  • SHA256: ca8653292ed7d84dbf0137b4c8da9d03129ece6f93ffaff8f495b69bddb2e190
  • Pointer size: 131 Bytes
  • Size of remote file: 107 kB
docs/assets/showcases/t2i_general/9_16_dense_artistic_11.webp ADDED

Git LFS Details

  • SHA256: c69ef067fec9a866888d94c5822c85e2159c83e424dbe20c81b20b6c6b140760
  • Pointer size: 131 Bytes
  • Size of remote file: 120 kB
docs/assets/showcases/t2i_general/9_16_dense_face_hd_10.webp ADDED

Git LFS Details

  • SHA256: c58543eca91130d1b388edd608d597de6a7f29350129add8ea05410c9adc4d84
  • Pointer size: 131 Bytes
  • Size of remote file: 160 kB
docs/assets/showcases/t2i_general/9_16_dense_landscape_05.webp ADDED

Git LFS Details

  • SHA256: 868fb5bde86d4f48d05403866e78ac4a9b058efd1dba10f388e553edb5ea294d
  • Pointer size: 131 Bytes
  • Size of remote file: 442 kB
docs/assets/showcases/t2i_general/9_16_human_pose_11.webp ADDED

Git LFS Details

  • SHA256: 6817a79b297833698bbd078c170ceab30a970a3f2e49826c6c5f65e5e5b6d192
  • Pointer size: 131 Bytes
  • Size of remote file: 111 kB
docs/assets/showcases/t2i_general/9_16_sensenova_u1_31.webp ADDED
docs/assets/showcases/t2i_general/9_16_text_rendering_02.webp ADDED

Git LFS Details

  • SHA256: 556c2e1676084f3cddbd74fe60d2a104ce76551ad8c78de6b4ffd5a1257aa767
  • Pointer size: 131 Bytes
  • Size of remote file: 136 kB
docs/assets/showcases/t2i_reasoning/1_reasoning.png ADDED

Git LFS Details

  • SHA256: 01f118ba5cb52b1d9d9cff5da0ffc5d9e6a76ab343d4de6d6ff486ece857a44e
  • Pointer size: 132 Bytes
  • Size of remote file: 1.37 MB
docs/assets/showcases/t2i_reasoning/2_reasoning.png ADDED

Git LFS Details

  • SHA256: acb80dfedf8f0b5233c6a50ff7dec7d99fb5162a635c2972f0bbf016ccdd732f
  • Pointer size: 132 Bytes
  • Size of remote file: 1.36 MB
docs/assets/showcases/t2i_reasoning/3_reasoning.png ADDED

Git LFS Details

  • SHA256: b58e4fed10ac41a9f4d243c0315f1736e5aeac45bd738002c919ae70e9f9230f
  • Pointer size: 132 Bytes
  • Size of remote file: 1.19 MB
docs/assets/showcases/t2i_reasoning/4_reasoning.png ADDED

Git LFS Details

  • SHA256: 8fad9ca5022f190b37b7f3d576047bd32608b4774390b859606a1403eae16a3e
  • Pointer size: 132 Bytes
  • Size of remote file: 2.11 MB
docs/assets/showcases/t2i_reasoning/5_reasoning.png ADDED

Git LFS Details

  • SHA256: 190aada770284379dcc2e2c60b04ac7a3bf95d5d3107e1a9fec346ca25cb732c
  • Pointer size: 132 Bytes
  • Size of remote file: 1.05 MB
docs/assets/showcases/t2i_reasoning/6_reasoning.png ADDED

Git LFS Details

  • SHA256: eedf078c84dd37b49f9c0158efa2419c8193f16dcd10904b9b02b45c95878a98
  • Pointer size: 132 Bytes
  • Size of remote file: 1.2 MB
docs/assets/showcases/t2i_reasoning/7_reasoning.png ADDED

Git LFS Details

  • SHA256: 8f2e443df8a418ef66522bc554d434ceb4043000ee6569b7c925791a4561b066
  • Pointer size: 132 Bytes
  • Size of remote file: 1.1 MB
docs/assets/teaser.png CHANGED

Git LFS Details

  • SHA256: 0716a056fe9fbc8dec1a892cf0c53f8e29511d9c16907c1cff56fce2427f2f88
  • Pointer size: 131 Bytes
  • Size of remote file: 351 kB

Git LFS Details

  • SHA256: 5e6c914c410db78e194a39d54c8664e3d13c21a8d5c404b232b5a0ea944a2a2a
  • Pointer size: 132 Bytes
  • Size of remote file: 1.87 MB
docs/deployment.md ADDED
@@ -0,0 +1,143 @@
1
+ # LightLLM + LightX2V Deployment
2
+
3
+ <p align="center">
4
+ <a href="../README.md">← Back to main README</a>
5
+ </p>
6
+
7
+ This guide provides a practical deployment flow for serving SenseNova-U1 with
8
+ LightLLM + LightX2V using the Docker image
9
+ `lightx2v/lightllm_lightx2v:20260407`.
10
+
11
+ ## 1) Pull and enter the Docker image
12
+
13
+ ```bash
14
+ docker pull lightx2v/lightllm_lightx2v:20260407
15
+ docker run --gpus all --ipc=host --network host -it lightx2v/lightllm_lightx2v:20260407 /bin/bash
16
+ ```
17
+
18
+ ## 2) Clone runtime dependencies inside the container
19
+
20
+ The image may not include the latest source trees. Clone both repositories and
21
+ pin LightLLM to the validated branch:
22
+
23
+ ```bash
24
+ git clone https://github.com/ModelTC/LightX2V.git
25
+ git clone https://github.com/ModelTC/LightLLM.git
26
+ cd LightLLM
27
+ git checkout neo_plus_clean
28
+ ```
29
+
30
+ ## 3) X2I-related arguments
31
+
32
+ When enabling image generation in the same API server, use the following flags:
33
+
34
+ - `--enable_multimodal_x2i`
35
+ Enable image generation capability.
36
+ - `--x2i_server_used_gpus`
37
+ Number of GPUs reserved for the X2I generation server.
38
+ - `--x2i_server_deploy_mode {colocate,separate}`
39
+ - `colocate`: understanding and generation share the same visible GPU pool.
40
+ - `separate`: understanding and generation are deployed as separate services, and
41
+ can use different GPU sets.
42
+ - `--x2i_use_naive_impl`
43
+ Use the native/naive PyTorch backend for X2I (debugging/testing only, not for
44
+ production throughput).
45
+
46
+ ## 4) Deployment modes
47
+
48
+ ### Mode A: `colocate` (single service, shared GPU pool)
49
+
50
+ Use this mode for quick validation and simpler operations. The LLM understanding
51
+ path (`--tp`) and X2I generation path (`--x2i_server_used_gpus`) consume resources
52
+ from the same visible GPUs.
53
+
54
+ Example (2 GPUs total):
55
+ - understanding path: `tp=2`
56
+ - generation path: `cfg=2` (configured in `neopp_dense_parallel_cfg.json`)
57
+
58
+ ```bash
59
+ PYTHONPATH=/workspace/LightX2V/ \
60
+ python -m lightllm.server.api_server \
61
+ --model_dir $MODEL_DIR \
62
+ --enable_multimodal_x2i \
63
+ --x2i_server_deploy_mode colocate \
64
+ --x2i_server_used_gpus 2 \
65
+ --x2v_gen_model_config /workspace/LightX2V/configs/neopp/neopp_dense_parallel_cfg.json \
66
+ --host 0.0.0.0 \
67
+ --port 8000 \
68
+ --max_req_total_len 65536 \
69
+ --mem_fraction 0.75 \
70
+ --tp 2
71
+ ```
72
+
73
+ ### Mode B: `separate` (understanding and generation decoupled)
74
+
75
+ `separate` is conceptually similar to PD-style (prefill/decode) disaggregation in LLM serving: split
76
+ different stages onto different GPU groups so that a long-running stage does not block a
77
+ short one.
78
+
79
+ For multimodal serving, image generation is usually the long stage, while
80
+ understanding is short and lightweight. Separating them allows understanding
81
+ requests to keep flowing even when generation workers are busy.
82
+
83
+ Recommended deployment profiles:
84
+
85
+ 1. **Default profile (continuity-first): Understanding `tp=1` + Generation 1 GPU**
86
+ - Understanding: `--tp 1`
87
+ - Generation: `--x2i_server_used_gpus 1`
88
+ - Use as the baseline profile for mixed workloads. It keeps the pipeline simple
89
+ while avoiding head-of-line blocking between understanding and generation.
90
+
91
+ 2. **Understanding-expanded profile: Understanding `tp=2` + Generation 1 GPU**
92
+ - Understanding: `--tp 2`
93
+ - Generation: `--x2i_server_used_gpus 1`
94
+ - Use when complex prompts or high understanding QPS become the bottleneck.
95
+
96
+ 3. **Generation-expanded profile: Understanding `tp=1/2` + Generation parallel**
97
+ - Understanding: `--tp 1` or `--tp 2`
98
+ - Generation option A (2 GPUs): `--x2i_server_used_gpus 2` +
99
+ `/workspace/LightX2V/configs/neopp/neopp_dense_parallel_cfg.json`
100
+ - Generation option B (4 GPUs): `--x2i_server_used_gpus 4` +
101
+ `/workspace/LightX2V/configs/neopp/neopp_dense_parallel_cfg_seq.json`
102
+ - Use when generation latency/throughput dominates (most common scaling path).
103
+
104
+ Example launch (separate mode in API server):
105
+
106
+ ```bash
107
+ PYTHONPATH=/workspace/LightX2V/ \
108
+ python -m lightllm.server.api_server \
109
+ --model_dir $MODEL_DIR \
110
+ --enable_multimodal_x2i \
111
+ --x2i_server_deploy_mode separate \
112
+ --x2i_server_used_gpus 1 \
113
+ --x2v_gen_model_config /workspace/LightX2V/configs/neopp/neopp_dense.json \
114
+ --host 0.0.0.0 \
115
+ --port 8000 \
116
+ --max_req_total_len 65536 \
117
+ --mem_fraction 0.75 \
118
+ --tp 2
119
+ ```
120
+
121
+ ## 5) Quantization
122
+
123
+ `separate` mode also enables independent quantization strategies for
124
+ understanding and generation.
125
+
126
+ Because understanding and generation are decoupled, you can tune quality/latency
127
+ for each path independently:
128
+
129
+ 1. **Understanding FP16/BF16 + Generation FP8**
130
+ - Understanding: no quantization flag (keep default precision)
131
+ - Generation: use FP8 generation config, for example
132
+ `/workspace/LightX2V/configs/neopp/neopp_dense_fp8.json`
133
+ - Recommended as the default quantized profile for production.
134
+
135
+ 2. **Understanding FP8 + Generation FP8**
136
+ - Understanding: add `--quant_type fp8w8a8`
137
+ - Generation: use FP8 generation config
138
+ `/workspace/LightX2V/configs/neopp/neopp_dense_fp8.json`
139
+ - Use when GPU memory/throughput is the primary constraint.
140
+
141
+ Notes:
142
+ - `--quant_type fp8w8a8` controls quantization on the understanding path.
143
+ - Generation-side precision is controlled by `--x2v_gen_model_config`.
docs/inference_infrastructure.md CHANGED
@@ -6,7 +6,86 @@
6
 
7
  This document describes the inference infrastructure behind **SenseNova-U1**, built on top of **[LightLLM](https://github.com/ModelTC/lightllm)** and **[LightX2V](https://github.com/ModelTC/lightx2v)**.
8
 
9
-
10
  ## Overview
11
 
12
- A unified model that handles both understanding and generation has fundamentally asymmetric compute profiles: the understanding side is prefill-heavy and KV-bound, while the generation side is iterative-decode-heavy and bandwidth/compute-bound. Serving them on a single, monolithic engine leaves both sides suboptimal. SenseNova-U1's inference stack therefore adopts a **disaggregated** design, with specialized optimizations on each side and a lightweight transfer layer in between.
11
+ SenseNova-U1 is exposed as one unified multimodal model, but the understanding and generation paths exhibit different execution shapes in production. They tend to prefer different scheduling policies, parallelization strategies, and resource ratios, rather than a single shared serving configuration. When both are coupled inside one monolithic runtime, these choices become unnecessarily tied together, which can leave both paths operating away from their respective optimal points.
12
+
13
+ To avoid this coupling, SenseNova-U1 adopts a **disaggregated** architecture:
14
+
15
+ - **LightLLM** for understanding, text streaming, and control flow
16
+ - **LightX2V** for image generation
17
+
18
+ These two engines exchange generation state through pinned shared memory and high-performance transfer kernels. The handoff is lightweight, while each side can still run with its own optimal execution policy.
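As a rough illustration of this style of handoff, the sketch below passes an activation buffer between a producer and a consumer through named shared memory. This is illustrative only: the name `u1_handoff_demo` and the buffer shape are invented, and the production path uses pinned host memory plus custom GPU transfer kernels rather than POSIX shared memory.

```python
import numpy as np
from multiprocessing import shared_memory

# Producer side (understanding engine): publish generation state once.
latents = np.random.rand(4, 64).astype(np.float32)
shm = shared_memory.SharedMemory(create=True, size=latents.nbytes,
                                 name="u1_handoff_demo")
src = np.ndarray(latents.shape, dtype=latents.dtype, buffer=shm.buf)
src[:] = latents                      # the single copy into the shared region

# Consumer side (generation engine): attach by name, read zero-copy.
peer = shared_memory.SharedMemory(name="u1_handoff_demo")
view = np.ndarray(latents.shape, dtype=latents.dtype, buffer=peer.buf)
handoff_ok = bool(np.array_equal(view, latents))

# Release the shared segment once the consumer has read it.
peer.close()
shm.close()
shm.unlink()
```

The key property mirrored here is that only one copy is paid on the producer side; the consumer maps the same memory instead of receiving a serialized payload.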
19
+
20
+ ![LightLLM + LightX2V decoupled architecture](./assets/lightllm_x2v.png)
21
+
22
+ This design provides practical benefits in production:
23
+
24
+ - Independent parallelism (for example, understanding with `TP=2`, generation
25
+ with `CFG=2` or `SP=2`).
26
+ - Independent resource allocation (different GPU counts and memory budgets).
27
+ - Independent scaling for text-heavy vs. image-heavy traffic.
28
+ - Better operational isolation and simpler performance tuning.
29
+
30
+ The same architecture can be deployed in two modes, depending on your hardware budget and traffic pattern:
31
+
32
+ - **Separate**: LightLLM and LightX2V run on different GPU groups.
33
+ - **Colocate**: LightLLM and LightX2V run as separate processes on the same GPU pool.
34
+
35
+ In most production setups, `Separate` is the default choice because it gives clearer bottleneck control and independent scaling. `Colocate` is useful for quick validation, generation-heavy scenarios, or smaller GPU setups.
36
+
37
+ ### Attention for Multimodal Prefill of Neo
38
+
39
+ Neo's prefill attention is not standard causal attention. Text tokens remain causal, while image tokens attend to the full text prefix together with the entire image span. To support this hybrid masking pattern, we modified both attention implementations in our stack: the Triton kernel and the official FA3 codebase. Our FA3 branch is available at [WANDY666/flash-attention](https://github.com/WANDY666/flash-attention).
40
+
41
+ Concretely, we introduced an optional `image_token_tag` argument that adjusts the mask row by row. Text rows keep the standard causal mask. Image rows, instead of using plain causal truncation, are allowed to attend to all preceding text tokens and all image tokens within the image span.
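As a dense reference for this masking rule, here is a minimal NumPy sketch. It reflects our reading of the description above; the actual Triton/FA3 kernels never materialize a full mask, and the assumption that image spans are contiguous runs of tagged tokens is ours.

```python
import numpy as np

def hybrid_mask(image_token_tag):
    """Dense reference mask: text rows stay causal; image rows additionally
    see every token inside their own image span (including later ones)."""
    tag = np.asarray(image_token_tag, dtype=bool)
    n = tag.size
    mask = np.tril(np.ones((n, n), dtype=bool))   # standard causal base
    i = 0
    while i < n:                                  # walk contiguous image spans
        if tag[i]:
            j = i
            while j < n and tag[j]:
                j += 1
            mask[i:j, i:j] = True                 # bidirectional inside the span
            i = j
        else:
            i += 1
    return mask
```

For example, with `hybrid_mask([0, 0, 1, 1, 1, 0])` the image rows 2..4 attend forward within their span and still keep the full text prefix, while text rows remain strictly causal.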
42
+
43
+ To preserve the causal-triangle speedup whenever possible, the kernel makes the decision per M-block. It OR-reduces the `image_token_tag` values inside the current block: if the block contains no image token, it keeps the standard causal K-range; if the block contains image tokens, it extends the K-range to cover the required image span. As a result, pure-text blocks still follow the normal causal path, while only the relevant blocks pay the extra work needed by the hybrid mask.
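The per-block decision can be sketched as follows. This is a simplification under our own assumptions: the `span_end` helper layout is hypothetical, and the real kernels perform the OR-reduction on tiles in registers rather than on host arrays.

```python
import numpy as np

def block_k_end(tag, span_end, block_start, block_size):
    """Upper bound of the K-range for one M-block of query rows.

    tag[i]      -- truthy if token i is an image token (hypothetical layout)
    span_end[i] -- one past the end of the image span containing token i,
                   and i + 1 for text tokens (hypothetical helper)
    """
    hi = min(block_start + block_size, len(tag))
    if not np.any(tag[block_start:hi]):
        return hi                      # pure-text block: causal K-range only
    # OR-reduce hit: extend K to cover the image span(s) touching this block.
    return max(hi, int(np.max(np.asarray(span_end)[block_start:hi])))
```

For the layout `tag = [0, 0, 1, 1, 1, 0]` (one image span at positions 2..4), a pure-text block starting at row 0 keeps its causal K-range, while a block covering rows 2..3 extends its K-range to the end of the span.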
44
+
45
+ ![Neo multimodal attention behavior](./assets/att.png)
46
+
47
+ The overhead therefore does not depend on a fixed ratio, but on how image tokens are distributed across the sequence and across M-block boundaries. When image rows are concentrated in only part of the sequence, the extra work is correspondingly localized. For text-only requests, `image_token_tag` is empty, and the kernel falls back to vanilla FA3 with no additional overhead.
48
+ The benchmark below compares two implementations for Neo-style multimodal prefill:
49
+
50
+ - **Triton implementation**: easier to migrate into existing codebases, with lower
51
+ integration cost and faster iteration.
52
+ - **FA3 implementation**: higher absolute performance on supported hardware.
53
+
54
+ | batch | max_seq_len | image_token_num | triton (ms) | fa3 (ms) | speedup (×) |
55
+ | ----: | ----------: | --------------: | -------------: | ----------: | ----------: |
56
+ | 8 | 4096 | 88 | 1.95 | 0.81 | **2.41×** |
57
+ | 8 | 8192 | 171 | 6.55 | 2.68 | **2.45×** |
58
+ | 8 | 65536 | 150 | 43.30 | 14.95 | **2.90×** |
59
+ | 16 | 4096 | 379 | 4.12 | 1.68 | **2.46×** |
60
+ | 16 | 8192 | 246 | 17.76 | 7.40 | **2.40×** |
61
+ | 16 | 65536 | 206 | 107.74 | 33.66 | **3.20×** |
62
+ | 32 | 4096 | 726 | 8.46 | 3.46 | **2.44×** |
63
+ | 32 | 8192 | 536 | 31.74 | 13.24 | **2.40×** |
64
+ | 32 | 65536 | 417 | 171.00 | 58.26 | **2.94×** |
65
+ | 64 | 4096 | 1170 | 16.08 | 6.88 | **2.34×** |
66
+ | 64 | 8192 | 1177 | 55.48 | 22.91 | **2.42×** |
67
+ | 64 | 65536 | 1291 | 348.89 | 124.82 | **2.80×** |
68
+ | 128 | 4096 | 2057 | 30.89 | 12.53 | **2.47×** |
69
+ | 128 | 8192 | 2196 | 104.73 | 43.22 | **2.42×** |
70
+ | 128 | 65536 | 2205 | 706.60 | 241.67 | **2.92×** |
71
+
72
+
73
+ ### Deployment
74
+
75
+ For a concise deployment runbook (Docker image, startup command, and API tests),
76
+ see [`deployment.md`](./deployment.md).
77
+
78
+
79
+ ### Generation Performance
80
+
81
+ The table below reports measured performance for **2048x2048** image generation
82
+ across machine types and deployment profiles.
83
+
84
+ | Machine Type | Deployment Config | Per-step Latency (s/step) | End-to-end Latency (s) |
85
+ | ---------- | ----------------- | --------------------------: | ---------------------: |
86
+ | H100 | TP2+CFG2 / colocate | 0.158 | 9.23 |
87
+ | H200 | TP2+CFG2 / colocate | 0.152 | 9.54 |
88
+ | 5090 | TP2+CFG2 / separate | 0.415 | 23.04 |
89
+ | L40S | TP2+CFG2 / separate | 0.443 | 25.62 |
90
+
91
+ In Neo, the KV cache for the generation stage is provided by the understanding module, so T2I (generation) and I2I (editing) have very similar runtime characteristics. For brevity, we report only T2I latency here.
docs/showcases.md CHANGED
@@ -11,11 +11,73 @@ open the full-resolution render.
11
 
12
  ## Text-to-Image
13
 
14
- The full 3 × 3 infographic grid (landscape / square / portrait) lives in
15
- the main [README § Text-to-Image](../README.md#text-to-image-infographics)
16
- to keep things in one place. Source prompts are in
 
17
  [`examples/t2i/data/samples_infographic.jsonl`](../examples/t2i/data/samples_infographic.jsonl).
18
 
19
  ---
20
 
21
  ## Image Editing
@@ -25,13 +87,82 @@ edit instruction rendered along the bottom of each compare tile. The same
25
  unified model handles single-image attribute / style / relighting edits
26
  and multi-reference (subject + accessory + pose) composition.
27
 
28
  Reproducible prompts are in
29
  [`examples/editing/data/samples.jsonl`](../examples/editing/data/samples.jsonl).
30
 
31
- | | | |
32
- | :---: | :---: | :---: |
33
- | [<img alt="editing compare 0001" src="./assets/showcases/editing/0001_2048x2048_compare.webp">](./assets/showcases/editing/0001_2048x2048_compare.webp) | [<img alt="editing compare 0002" src="./assets/showcases/editing/0002_2048x2048_compare.webp">](./assets/showcases/editing/0002_2048x2048_compare.webp) | [<img alt="editing compare 0003" src="./assets/showcases/editing/0003_2048x2048_compare.webp">](./assets/showcases/editing/0003_2048x2048_compare.webp) |
34
- | [<img alt="editing compare 0004" src="./assets/showcases/editing/0004_2048x2048_compare.webp">](./assets/showcases/editing/0004_2048x2048_compare.webp) | [<img alt="editing compare 0005" src="./assets/showcases/editing/0005_2400x1696_compare.webp">](./assets/showcases/editing/0005_2400x1696_compare.webp) | |
 
36
  ---
37
 
@@ -40,7 +171,10 @@ Reproducible prompts are in
40
  Each case below is a single rendered response from `model.interleave_gen`:
41
  the model first runs a `<think>...</think>` reasoning block that produces
42
  intermediate images, then emits the final interleaved text-and-image
43
- answer
44
 
45
 
46
  | |
 
14
+ The galleries below present the complete n × 3 grid layouts, covering landscape, square, and portrait formats at different resolutions.
15
+
16
+
17
+ #### 🖼️ *Text-to-Image (General)*
18
+
19
+ Reproducible prompts are in
20
+ [`examples/t2i/data/samples.jsonl`](../examples/t2i/data/samples.jsonl).
21
+
22
+
23
+ | | | |
24
+ | :---: | :---: | :---: |
25
+ | [<img width="300" alt="t2i general dense face hd 07" src="./assets/showcases/t2i_general/16_9_dense_face_hd_07.webp">](./assets/showcases/t2i_general/16_9_dense_face_hd_07.webp) | [<img width="300" alt="t2i general dense text rendering 18" src="./assets/showcases/t2i_general/16_9_dense_text_rendering_18.webp">](./assets/showcases/t2i_general/16_9_dense_text_rendering_18.webp) | [<img width="300" alt="t2i general dense text rendering 12" src="./assets/showcases/t2i_general/16_9_dense_text_rendering_12.webp">](./assets/showcases/t2i_general/16_9_dense_text_rendering_12.webp) |
26
+ | [<img width="260" alt="t2i general face hd 13" src="./assets/showcases/t2i_general/1_1_face_hd_13.webp">](./assets/showcases/t2i_general/1_1_face_hd_13.webp) | [<img width="260" alt="t2i general face hd 17" src="./assets/showcases/t2i_general/1_1_face_hd_17.webp">](./assets/showcases/t2i_general/1_1_face_hd_17.webp) | [<img width="260" alt="t2i general face hd 07" src="./assets/showcases/t2i_general/1_1_dense_artistic_10.webp">](./assets/showcases/t2i_general/1_1_dense_artistic_10.webp) |
27
+ | [<img width="260" alt="t2i general landscape 06" src="./assets/showcases/t2i_general/1_1_landscape_06.webp">](./assets/showcases/t2i_general/1_1_landscape_06.webp) | [<img width="260" alt="t2i general dense landscape 12" src="./assets/showcases/t2i_general/1_1_dense_landscape_12.webp">](./assets/showcases/t2i_general/1_1_dense_landscape_12.webp) | [<img width="260" alt="t2i general landscape 07" src="./assets/showcases/t2i_general/1_1_landscape_07.webp">](./assets/showcases/t2i_general/1_1_landscape_07.webp) |
28
+ | [<img width="200" alt="t2i general portrait artistic 02 a" src="./assets/showcases/t2i_general/9_16_dense_face_hd_10.webp">](./assets/showcases/t2i_general/9_16_dense_face_hd_10.webp) | [<img width="200" alt="t2i general portrait artistic 02 b" src="./assets/showcases/t2i_general/9_16_human_pose_11.webp">](./assets/showcases/t2i_general/9_16_human_pose_11.webp) | [<img width="200" alt="t2i general portrait artistic 07" src="./assets/showcases/t2i_general/9_16_artistic_07.webp">](./assets/showcases/t2i_general/9_16_artistic_07.webp) |
29
+ | [<img width="200" alt="t2i general portrait text rendering 02" src="./assets/showcases/t2i_general/9_16_sensenova_u1_31.webp">](./assets/showcases/t2i_general/9_16_sensenova_u1_31.webp) | [<img width="200" alt="t2i general portrait dense landscape 05" src="./assets/showcases/t2i_general/9_16_dense_landscape_05.webp">](./assets/showcases/t2i_general/9_16_dense_landscape_05.webp) | [<img width="200" alt="t2i general portrait dense artistic 11" src="./assets/showcases/t2i_general/9_16_dense_artistic_11.webp">](./assets/showcases/t2i_general/9_16_dense_artistic_11.webp) |
30
+
31
+
32
+ #### 🖼️ *Text-to-Image (Reasoning)*
33
+
34
+ Reproducible prompts are in
35
+ [`examples/t2i/data/sample_reasoning.jsonl`](../examples/t2i/data/sample_reasoning.jsonl).

<table>
<tr>
<th style="width: 20%">Original Text</th>
<th style="width: 50%">Reasoning Process</th>
<th style="width: 30%">Resulting Image</th>
</tr>
<tr>
<td style="vertical-align: top;">The playful craft that embodies Russian cultural charm</td>
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is the matryoshka, identified as a Russian craft. Essential modifiers include playful and Russian cultural charm. The relation is that the craft embodies the charm. 2. <b>Reasoning Process:</b> The prompt identifies the matryoshka, the iconic Russian wooden doll set. 3. <b>Establish the frame:</b> The frame captures the matryoshka set in the foreground. The composition focuses on the Russian craft to show the playful nature of the doll set. 4. <b>Set the lighting and color:</b> Lighting illuminates the matryoshka to reveal the Russian cultural charm. The color palette supports the playful craft aesthetic. 5. <b>Lock the style:</b> The style emphasizes the wooden nature of the Russian craft. The finish reflects the charm of the matryoshka. 6. <b>Explicit Prompt:</b> A set of three colorful, hand-painted wooden matryoshka dolls arranged by size on a rustic wooden table, bright floral patterns, soft natural light.</div></td>
<td style="vertical-align: top;"><img src="./assets/showcases/t2i_reasoning/1_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">A typical dish from the country where Naples is located</td>
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a Neapolitan pizza presented as a typical dish. The context specifies Naples, Italy, as the country of origin for this food item. 2. <b>Reasoning Process:</b> Naples is in Italy, and a classic dish is a Neapolitan pizza. 3. <b>Establish the frame:</b> The Neapolitan pizza is captured in a close-up shot that fills the central frame. The angle is slightly elevated to show the round form of the dish clearly. 4. <b>Set the lighting and color:</b> Soft lighting illuminates the surface of the Neapolitan pizza to reveal texture. Warm tones dominate the color palette, emphasizing the baked nature of the dish. 5. <b>Lock the style:</b> The image utilizes a realistic photographic style with sharp focus on the main subject. The finish appears natural and appetizing, suitable for food documentation. 6. <b>Explicit Prompt:</b> A delicious Neapolitan pizza with a soft, charred crust, tomato sauce, and fresh mozzarella, served on a rustic wooden table, realistic food photography.</div></td>
<td style="vertical-align: top;"><img src="./assets/showcases/t2i_reasoning/2_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">A gigantic bubble in the immediate foreground with a small town barely visible inside</td>
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The prompt requires a gigantic bubble positioned in the immediate foreground. A small town must be visible inside the bubble. 2. <b>Reasoning Process:</b> The bubble acts as a transparent lens, potentially distorting the town's appearance due to its curvature. 3. <b>Establish the frame:</b> The shot scale is an extreme close-up, placing the bubble so close it fills the majority of the frame. The focus plane is set on the bubble's surface, rendering the background outside the town soft and out of focus. 4. <b>Set the lighting and color:</b> Natural daylight creates bright specular highlights on the upper surface of the bubble. The color palette is dominated by the cool transparency of the bubble and the muted tones of the town within. 5. <b>Lock the style:</b> The overall image style is photorealistic with a high level of detail on the bubble's texture. The rendering captures the glossy finish and physical properties of the bubble accurately. 6. <b>Explicit Prompt:</b> An extreme close-up of a massive, transparent soap bubble floating in the immediate foreground, with a tiny, detailed miniature town visible inside the curvature, refracted and magnified by the surface tension, soft natural daylight, dreamlike, high detail, photorealistic texture.</div></td>
<td style="vertical-align: top;"><img src="./assets/showcases/t2i_reasoning/3_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">A chocolate bar left in direct sunlight, highlighting the state of the chocolate</td>
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a chocolate bar positioned in direct sunlight. The focus is on the state of the chocolate, specifically how the heat affects it. 2. <b>Reasoning Process:</b> Heat causes chocolate to melt, losing its structured shape and becoming a viscous, glossy liquid. 3. <b>Establish the frame:</b> The composition is a close-up shot that fills the frame with the chocolate bar to emphasize detail. The angle is slightly elevated to show the top surface and the pooling liquid clearly. 4. <b>Build the environment:</b> The chocolate bar rests on a generic surface that supports the object without distracting from the main subject. The background is blurred to keep attention on the foreground elements and the chocolate. 5. <b>Set the lighting and color:</b> Direct sunlight creates bright highlights on the melting chocolate, emphasizing its glossy texture. The lighting is warm and intense, casting distinct shadows and illuminating the rich brown colors of the liquid. 6. <b>Explicit Prompt:</b> A close-up of a melting chocolate bar on a surface, with the edges losing their defined shape and pooling into a glossy, viscous puddle under the heat of the sun.</div></td>
<td style="vertical-align: top;"><img src="./assets/showcases/t2i_reasoning/6_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">A solution of calcium carbonate reacting with acetic acid</td>
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a solution of calcium carbonate and acetic acid. The prompt specifies the reacting state of the chemical mixture. 2. <b>Reasoning Process:</b> The reaction produces carbon dioxide gas, which would be visible as a steady stream of bubbles rising through the liquid. 3. <b>Establish the frame:</b> The camera frames the solution closely to capture the details of the reaction. The composition centers on the liquid where the gas is visible. 4. <b>Set the lighting and color:</b> The liquid appears clear, allowing the white bubbles to stand out distinctly. The lighting is bright and even to illuminate the stream of gas. 5. <b>Lock the style:</b> The image maintains a realistic photographic style suitable for scientific observation. The focus is sharp on the reacting solution and bubbles. 6. <b>Explicit Prompt:</b> A test tube filled with a clear liquid and a rapid, effervescent stream of carbon dioxide bubbles rising to the surface, laboratory experiment.</div></td>
<td style="vertical-align: top;"><img src="./assets/showcases/t2i_reasoning/7_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
</table>

#### 🖼️ *Text-to-Image (Infographics)*

Reproducible prompts are in
[`examples/t2i/data/samples_infographic.jsonl`](../examples/t2i/data/samples_infographic.jsonl).

| | | |
| :---: | :---: | :---: |
| [<img width="300" alt="t2i landscape 0001" src="./assets/showcases/t2i_infographic/0001_2720x1536.webp">](./assets/showcases/t2i_infographic/0001_2720x1536.webp) | [<img width="300" alt="t2i landscape 0002" src="./assets/showcases/t2i_infographic/0002_2720x1536.webp">](./assets/showcases/t2i_infographic/0002_2720x1536.webp) | [<img width="300" alt="t2i landscape 0003" src="./assets/showcases/t2i_infographic/0003_2720x1536.webp">](./assets/showcases/t2i_infographic/0003_2720x1536.webp) |
| [<img width="300" alt="t2i square 0004" src="./assets/showcases/t2i_infographic/0004_2048x2048.webp">](./assets/showcases/t2i_infographic/0004_2048x2048.webp) | [<img width="300" alt="t2i square 0005" src="./assets/showcases/t2i_infographic/0005_2048x2048.webp">](./assets/showcases/t2i_infographic/0005_2048x2048.webp) | [<img width="300" alt="t2i square 0006" src="./assets/showcases/t2i_infographic/0006_2048x2048.webp">](./assets/showcases/t2i_infographic/0006_2048x2048.webp) |
| [<img width="200" alt="t2i portrait 0007" src="./assets/showcases/t2i_infographic/0007_1536x2720.webp">](./assets/showcases/t2i_infographic/0007_1536x2720.webp) | [<img width="200" alt="t2i portrait 0008" src="./assets/showcases/t2i_infographic/0008_1536x2720.webp">](./assets/showcases/t2i_infographic/0008_1536x2720.webp) | [<img width="200" alt="t2i portrait 0009" src="./assets/showcases/t2i_infographic/0009_1536x2720.webp">](./assets/showcases/t2i_infographic/0009_1536x2720.webp) |

---

## Image Editing

The unified model handles single-image attribute / style / relighting edits
and multi-reference (subject + accessory + pose) composition.

#### ✏️ *Image Editing (General)*

Reproducible prompts are in
[`examples/editing/data/samples.jsonl`](../examples/editing/data/samples.jsonl).

| | |
| :---: | :---: |
| <div align="center"><a href="../examples/editing/data/images/1.webp"><img width="180" alt="editing input 1" src="../examples/editing/data/images/1.webp"></a> <a href="../examples/editing/data/images/1_out.webp"><img width="180" alt="editing output 1" src="../examples/editing/data/images/1_out.webp"></a><br><sub>Change the jacket of the person on the left to bright yellow.</sub></div> | <div align="center"><a href="../examples/editing/data/images/3.webp"><img width="180" alt="editing input 3" src="../examples/editing/data/images/3.webp"></a> <a href="../examples/editing/data/images/3_out.webp"><img width="180" alt="editing output 3" src="../examples/editing/data/images/3_out.webp"></a><br><sub>在小狗头上放一个花环,并且把图片变为吉卜力风格。 (Put a flower wreath on the puppy's head and turn the image into Ghibli style.)</sub></div> |
| <div align="center"><a href="../examples/editing/data/images/2.webp"><img width="180" alt="editing input 2" src="../examples/editing/data/images/2.webp"></a> <a href="../examples/editing/data/images/2_out.webp"><img width="180" alt="editing output 2" src="../examples/editing/data/images/2_out.webp"></a><br><sub>Make the person in the image smile.</sub></div> | <div align="center"><a href="../examples/editing/data/images/4.webp"><img width="180" alt="editing input 4" src="../examples/editing/data/images/4.webp"></a> <a href="../examples/editing/data/images/4_out.webp"><img width="180" alt="editing output 4" src="../examples/editing/data/images/4_out.webp"></a><br><sub>Add a bouquet of flowers.</sub></div> |
| <div align="center"><a href="../examples/editing/data/images/5.webp"><img width="180" alt="editing input 5" src="../examples/editing/data/images/5.webp"></a> <a href="../examples/editing/data/images/5_out.webp"><img width="180" alt="editing output 5" src="../examples/editing/data/images/5_out.webp"></a><br><sub>Turn the image into an American comic style.</sub></div> | <div align="center"><a href="../examples/editing/data/images/8.webp"><img width="180" alt="editing input 8" src="../examples/editing/data/images/8.webp"></a> <a href="../examples/editing/data/images/8_out.webp"><img width="180" alt="editing output 8" src="../examples/editing/data/images/8_out.webp"></a><br><sub>Replace the man with a woman.</sub></div> |
| <div align="center"><a href="../examples/editing/data/images/6.webp"><img width="180" alt="editing input 6" src="../examples/editing/data/images/6.webp"></a> <a href="../examples/editing/data/images/6_out.webp"><img width="180" alt="editing output 6" src="../examples/editing/data/images/6_out.webp"></a><br><sub>Replace the text "WARFIGHTER" to "BATTLEFIELD" in the bold orange-red font.</sub></div> | <div align="center"><a href="../examples/editing/data/images/7.webp"><img width="180" alt="editing input 7" src="../examples/editing/data/images/7.webp"></a> <a href="../examples/editing/data/images/7_out.webp"><img width="180" alt="editing output 7" src="../examples/editing/data/images/7_out.webp"></a><br><sub>Remove the person on the far right wearing a green skirt and a green top.</sub></div> |

#### ✏️ *Image Editing (Reasoning)*

Reproducible prompts are in
[`examples/editing/data/samples_reasoning.jsonl`](../examples/editing/data/samples_reasoning.jsonl).

<table>
<tr>
<th style="width: 20%">Original Text</th>
<th style="width: 30%">Original Image</th>
<th style="width: 20%">Reasoning Process</th>
<th style="width: 30%">Resulting Image</th>
</tr>
<tr>
<td style="vertical-align: top;">Draw what it will look like one hour later.</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/034_temporal_reasoning_draw_what_it_will_look_like.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a glass cup of hot tea with steeping tea leaves, and the water appears relatively clear. 2. <b>Instruction Understanding:</b> The instruction asks for the appearance one hour later. 3. <b>Reasoning Process:</b> Over time, tannins and pigments leach out, making the tea noticeably darker and more uniformly colored, and the leaves may look more swollen and darker from soaking. 4. <b>Expected Visual Changes:</b> The expected visible result is a deeper amber-to-brown tea color and more fully saturated liquid. 5. <b>Elements to Preserve:</b> The glass cup, scattered leaves around it, background, and camera angle should remain unchanged. 6. <b>Explicit Edit Prompt:</b> Edit the tea liquid to a much darker, more saturated amber-brown color as if fully steeped, and make the tea leaves look slightly darker and more swollen, while keeping the glass cup, surrounding leaves, background, and framing unchanged.</div></td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/034_temporal_reasoning_draw_what_it_will_look_like_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Draw what it will look like immediately after someone stands up from sitting on it for a long time.</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/036_causal_reasoning_draw_what_it_will_look_like.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a fluffy lime-green beanbag chair that looks evenly plump and undisturbed on a white background. 2. <b>Instruction Understanding:</b> The edit instruction asks for its appearance immediately after someone stood up from sitting there for a long time. 3. <b>Reasoning Process:</b> Prolonged weight compresses the fabric and internal fill, leaving a depressed seat area, wrinkles radiating outward, and a slowly recovering shape. 4. <b>Expected Visual Changes:</b> The visible result should be a noticeable dip and creasing where a person was seated. 5. <b>Elements to Preserve:</b> The background, beanbag color, lighting, and camera angle should remain unchanged while only the beanbag’s shape shows the compression. 6. <b>Explicit Edit Prompt:</b> Edit the beanbag chair to show a clear seated depression in the center with surrounding wrinkles and slightly compressed fabric, while keeping the white background, lighting, and camera angle unchanged.</div></td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/036_causal_reasoning_draw_what_it_will_look_like_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Draw an image showing the side view of the provided traffic cone.</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/039_spatial_reasoning_draw_an_image_showing_the_si.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a 3D perspective view of a traffic cone. 2. <b>Instruction Understanding:</b> The instruction asks for a side view. 3. <b>Reasoning Process:</b> A side view of a standard traffic cone results in a triangular silhouette with a flat rectangular base. 4. <b>Expected Visual Changes:</b> The perspective is flattened into this 2D-like geometric profile. 5. <b>Elements to Preserve:</b> The cone's height and color should remain consistent with the original object. 6. <b>Explicit Edit Prompt:</b> Edit the perspective view into a flat side-profile silhouette of a triangle with a rectangular base, keeping the red color and proportions unchanged.</div></td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/039_spatial_reasoning_draw_an_image_showing_the_si_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Change the water to high-concentration saltwater</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/042_physics_change_the_water_to_high-con.jpg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows an egg resting at the bottom of a glass of water. 2. <b>Instruction Understanding:</b> The instruction asks to change the medium to high-concentration saltwater. 3. <b>Reasoning Process:</b> Saltwater is denser than fresh water, which increases the buoyant force on the egg. 4. <b>Expected Visual Changes:</b> As density increases, the egg will overcome gravity and float higher or suspend in the middle of the liquid. 5. <b>Elements to Preserve:</b> The glass and the egg's appearance should remain consistent, focusing on the shift in the egg's vertical position. 6. <b>Explicit Edit Prompt:</b> Edit the position of the egg so it is floating in the middle of the liquid instead of resting on the bottom, while keeping the glass and the egg's appearance unchanged.</div></td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/042_physics_change_the_water_to_high-con_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">What the fruit looks like when ripe in the picture</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/044_biology_what_the_fruit_looks_like_wh.jpg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows green, unripe bananas. 2. <b>Instruction Understanding:</b> The instruction asks for the appearance of the fruit when ripe. 3. <b>Reasoning Process:</b> Ripening involves a breakdown of chlorophyll and the production of sugars, which turns the skin from green to yellow and often causes small brown sugar spots to appear. 4. <b>Expected Visual Changes:</b> The color and texture of the peel should transition to a ripe state. 5. <b>Elements to Preserve:</b> The shape of the bananas and the white background should remain constant. 6. <b>Explicit Edit Prompt:</b> Edit the green bananas to be bright yellow with small brown spots, while keeping the original shape and white background unchanged.</div></td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/044_biology_what_the_fruit_looks_like_wh_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Correct the unreasonable part in the image.</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/046_anomaly_correction_correct_the_unreasonable_par.jpg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a kettle pouring water onto a mug, but the stream is misaligned and missing the cup. 2. <b>Instruction Understanding:</b> The instruction asks to fix the physical inconsistency. 3. <b>Reasoning Process:</b> The water stream must be redirected to connect the spout to the mug, maintaining the trajectory of gravity. 4. <b>Expected Visual Changes:</b> The water stream will be redirected to connect the spout to the mug. 5. <b>Elements to Preserve:</b> The kettle, mug, and background must remain unchanged while the water path is corrected. 6. <b>Explicit Edit Prompt:</b> Draw a continuous water stream connecting the kettle spout to the mug, keeping the kettle, mug, and background unchanged.</div></td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/046_anomaly_correction_correct_the_unreasonable_par_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Modify the matrix in the image to an upper triangular matrix</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/047_mathematics_modify_the_matrix_in_the_ima.jpg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a 2x2 matrix with values 1, 2, 3, and 4. 2. <b>Instruction Understanding:</b> The instruction asks to convert this to an upper triangular matrix. 3. <b>Reasoning Process:</b> By definition, an upper triangular matrix has zeros below the main diagonal, so the entry '3' must be changed to '0' while keeping '1', '2', and '4' as they are, and this modification satisfies the mathematical property requested. 4. <b>Expected Visual Changes:</b> The entry '3' in the lower-left position will be changed to '0'. 5. <b>Elements to Preserve:</b> The grid lines, the matrix structure, and the other entries must remain unchanged. 6. <b>Explicit Edit Prompt:</b> Change the '3' in the lower-left position to '0', while keeping the matrix structure and other entries unchanged.</div></td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/047_mathematics_modify_the_matrix_in_the_ima_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
</table>

---

Each case below is a single rendered response from `model.interleave_gen`:
the model first runs a `<think>...</think>` reasoning block that produces
intermediate images, then emits the final interleaved text-and-image
answer.

Reproducible prompts are in
[`examples/interleave/data/samples.jsonl`](../examples/interleave/data/samples.jsonl).
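The raw text of such a response wraps the reasoning in `<think>...</think>` before the final answer. A small illustrative helper (not part of the repo) shows how the two parts can be separated for display:

```python
import re


def split_think(response: str):
    """Separate the <think>...</think> reasoning block from the final answer.

    Returns a (reasoning, answer) pair; reasoning is empty when no
    <think> block is present in the response text.
    """
    m = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if m is None:
        # No reasoning block: treat the whole response as the answer.
        return "", response.strip()
    return m.group(1).strip(), response[m.end():].strip()
```

Image placeholders inside either part would still need to be resolved against the decoded images returned by the model.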

| |