Upload folder using huggingface_hub
- .gitattributes +27 -0
- docs/assets/att.png +3 -0
- docs/assets/benchmarks/generation.webp +3 -0
- docs/assets/benchmarks/interleaved.webp +0 -0
- docs/assets/benchmarks/understanding.webp +3 -0
- docs/assets/lightllm_x2v.png +3 -0
- docs/assets/showcases/t2i_general/16_9_dense_face_hd_07.webp +3 -0
- docs/assets/showcases/t2i_general/16_9_dense_text_rendering_12.webp +3 -0
- docs/assets/showcases/t2i_general/16_9_dense_text_rendering_18.webp +0 -0
- docs/assets/showcases/t2i_general/1_1_artistic_02.webp +3 -0
- docs/assets/showcases/t2i_general/1_1_dense_artistic_09.webp +0 -0
- docs/assets/showcases/t2i_general/1_1_dense_artistic_10.webp +3 -0
- docs/assets/showcases/t2i_general/1_1_dense_artistic_19.webp +3 -0
- docs/assets/showcases/t2i_general/1_1_dense_landscape_12.webp +3 -0
- docs/assets/showcases/t2i_general/1_1_face_hd_13.webp +3 -0
- docs/assets/showcases/t2i_general/1_1_face_hd_17.webp +3 -0
- docs/assets/showcases/t2i_general/1_1_landscape_06.webp +3 -0
- docs/assets/showcases/t2i_general/1_1_landscape_07.webp +3 -0
- docs/assets/showcases/t2i_general/9_16_artistic_07.webp +3 -0
- docs/assets/showcases/t2i_general/9_16_dense_artistic_11.webp +3 -0
- docs/assets/showcases/t2i_general/9_16_dense_face_hd_10.webp +3 -0
- docs/assets/showcases/t2i_general/9_16_dense_landscape_05.webp +3 -0
- docs/assets/showcases/t2i_general/9_16_human_pose_11.webp +3 -0
- docs/assets/showcases/t2i_general/9_16_sensenova_u1_31.webp +0 -0
- docs/assets/showcases/t2i_general/9_16_text_rendering_02.webp +3 -0
- docs/assets/showcases/t2i_reasoning/1_reasoning.png +3 -0
- docs/assets/showcases/t2i_reasoning/2_reasoning.png +3 -0
- docs/assets/showcases/t2i_reasoning/3_reasoning.png +3 -0
- docs/assets/showcases/t2i_reasoning/4_reasoning.png +3 -0
- docs/assets/showcases/t2i_reasoning/5_reasoning.png +3 -0
- docs/assets/showcases/t2i_reasoning/6_reasoning.png +3 -0
- docs/assets/showcases/t2i_reasoning/7_reasoning.png +3 -0
- docs/assets/teaser.png +2 -2
- docs/deployment.md +143 -0
- docs/inference_infrastructure.md +81 -2
- docs/showcases.md +142 -8
.gitattributes
CHANGED

@@ -70,3 +70,30 @@ docs/assets/showcases/t2i_infographic/0009_1536x2720.webp filter=lfs diff=lfs merge=lfs -text
 docs/assets/showcases/vqa/agentic_case.webp filter=lfs diff=lfs merge=lfs -text
 docs/assets/showcases/vqa/general_case.webp filter=lfs diff=lfs merge=lfs -text
 docs/assets/teaser.png filter=lfs diff=lfs merge=lfs -text
+docs/assets/att.png filter=lfs diff=lfs merge=lfs -text
+docs/assets/benchmarks/generation.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/benchmarks/understanding.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/lightllm_x2v.png filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/16_9_dense_face_hd_07.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/16_9_dense_text_rendering_12.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/1_1_artistic_02.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/1_1_dense_artistic_10.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/1_1_dense_artistic_19.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/1_1_dense_landscape_12.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/1_1_face_hd_13.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/1_1_face_hd_17.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/1_1_landscape_06.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/1_1_landscape_07.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/9_16_artistic_07.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/9_16_dense_artistic_11.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/9_16_dense_face_hd_10.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/9_16_dense_landscape_05.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/9_16_human_pose_11.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_general/9_16_text_rendering_02.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_reasoning/1_reasoning.png filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_reasoning/2_reasoning.png filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_reasoning/3_reasoning.png filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_reasoning/4_reasoning.png filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_reasoning/5_reasoning.png filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_reasoning/6_reasoning.png filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/t2i_reasoning/7_reasoning.png filter=lfs diff=lfs merge=lfs -text
docs/deployment.md
ADDED
|
@@ -0,0 +1,143 @@
# LightLLM + LightX2V Deployment

<p align="center">
<a href="../README.md">← Back to main README</a>
</p>

This guide provides a practical deployment flow for serving SenseNova-U1 with
LightLLM + LightX2V using the Docker image
`lightx2v/lightllm_lightx2v:20260407`.

## 1) Pull and enter the Docker image

```bash
docker pull lightx2v/lightllm_lightx2v:20260407
docker run --gpus all --ipc=host --network host -it lightx2v/lightllm_lightx2v:20260407 /bin/bash
```

## 2) Clone runtime dependencies inside the container

The image may not include the latest source trees. Clone both repositories and
pin LightLLM to the validated branch:

```bash
git clone https://github.com/ModelTC/LightX2V.git
git clone https://github.com/ModelTC/LightLLM.git
cd LightLLM
git checkout neo_plus_clean
```

## 3) X2I-related arguments

When enabling image generation in the same API server, use the following flags:

- `--enable_multimodal_x2i`
  Enable image generation capability.
- `--x2i_server_used_gpus`
  Number of GPUs reserved for the X2I generation server.
- `--x2i_server_deploy_mode {colocate,separate}`
  - `colocate`: understanding and generation share the same visible GPU pool.
  - `separate`: understanding and generation are deployed as separate services, and
    can use different GPU sets.
- `--x2i_use_naive_impl`
  Use the native/naive PyTorch backend for X2I (debugging/testing only, not for
  production throughput).
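The two deploy modes imply different total GPU budgets. The sketch below is a hypothetical helper illustrating that arithmetic, not LightLLM's actual allocator; the `colocate` pool-sharing rule is an assumption inferred from the Mode A example in this guide (2 GPUs total with `--tp 2` and `--x2i_server_used_gpus 2`).

```python
def required_gpus(deploy_mode: str, tp: int, x2i_gpus: int) -> int:
    # Hypothetical resource math (an assumption, not LightLLM's allocator):
    # "colocate" shares one visible GPU pool between understanding (tp)
    # and generation (x2i), while "separate" uses disjoint GPU sets.
    if deploy_mode == "colocate":
        return max(tp, x2i_gpus)
    if deploy_mode == "separate":
        return tp + x2i_gpus
    raise ValueError(f"unknown deploy mode: {deploy_mode}")

print(required_gpus("colocate", tp=2, x2i_gpus=2))  # shared pool of 2 GPUs
print(required_gpus("separate", tp=2, x2i_gpus=1))  # 3 GPUs in total
```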
+
## 4) Deployment modes
|
| 47 |
+
|
| 48 |
+
### Mode A: `colocate` (single service, shared GPU pool)
|
| 49 |
+
|
| 50 |
+
Use this mode for quick validation and simpler operations. The LLM understanding
|
| 51 |
+
path (`--tp`) and X2I generation path (`--x2i_server_used_gpus`) consume resources
|
| 52 |
+
from the same visible GPUs.
|
| 53 |
+
|
| 54 |
+
Example (2 GPUs total):
|
| 55 |
+
- understanding path: `tp=2`
|
| 56 |
+
- generation path: `cfg=2` (configured in `neopp_dense_parallel_cfg.json`)
|
| 57 |
+
|
| 58 |
+
```bash
|
| 59 |
+
PYTHONPATH=/workspace/LightX2V/ \
|
| 60 |
+
python -m lightllm.server.api_server \
|
| 61 |
+
--model_dir $MODEL_DIR \
|
| 62 |
+
--enable_multimodal_x2i \
|
| 63 |
+
--x2i_server_deploy_mode colocate \
|
| 64 |
+
--x2i_server_used_gpus 2 \
|
| 65 |
+
--x2v_gen_model_config /workspace/LightX2V/configs/neopp/neopp_dense_parallel_cfg.json \
|
| 66 |
+
--host 0.0.0.0 \
|
| 67 |
+
--port 8000 \
|
| 68 |
+
--max_req_total_len 65536 \
|
| 69 |
+
--mem_fraction 0.75 \
|
| 70 |
+
--tp 2
|
| 71 |
+
```
|
| 72 |
+
|
| 73 |
+
### Mode B: `separate` (understanding and generation decoupled)
|
| 74 |
+
|
| 75 |
+
`separate` is conceptually similar to PD-style decoupling in LLM serving: split
|
| 76 |
+
different stages onto different GPU groups so a long stage does not block the
|
| 77 |
+
short stage.
|
| 78 |
+
|
| 79 |
+
For multimodal serving, image generation is usually the long stage, while
|
| 80 |
+
understanding is short and lightweight. Separating them allows understanding
|
| 81 |
+
requests to keep flowing even when generation workers are busy.
|
| 82 |
+
|
| 83 |
+
Recommended deployment profiles:
|
| 84 |
+
|
| 85 |
+
1. **Default profile (continuity-first): Understanding `tp=1` + Generation 1 GPU**
|
| 86 |
+
- Understanding: `--tp 1`
|
| 87 |
+
- Generation: `--x2i_server_used_gpus 1`
|
| 88 |
+
- Use as the baseline profile for mixed workloads. It keeps the pipeline simple
|
| 89 |
+
while avoiding head-of-line blocking between understanding and generation.
|
| 90 |
+
|
| 91 |
+
2. **Understanding-expanded profile: Understanding `tp=2` + Generation 1 GPU**
|
| 92 |
+
- Understanding: `--tp 2`
|
| 93 |
+
- Generation: `--x2i_server_used_gpus 1`
|
| 94 |
+
- Use when complex prompts or high understanding QPS become the bottleneck.
|
| 95 |
+
|
| 96 |
+
3. **Generation-expanded profile: Understanding `tp=1/2` + Generation parallel**
|
| 97 |
+
- Understanding: `--tp 1` or `--tp 2`
|
| 98 |
+
- Generation option A (2 GPUs): `--x2i_server_used_gpus 2` +
|
| 99 |
+
`/workspace/LightX2V/configs/neopp/neopp_dense_parallel_cfg.json`
|
| 100 |
+
- Generation option B (4 GPUs): `--x2i_server_used_gpus 4` +
|
| 101 |
+
`/workspace/LightX2V/configs/neopp/neopp_dense_parallel_cfg_seq.json`
|
| 102 |
+
- Use when generation latency/throughput dominates (most common scaling path).
|
| 103 |
+
|
| 104 |
+
Example launch (separate mode in API server):
|
| 105 |
+
|
| 106 |
+
```bash
|
| 107 |
+
PYTHONPATH=/workspace/LightX2V/ \
|
| 108 |
+
python -m lightllm.server.api_server \
|
| 109 |
+
--model_dir $MODEL_DIR \
|
| 110 |
+
--enable_multimodal_x2i \
|
| 111 |
+
--x2i_server_deploy_mode separate \
|
| 112 |
+
--x2i_server_used_gpus 1 \
|
| 113 |
+
--x2v_gen_model_config /workspace/LightX2V/configs/neopp/neopp_dense.json \
|
| 114 |
+
--host 0.0.0.0 \
|
| 115 |
+
--port 8000 \
|
| 116 |
+
--max_req_total_len 65536 \
|
| 117 |
+
--mem_fraction 0.75 \
|
| 118 |
+
--tp 2
|
| 119 |
+
```
|
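Once the server is up, a quick smoke test can be sent from Python. This is an illustrative sketch only: the `/generate` endpoint and the `inputs`/`parameters` payload fields follow LightLLM's commonly documented HTTP schema, but they are assumptions here and may differ across LightLLM versions.

```python
import json
import urllib.request

def build_generate_request(prompt: str, max_new_tokens: int = 512) -> dict:
    # Assumed payload shape for LightLLM's /generate endpoint; verify the
    # exact field names against the LightLLM version in your image.
    return {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "do_sample": False},
    }

def send(payload: dict, host: str = "127.0.0.1", port: int = 8000) -> str:
    # Requires the API server launched above to be running and reachable.
    req = urllib.request.Request(
        f"http://{host}:{port}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

payload = build_generate_request("Describe this deployment in one sentence.", 64)
print(json.dumps(payload))
```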
## 5) Quantization

`separate` mode also enables independent quantization strategies for
understanding and generation.

Because understanding and generation are decoupled, you can tune quality/latency
for each path independently:

1. **Understanding FP16/BF16 + generation FP8**
   - Understanding: no quantization flag (keep default precision)
   - Generation: use an FP8 generation config, for example
     `/workspace/LightX2V/configs/neopp/neopp_dense_fp8.json`
   - Recommended as the default quantized profile for production.

2. **Understanding FP8 + generation FP8**
   - Understanding: add `--quant_type fp8w8a8`
   - Generation: use the FP8 generation config
     `/workspace/LightX2V/configs/neopp/neopp_dense_fp8.json`
   - Use when GPU memory/throughput is the primary constraint.

Notes:
- `--quant_type fp8w8a8` controls quantization on the understanding path.
- Generation-side precision is controlled by `--x2v_gen_model_config`.
docs/inference_infrastructure.md
CHANGED
|
@@ -6,7 +6,86 @@
This document describes the inference infrastructure behind **SenseNova-U1**, built on top of **[LightLLM](https://github.com/ModelTC/lightllm)** and **[LightX2V](https://github.com/ModelTC/lightx2v)**.

## Overview

SenseNova-U1 is exposed as one unified multimodal model, but the understanding and generation paths exhibit different execution shapes in production. They tend to prefer different scheduling policies, parallelization strategies, and resource ratios, rather than a single shared serving configuration. When both are coupled inside one monolithic runtime, these choices become unnecessarily tied together, which can leave both paths operating away from their respective optimal points.

To avoid this coupling, SenseNova-U1 adopts a **disaggregated** architecture:

- **LightLLM** for understanding, text streaming, and control flow
- **LightX2V** for image generation

These two engines exchange generation state through pinned shared memory and high-performance transfer kernels. The handoff is lightweight, while each side can still run with its own optimal execution policy.

![](./assets/lightllm_x2v.png)

This design provides practical benefits in production:

- Independent parallelism (for example, understanding with `TP=2`, generation with `CFG=2` or `SP=2`).
- Independent resource allocation (different GPU counts and memory budgets).
- Independent scaling for text-heavy vs. image-heavy traffic.
- Better operational isolation and simpler performance tuning.

The same architecture can be deployed in two modes, depending on your hardware budget and traffic pattern:

- **Separate**: LightLLM and LightX2V run on different GPU groups.
- **Colocate**: LightLLM and LightX2V run as separate processes on the same GPU.

In most production setups, `Separate` is the default choice because it gives clearer bottleneck control and independent scaling. `Colocate` is useful for quick validation, generation-heavy workloads, or smaller GPU setups.

### Attention for Multimodal Prefill of Neo

Neo's prefill attention is not standard causal attention. Text tokens remain causal, while image tokens attend to the full text prefix together with the entire image span. To support this hybrid masking pattern, we modified both attention implementations in our stack: the Triton kernel and the official FA3 codebase. Our FA3 branch is available at [WANDY666/flash-attention](https://github.com/WANDY666/flash-attention).

Concretely, we introduced an optional `image_token_tag` argument that adjusts the mask row by row. Text rows keep the standard causal mask. Image rows, instead of using plain causal truncation, are allowed to attend to all preceding text tokens and all image tokens within the image span.

To preserve the causal-triangle speedup whenever possible, the kernel makes the decision per M-block. It OR-reduces the `image_token_tag` values inside the current block: if the block contains no image token, it keeps the standard causal K-range; if the block contains image tokens, it extends the K-range to cover the required image span. As a result, pure-text blocks still follow the normal causal path, while only the relevant blocks pay the extra work needed by the hybrid mask.

![](./assets/att.png)

The overhead therefore does not depend on a fixed ratio, but on how image tokens are distributed across the sequence and across M-block boundaries. When image rows are concentrated in only part of the sequence, the extra work is correspondingly localized. For text-only requests, `image_token_tag` is empty, and the kernel falls back to vanilla FA3 with no additional overhead.
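The hybrid mask can be sketched in plain Python as a reference construction of the mask semantics only; this is not the Triton/FA3 kernel, and the per-M-block OR-reduction described above is an optimization that produces this same mask more cheaply.

```python
def hybrid_prefill_mask(image_token_tag):
    """Boolean S x S mask: mask[i][j] is True when query i may attend to key j.

    Text rows keep the standard causal mask. Rows inside a contiguous image
    span may additionally attend to every token of that span, including
    positions ahead of them, matching the hybrid pattern described above.
    """
    S = len(image_token_tag)
    mask = [[j <= i for j in range(S)] for i in range(S)]  # causal baseline
    i = 0
    while i < S:
        if image_token_tag[i]:
            j = i
            while j < S and image_token_tag[j]:
                j += 1                      # [i, j) is one contiguous image span
            for r in range(i, j):           # every image row in the span...
                for c in range(i, j):       # ...sees the whole image span
                    mask[r][c] = True
            i = j
        else:
            i += 1
    return mask

# Example: 3 text tokens, a 4-token image span, then 2 more text tokens.
tag = [0, 0, 0, 1, 1, 1, 1, 0, 0]
m = hybrid_prefill_mask(tag)
```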
The benchmark below compares two implementations for Neo-style multimodal prefill:

- **Triton implementation**: easier to migrate into existing codebases, with lower integration cost and faster iteration.
- **FA3 implementation**: higher absolute performance on supported hardware.

| batch | max_seq_len | image_token_num | triton (ms) | fa3 (ms) | speedup (×) |
| ----: | ----------: | --------------: | ----------: | -------: | ----------: |
| 8 | 4096 | 88 | 1.95 | 0.81 | **2.41×** |
| 8 | 8192 | 171 | 6.55 | 2.68 | **2.45×** |
| 8 | 65536 | 150 | 43.30 | 14.95 | **2.90×** |
| 16 | 4096 | 379 | 4.12 | 1.68 | **2.46×** |
| 16 | 8192 | 246 | 17.76 | 7.40 | **2.40×** |
| 16 | 65536 | 206 | 107.74 | 33.66 | **3.20×** |
| 32 | 4096 | 726 | 8.46 | 3.46 | **2.44×** |
| 32 | 8192 | 536 | 31.74 | 13.24 | **2.40×** |
| 32 | 65536 | 417 | 171.00 | 58.26 | **2.94×** |
| 64 | 4096 | 1170 | 16.08 | 6.88 | **2.34×** |
| 64 | 8192 | 1177 | 55.48 | 22.91 | **2.42×** |
| 64 | 65536 | 1291 | 348.89 | 124.82 | **2.80×** |
| 128 | 4096 | 2057 | 30.89 | 12.53 | **2.47×** |
| 128 | 8192 | 2196 | 104.73 | 43.22 | **2.42×** |
| 128 | 65536 | 2205 | 706.60 | 241.67 | **2.92×** |
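The speedup column is simply the ratio of the two measured latencies; a minimal check, using values taken from the table above:

```python
def speedup(triton_ms: float, fa3_ms: float) -> float:
    # Ratio of Triton latency to FA3 latency, rounded like the table's last column.
    return round(triton_ms / fa3_ms, 2)

print(speedup(1.95, 0.81))    # 2.41 (batch 8, max_seq_len 4096)
print(speedup(43.30, 14.95))  # 2.9 (batch 8, max_seq_len 65536)
```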
### Deployment

For a concise deployment runbook (Docker image, startup command, and API tests),
see [`deployment.md`](./deployment.md).

### Generation Performance

The table below reports measured latency for **2048x2048** image generation
across machine types and deployment profiles.

| Machine Type | Deployment Config | Per-step Latency (s/step) | End-to-end Latency (s) |
| ------------ | ----------------- | ------------------------: | ---------------------: |
| H100 | TP2+CFG2 / colocate | 0.158 | 9.23 |
| H200 | TP2+CFG2 / colocate | 0.152 | 9.54 |
| 5090 | TP2+CFG2 / separate | 0.415 | 23.04 |
| L40S | TP2+CFG2 / separate | 0.443 | 25.62 |

In Neo, the KV cache for the generation stage is provided by the understanding module, so T2I (generation) and I2I (editing) have very similar runtime characteristics. For brevity, we report only T2I latency here.
docs/showcases.md
CHANGED
|
@@ -11,11 +11,73 @@ open the full-resolution render.
|
|
| 11 |
|
| 12 |
## Text-to-Image
|
| 13 |
|
| 14 |
-
The
|
| 15 |
-
|
| 16 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 17 |
[`examples/t2i/data/samples_infographic.jsonl`](../examples/t2i/data/samples_infographic.jsonl).
|
| 18 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 19 |
---
|
| 20 |
|
| 21 |
## Image Editing
|
|
@@ -25,13 +87,82 @@ edit instruction rendered along the bottom of each compare tile. The same
|
|
| 25 |
unified model handles single-image attribute / style / relighting edits
|
| 26 |
and multi-reference (subject + accessory + pose) composition.
|
| 27 |
|
|
|
|
|
|
|
| 28 |
Reproducible prompts are in
|
| 29 |
[`examples/editing/data/samples.jsonl`](../examples/editing/data/samples.jsonl).
|
| 30 |
|
| 31 |
-
| | |
|
| 32 |
-
| :---: | :---: |
|
| 33 |
-
|
|
| 34 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 35 |
|
| 36 |
---
|
| 37 |
|
|
@@ -40,7 +171,10 @@ Reproducible prompts are in
|
|
| 40 |
Each case below is a single rendered response from `model.interleave_gen`:
|
| 41 |
the model first runs a `<think>...</think>` reasoning block that produces
|
| 42 |
intermediate images, then emits the final interleaved text-and-image
|
| 43 |
-
answer
|
|
|
|
|
|
|
|
|
|
| 44 |
|
| 45 |
|
| 46 |
| |
|
|
|
|
| 11 |
|
| 12 |
## Text-to-Image
|
| 13 |
|
| 14 |
+
The main table presents the complete n × 3 grid layouts, covering landscape, square, and portrait formats at different resolutions.
|
| 15 |
+
|
| 16 |
+
|
| 17 |
+
#### 🖼️ *Text-to-Image (General)*
|
| 18 |
+
|
| 19 |
+
Reproducible prompts are in
|
| 20 |
+
[`examples/t2i/data/samples.jsonl`](../examples/t2i/data/samples.jsonl).
|
| 21 |
+
|
| 22 |
+
|
| 23 |
+
| | | |
|
| 24 |
+
| :---: | :---: | :---: |
|
| 25 |
+
| [<img width="300" alt="t2i general dense face hd 07" src="./assets/showcases/t2i_general/16_9_dense_face_hd_07.webp">](./assets/showcases/t2i_general/16_9_dense_face_hd_07.webp) | [<img width="300" alt="t2i general dense text rendering 18" src="./assets/showcases/t2i_general/16_9_dense_text_rendering_18.webp">](./assets/showcases/t2i_general/16_9_dense_text_rendering_18.webp) | [<img width="300" alt="t2i general dense text rendering 12" src="./assets/showcases/t2i_general/16_9_dense_text_rendering_12.webp">](./assets/showcases/t2i_general/16_9_dense_text_rendering_12.webp) |
|
| 26 |
+
| [<img width="260" alt="t2i general face hd 13" src="./assets/showcases/t2i_general/1_1_face_hd_13.webp">](./assets/showcases/t2i_general/1_1_face_hd_13.webp) | [<img width="260" alt="t2i general face hd 17" src="./assets/showcases/t2i_general/1_1_face_hd_17.webp">](./assets/showcases/t2i_general/1_1_face_hd_17.webp) | [<img width="260" alt="t2i general face hd 07" src="./assets/showcases/t2i_general/1_1_dense_artistic_10.webp">](./assets/showcases/t2i_general/1_1_dense_artistic_10.webp) |
|
| 27 |
+
| [<img width="260" alt="t2i general landscape 06" src="./assets/showcases/t2i_general/1_1_landscape_06.webp">](./assets/showcases/t2i_general/1_1_landscape_06.webp) | [<img width="260" alt="t2i general dense landscape 12" src="./assets/showcases/t2i_general/1_1_dense_landscape_12.webp">](./assets/showcases/t2i_general/1_1_dense_landscape_12.webp) | [<img width="260" alt="t2i general landscape 07" src="./assets/showcases/t2i_general/1_1_landscape_07.webp">](./assets/showcases/t2i_general/1_1_landscape_07.webp) |
|
| 28 |
+
| [<img width="200" alt="t2i general portrait artistic 02 a" src="./assets/showcases/t2i_general/9_16_dense_face_hd_10.webp">](./assets/showcases/t2i_general/9_16_dense_face_hd_10.webp) | [<img width="200" alt="t2i general portrait artistic 02 b" src="./assets/showcases/t2i_general/9_16_human_pose_11.webp">](./assets/showcases/t2i_general/9_16_human_pose_11.webp) | [<img width="200" alt="t2i general portrait artistic 07" src="./assets/showcases/t2i_general/9_16_artistic_07.webp">](./assets/showcases/t2i_general/9_16_artistic_07.webp) |
|
| 29 |
+
| [<img width="200" alt="t2i general portrait text rendering 02" src="./assets/showcases/t2i_general/9_16_sensenova_u1_31.webp">](./assets/showcases/t2i_general/9_16_sensenova_u1_31.webp) | [<img width="200" alt="t2i general portrait dense landscape 05" src="./assets/showcases/t2i_general/9_16_dense_landscape_05.webp">](./assets/showcases/t2i_general/9_16_dense_landscape_05.webp) | [<img width="200" alt="t2i general portrait dense artistic 11" src="./assets/showcases/t2i_general/9_16_dense_artistic_11.webp">](./assets/showcases/t2i_general/9_16_dense_artistic_11.webp) |
|
| 30 |
+
|
| 31 |
+
|
| 32 |
+
#### 🖼️ *Text-to-Image (Reasoning)*
|
| 33 |
+
|
| 34 |
+
Reproducible prompts are in
|
| 35 |
+
[`examples/t2i/data/sample_reasoning.jsonl`](../examples/t2i/data/sample_reasoning.jsonl).
|
| 36 |
+
|
| 37 |
+
<table>
|
| 38 |
+
<tr>
|
| 39 |
+
<th style="width: 20%">Original Text</th>
|
| 40 |
+
<th style="width: 50%">Reasoning Process</th>
|
| 41 |
+
<th style="width: 30%">Resulting Image</th>
|
| 42 |
+
</tr>
|
| 43 |
+
<tr>
|
| 44 |
+
<td style="vertical-align: top;">The playful craft that embodies Russian cultural charm</td>
|
| 45 |
+
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is the matryoshka, identified as a Russian craft. Essential modifiers include playful and Russian cultural charm. The relation is that the craft embodies the charm. 2. <b>Reasoning Process:</b> The prompt identifies the matryoshka, the iconic Russian wooden doll set. 3. <b>Establish the frame:</b> The frame captures the matryoshka set in the foreground. The composition focuses on the Russian craft to show the playful nature of the doll set. 4. <b>Set the lighting and color:</b> Lighting illuminates the matryoshka to reveal the Russian cultural charm. The color palette supports the playful craft aesthetic. 5. <b>Lock the style:</b> The style emphasizes the wooden nature of the Russian craft. The finish reflects the charm of the matryoshka. 6. <b>Explicit Prompt:</b> A set of three colorful, hand-painted wooden matryoshka dolls arranged by size on a rustic wooden table, bright floral patterns, soft natural light.</div></td>
|
| 46 |
+
<td style="vertical-align: top;"><img src="./assets/showcases/t2i_reasoning/1_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
|
| 47 |
+
</tr>
|
| 48 |
+
<tr>
|
| 49 |
+
<td style="vertical-align: top;">A typical dish from the country where Naples is located</td>
|
| 50 |
+
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a Neapolitan pizza presented as a typical dish. The context specifies Naples, Italy, as the country of origin for this food item. 2. <b>Reasoning Process:</b> Naples is in Italy, and a classic dish is a Neapolitan pizza. 3. <b>Establish the frame:</b> The Neapolitan pizza is captured in a close-up shot that fills the central frame. The angle is slightly elevated to show the round form of the dish clearly. 4. <b>Set the lighting and color:</b> Soft lighting illuminates the surface of the Neapolitan pizza to reveal texture. Warm tones dominate the color palette, emphasizing the baked nature of the dish. 5. <b>Lock the style:</b> The image utilizes a realistic photographic style with sharp focus on the main subject. The finish appears natural and appetizing, suitable for food documentation. 6. <b>Explicit Prompt:</b> A delicious Neapolitan pizza with a soft, charred crust, tomato sauce, and fresh mozzarella, served on a rustic wooden table, realistic food photography.</div></td>
|
| 51 |
+
<td style="vertical-align: top;"><img src="./assets/showcases/t2i_reasoning/2_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
|
| 52 |
+
</tr>
|
| 53 |
+
<tr>
|
| 54 |
+
<td style="vertical-align: top;">A gigantic bubble in the immediate foreground with a small town barely visible inside</td>
|
| 55 |
+
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The prompt requires a gigantic bubble positioned in the immediate foreground. A small town must be visible inside the bubble. 2. <b>Reasoning Process:</b> The bubble acts as a transparent lens, potentially distorting the town's appearance due to its curvature. 3. <b>Establish the frame:</b> The shot scale is an extreme close-up, placing the bubble so close it fills the majority of the frame. The focus plane is set on the bubble's surface, rendering the background outside the town soft and out of focus. 4. <b>Set the lighting and color:</b> Natural daylight creates bright specular highlights on the upper surface of the bubble. The color palette is dominated by the cool transparency of the bubble and the muted tones of the town within. 5. <b>Lock the style:</b> The overall image style is photorealistic with a high level of detail on the bubble's texture. The rendering captures the glossy finish and physical properties of the bubble accurately. 6. <b>Explicit Prompt:</b> An extreme close-up of a massive, transparent soap bubble floating in the immediate foreground, with a tiny, detailed miniature town visible inside the curvature, refracted and magnified by the surface tension, soft natural daylight, dreamlike, high detail, photorealistic texture.</div></td>
<td style="vertical-align: top;"><img src="./assets/showcases/t2i_reasoning/3_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">A chocolate bar left in direct sunlight, highlighting the state of the chocolate</td>
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a chocolate bar positioned in direct sunlight. The focus is on the state of the chocolate, specifically how the heat affects it. 2. <b>Reasoning Process:</b> Heat causes chocolate to melt, losing its structured shape and becoming a viscous, glossy liquid. 3. <b>Establish the frame:</b> The composition is a close-up shot that fills the frame with the chocolate bar to emphasize detail. The angle is slightly elevated to show the top surface and the pooling liquid clearly. 4. <b>Build the environment:</b> The chocolate bar rests on a generic surface that supports the object without distracting from the main subject. The background is blurred to keep attention on the foreground elements and the chocolate. 5. <b>Set the lighting and color:</b> Direct sunlight creates bright highlights on the melting chocolate, emphasizing its glossy texture. The lighting is warm and intense, casting distinct shadows and illuminating the rich brown colors of the liquid. 6. <b>Explicit Prompt:</b> A close-up of a melting chocolate bar on a surface, with the edges losing their defined shape and pooling into a glossy, viscous puddle under the heat of the sun.</div></td>
<td style="vertical-align: top;"><img src="./assets/showcases/t2i_reasoning/6_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">A solution of calcium carbonate reacting with acetic acid</td>
<td><div style="max-height: 200px; overflow-y: auto;">1. <b>Instruction Understanding:</b> The core subject is a solution of calcium carbonate and acetic acid. The prompt specifies the reacting state of the chemical mixture. 2. <b>Reasoning Process:</b> The reaction produces carbon dioxide gas, which would be visible as a steady stream of bubbles rising through the liquid. 3. <b>Establish the frame:</b> The camera frames the solution closely to capture the details of the reaction. The composition centers on the liquid where the gas is visible. 4. <b>Set the lighting and color:</b> The liquid appears clear, allowing the white bubbles to stand out distinctly. The lighting is bright and even to illuminate the stream of gas. 5. <b>Lock the style:</b> The image maintains a realistic photographic style suitable for scientific observation. The focus is sharp on the reacting solution and bubbles. 6. <b>Explicit Prompt:</b> A test tube filled with a clear liquid and a rapid, effervescent stream of carbon dioxide bubbles rising to the surface, laboratory experiment.</div></td>
<td style="vertical-align: top;"><img src="./assets/showcases/t2i_reasoning/7_reasoning.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
</table>
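
The carbon-dioxide bubbles in the last example follow from the standard acid-carbonate reaction; for reference, the balanced equation is:

```latex
\mathrm{CaCO_3} + 2\,\mathrm{CH_3COOH} \longrightarrow \mathrm{Ca(CH_3COO)_2} + \mathrm{H_2O} + \mathrm{CO_2}\uparrow
```

The evolved CO₂ gas is what the model renders as the effervescent stream rising through the clear liquid.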
#### 🖼️ *Text-to-Image (Infographics)*
Reproducible prompts are in
[`examples/t2i/data/samples_infographic.jsonl`](../examples/t2i/data/samples_infographic.jsonl).
| | | |
| :---: | :---: | :---: |
| [<img width="300" alt="t2i landscape 0001" src="./assets/showcases/t2i_infographic/0001_2720x1536.webp">](./assets/showcases/t2i_infographic/0001_2720x1536.webp) | [<img width="300" alt="t2i landscape 0002" src="./assets/showcases/t2i_infographic/0002_2720x1536.webp">](./assets/showcases/t2i_infographic/0002_2720x1536.webp) | [<img width="300" alt="t2i landscape 0003" src="./assets/showcases/t2i_infographic/0003_2720x1536.webp">](./assets/showcases/t2i_infographic/0003_2720x1536.webp) |
| [<img width="300" alt="t2i square 0004" src="./assets/showcases/t2i_infographic/0004_2048x2048.webp">](./assets/showcases/t2i_infographic/0004_2048x2048.webp) | [<img width="300" alt="t2i square 0005" src="./assets/showcases/t2i_infographic/0005_2048x2048.webp">](./assets/showcases/t2i_infographic/0005_2048x2048.webp) | [<img width="300" alt="t2i square 0006" src="./assets/showcases/t2i_infographic/0006_2048x2048.webp">](./assets/showcases/t2i_infographic/0006_2048x2048.webp) |
| [<img width="200" alt="t2i portrait 0007" src="./assets/showcases/t2i_infographic/0007_1536x2720.webp">](./assets/showcases/t2i_infographic/0007_1536x2720.webp) | [<img width="200" alt="t2i portrait 0008" src="./assets/showcases/t2i_infographic/0008_1536x2720.webp">](./assets/showcases/t2i_infographic/0008_1536x2720.webp) | [<img width="200" alt="t2i portrait 0009" src="./assets/showcases/t2i_infographic/0009_1536x2720.webp">](./assets/showcases/t2i_infographic/0009_1536x2720.webp) |
---
## Image Editing
The unified model handles single-image attribute / style / relighting edits
and multi-reference (subject + accessory + pose) composition.
#### ✏️ *Image Editing (General)*
Reproducible prompts are in
[`examples/editing/data/samples.jsonl`](../examples/editing/data/samples.jsonl).
| | |
| :---: | :---: |
| <div align="center"><a href="../examples/editing/data/images/1.webp"><img width="180" alt="editing input 1" src="../examples/editing/data/images/1.webp"></a> <a href="../examples/editing/data/images/1_out.webp"><img width="180" alt="editing output 1" src="../examples/editing/data/images/1_out.webp"></a><br><sub>Change the jacket of the person on the left to bright yellow.</sub></div> | <div align="center"><a href="../examples/editing/data/images/3.webp"><img width="180" alt="editing input 3" src="../examples/editing/data/images/3.webp"></a> <a href="../examples/editing/data/images/3_out.webp"><img width="180" alt="editing output 3" src="../examples/editing/data/images/3_out.webp"></a><br><sub>在小狗头上放一个花环,并且把图片变为吉卜力风格。</sub></div> |
| <div align="center"><a href="../examples/editing/data/images/2.webp"><img width="180" alt="editing input 2" src="../examples/editing/data/images/2.webp"></a> <a href="../examples/editing/data/images/2_out.webp"><img width="180" alt="editing output 2" src="../examples/editing/data/images/2_out.webp"></a><br><sub>Make the person in the image smile.</sub></div> | <div align="center"><a href="../examples/editing/data/images/4.webp"><img width="180" alt="editing input 4" src="../examples/editing/data/images/4.webp"></a> <a href="../examples/editing/data/images/4_out.webp"><img width="180" alt="editing output 4" src="../examples/editing/data/images/4_out.webp"></a><br><sub>Add a bouquet of flowers.</sub></div> |
| <div align="center"><a href="../examples/editing/data/images/5.webp"><img width="180" alt="editing input 5" src="../examples/editing/data/images/5.webp"></a> <a href="../examples/editing/data/images/5_out.webp"><img width="180" alt="editing output 5" src="../examples/editing/data/images/5_out.webp"></a><br><sub>Turn the image into an American comic style.</sub></div> | <div align="center"><a href="../examples/editing/data/images/8.webp"><img width="180" alt="editing input 8" src="../examples/editing/data/images/8.webp"></a> <a href="../examples/editing/data/images/8_out.webp"><img width="180" alt="editing output 8" src="../examples/editing/data/images/8_out.webp"></a><br><sub>Replace the man with a woman.</sub></div> |
| <div align="center"><a href="../examples/editing/data/images/6.webp"><img width="180" alt="editing input 6" src="../examples/editing/data/images/6.webp"></a> <a href="../examples/editing/data/images/6_out.webp"><img width="180" alt="editing output 6" src="../examples/editing/data/images/6_out.webp"></a><br><sub>Replace the text "WARFIGHTER" to "BATTLEFIELD" in the bold orange-red font.</sub></div> | <div align="center"><a href="../examples/editing/data/images/7.webp"><img width="180" alt="editing input 7" src="../examples/editing/data/images/7.webp"></a> <a href="../examples/editing/data/images/7_out.webp"><img width="180" alt="editing output 7" src="../examples/editing/data/images/7_out.webp"></a><br><sub>Remove the person on the far right wearing a green skirt and a green top.</sub></div> |
#### ✏️ *Image Editing (Reasoning)*
Reproducible prompts are in
[`examples/editing/data/samples_reasoning.jsonl`](../examples/editing/data/samples_reasoning.jsonl).
<table>
<tr>
<th style="width: 20%">Original Text</th>
<th style="width: 30%">Original Image</th>
<th style="width: 20%">Reasoning Process</th>
<th style="width: 30%">Resulting Image</th>
</tr>
<tr>
<td style="vertical-align: top;">Draw what it will look like one hour later.</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/034_temporal_reasoning_draw_what_it_will_look_like.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a glass cup of hot tea with steeping tea leaves, and the water appears relatively clear. 2. <b>Instruction Understanding:</b> The instruction asks for the appearance one hour later. 3. <b>Reasoning Process:</b> Over time, tannins and pigments leach out, making the tea noticeably darker and more uniformly colored, and the leaves may look more swollen and darker from soaking. 4. <b>Expected Visual Changes:</b> The expected visible result is a deeper amber-to-brown tea color and more fully saturated liquid. 5. <b>Elements to Preserve:</b> The glass cup, scattered leaves around it, background, and camera angle should remain unchanged. 6. <b>Explicit Edit Prompt:</b> Edit the tea liquid to a much darker, more saturated amber-brown color as if fully steeped, and make the tea leaves look slightly darker and more swollen, while keeping the glass cup, surrounding leaves, background, and framing unchanged.</div></td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/034_temporal_reasoning_draw_what_it_will_look_like_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Draw what it will look like immediately after someone stands up from sitting on it for a long time.</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/036_causal_reasoning_draw_what_it_will_look_like.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a fluffy lime-green beanbag chair that looks evenly plump and undisturbed on a white background. 2. <b>Instruction Understanding:</b> The edit instruction asks for its appearance immediately after someone stood up from sitting there for a long time. 3. <b>Reasoning Process:</b> Prolonged weight compresses the fabric and internal fill, leaving a depressed seat area, wrinkles radiating outward, and a slowly recovering shape. 4. <b>Expected Visual Changes:</b> The visible result should be a noticeable dip and creasing where a person was seated. 5. <b>Elements to Preserve:</b> The background, beanbag color, lighting, and camera angle should remain unchanged while only the beanbag’s shape shows the compression. 6. <b>Explicit Edit Prompt:</b> Edit the beanbag chair to show a clear seated depression in the center with surrounding wrinkles and slightly compressed fabric, while keeping the white background, lighting, and camera angle unchanged.</div></td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/036_causal_reasoning_draw_what_it_will_look_like_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Draw an image showing the side view of the provided traffic cone.</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/039_spatial_reasoning_draw_an_image_showing_the_si.png" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a 3D perspective view of a traffic cone. 2. <b>Instruction Understanding:</b> The instruction asks for a side view. 3. <b>Reasoning Process:</b> A side view of a standard traffic cone results in a triangular silhouette with a flat rectangular base. 4. <b>Expected Visual Changes:</b> The perspective is flattened into this 2D-like geometric profile. 5. <b>Elements to Preserve:</b> The cone's height and color should remain consistent with the original object. 6. <b>Explicit Edit Prompt:</b> Edit the perspective view into a flat side-profile silhouette of a triangle with a rectangular base, keeping the red color and proportions unchanged.</div></td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/039_spatial_reasoning_draw_an_image_showing_the_si_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Change the water to high-concentration saltwater</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/042_physics_change_the_water_to_high-con.jpg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows an egg resting at the bottom of a glass of water. 2. <b>Instruction Understanding:</b> The instruction asks to change the medium to high-concentration saltwater. 3. <b>Reasoning Process:</b> Saltwater is denser than fresh water, which increases the buoyant force on the egg. 4. <b>Expected Visual Changes:</b> As density increases, the egg will overcome gravity and float higher or suspend in the middle of the liquid. 5. <b>Elements to Preserve:</b> The glass and the egg's appearance should remain consistent, focusing on the shift in the egg's vertical position. 6. <b>Explicit Edit Prompt:</b> Edit the position of the egg so it is floating in the middle of the liquid instead of resting on the bottom, while keeping the glass and the egg's appearance unchanged.</div></td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/042_physics_change_the_water_to_high-con_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">What the fruit looks like when ripe in the picture</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/044_biology_what_the_fruit_looks_like_wh.jpg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows green, unripe bananas. 2. <b>Instruction Understanding:</b> The instruction asks for the appearance of the fruit when ripe. 3. <b>Reasoning Process:</b> Ripening involves a breakdown of chlorophyll and the production of sugars, which turns the skin from green to yellow and often causes small brown sugar spots to appear. 4. <b>Expected Visual Changes:</b> The color and texture of the peel should transition to a ripe state. 5. <b>Elements to Preserve:</b> The shape of the bananas and the white background should remain constant. 6. <b>Explicit Edit Prompt:</b> Edit the green bananas to be bright yellow with small brown spots, while keeping the original shape and white background unchanged.</div></td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/044_biology_what_the_fruit_looks_like_wh_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Correct the unreasonable part in the image.</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/046_anomaly_correction_correct_the_unreasonable_par.jpg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a kettle pouring water onto a mug, but the stream is misaligned and missing the cup. 2. <b>Instruction Understanding:</b> The instruction asks to fix the physical inconsistency. 3. <b>Reasoning Process:</b> The water stream must be redirected to connect the spout to the mug, maintaining the trajectory of gravity. 4. <b>Expected Visual Changes:</b> The water stream will be redirected to connect the spout to the mug. 5. <b>Elements to Preserve:</b> The kettle, mug, and background must remain unchanged while the water path is corrected. 6. <b>Explicit Edit Prompt:</b> Draw a continuous water stream connecting the kettle spout to the mug, keeping the kettle, mug, and background unchanged.</div></td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/046_anomaly_correction_correct_the_unreasonable_par_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
<tr>
<td style="vertical-align: top;">Modify the matrix in the image to an upper triangular matrix</td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/047_mathematics_modify_the_matrix_in_the_ima.jpg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
<td><div style="max-height: 200px; overflow-y: auto;">
1. <b>Source Image Analysis:</b> The source image shows a 2x2 matrix with values 1, 2, 3, and 4. 2. <b>Instruction Understanding:</b> The instruction asks to convert this to an upper triangular matrix. 3. <b>Reasoning Process:</b> By definition, an upper triangular matrix has zeros below the main diagonal, so the entry '3' must be changed to '0' while keeping '1', '2', and '4' as they are, and this modification satisfies the mathematical property requested. 4. <b>Expected Visual Changes:</b> The entry '3' in the lower-left position will be changed to '0'. 5. <b>Elements to Preserve:</b> The grid lines, the matrix structure, and the other entries must remain unchanged. 6. <b>Explicit Edit Prompt:</b> Change the '3' in the lower-left position to '0', while keeping the matrix structure and other entries unchanged.</div></td>
<td style="vertical-align: top;"><img src="../examples/editing/data/images_reasoning/047_mathematics_modify_the_matrix_in_the_ima_result.jpeg" style="max-width: 100%; max-height: 100%; object-fit: contain;"></td>
</tr>
</table>
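
As a sanity check on the matrix edit in the last row, the requested upper-triangular form can be reproduced with NumPy (an illustrative sketch only, not part of the model pipeline):

```python
import numpy as np

# The 2x2 matrix shown in the source image.
m = np.array([[1, 2],
              [3, 4]])

# An upper triangular matrix has zeros below the main diagonal;
# np.triu zeroes exactly those entries, so the '3' becomes '0'.
upper = np.triu(m)
print(upper)
# [[1 2]
#  [0 4]]
```

This matches the edit the model performs: only the below-diagonal entry changes, and every other entry is preserved.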
---
Each case below is a single rendered response from `model.interleave_gen`:
the model first runs a `<think>...</think>` reasoning block that produces
intermediate images, then emits the final interleaved text-and-image
answer.
Reproducible prompts are in
[`examples/interleave/data/samples.jsonl`](../examples/interleave/data/samples.jsonl).
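
The `<think>...</think>` convention described above lends itself to simple post-processing. Below is a minimal sketch: the tag format comes from this section, but `split_response` itself is a hypothetical helper, not part of the project's API:

```python
import re

# Hypothetical helper (not the project's API): split a raw interleaved
# response into its <think> reasoning block and the final answer text.
def split_response(text: str) -> tuple[str, str]:
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        # No reasoning block: the whole response is the answer.
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()

thinking, answer = split_response(
    "<think>draft layout <image_0></think>Here is the result: <image_1>"
)
print(thinking)  # draft layout <image_0>
print(answer)    # Here is the result: <image_1>
```

Image placeholders such as `<image_0>` here stand in for the intermediate and final images embedded in the stream; how they are actually tokenized is model-specific.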
|