arxiv:2603.25240

Lingshu-Cell: A generative cellular world model for transcriptome modeling toward virtual cells

Published on Mar 26 · Submitted by Han Zhang on Apr 1

Abstract

Lingshu-Cell is a masked discrete diffusion model that learns transcriptomic state distributions and enables conditional simulation of cellular perturbations across diverse tissues and species.

AI-generated summary

Modeling cellular states and predicting their responses to perturbations are central challenges in computational biology and the development of virtual cells. Existing foundation models for single-cell transcriptomics provide powerful static representations, but they do not explicitly model the distribution of cellular states for generative simulation. Here, we introduce Lingshu-Cell, a masked discrete diffusion model that learns transcriptomic state distributions and supports conditional simulation under perturbation. By operating directly in a discrete token space that is compatible with the sparse, non-sequential nature of single-cell transcriptomic data, Lingshu-Cell captures complex transcriptome-wide expression dependencies across approximately 18,000 genes without relying on prior gene selection, such as filtering by high variability or ranking by expression level. Across diverse tissues and species, Lingshu-Cell accurately reproduces transcriptomic distributions, marker-gene expression patterns and cell-subtype proportions, demonstrating its ability to capture complex cellular heterogeneity. Moreover, by jointly embedding cell type or donor identity with perturbation, Lingshu-Cell can predict whole-transcriptome expression changes for novel combinations of identity and perturbation. It achieves leading performance on the Virtual Cell Challenge H1 genetic perturbation benchmark and in predicting cytokine-induced responses in human PBMCs. Together, these results establish Lingshu-Cell as a flexible cellular world model for in silico simulation of cell states and perturbation responses, laying the foundation for a new paradigm in biological discovery and perturbation screening.

Community

Paper author · Paper submitter

✨ Highlights

  • Lingshu-Cell introduces a generative cellular world model for single-cell transcriptomics based on a masked discrete diffusion framework.

  • Lingshu-Cell performs transcriptome-wide modeling over ~18,000 genes directly in a discrete token space that is compatible with the sparse, non-sequential nature of scRNA-seq data, without prior gene selection.

  • Lingshu-Cell reproduces realistic cell populations across diverse tissues and species, capturing marker-gene expression patterns, cell-subtype proportions, and transcriptomic distributions.

  • Lingshu-Cell achieves strong performance in response prediction under both genetic and cytokine perturbations.
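
To make the masked discrete diffusion framing concrete, here is a minimal sketch of the forward (corruption) process such a model trains against: each expression token is independently replaced by a mask token with probability t, and a denoiser learns to recover the originals. The `MASK` id and the independent-masking schedule are illustrative assumptions, not the paper's exact formulation.

```python
import random

MASK = 0  # hypothetical reserved mask-token id

def corrupt(tokens, t):
    """Forward process of a masked discrete diffusion model (sketch).

    With masking probability t in [0, 1], each expression token is
    independently replaced by MASK. A denoiser network would then be
    trained to predict the original tokens from the partially masked
    cell; at t = 1 the input is fully masked, which is the starting
    point for unconditional generation.
    """
    return [MASK if random.random() < t else tok for tok in tokens]
```

At sampling time, generation would run this in reverse: start from an all-`MASK` cell and iteratively unmask tokens at decreasing noise levels.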

one question that sticks with me: how robust is the fixed discrete token vocabulary to skewed expression, especially for rare transcripts that often drive fine-grained subtypes?

they claim no gene filtering and model all ~18k genes in a single vocab, but would varying tokenization granularity or using adaptive binning change the recovery of marker genes and cell-subtype proportions?

the arxivlens breakdown helped me parse the method details, particularly the discrete diffusion in token space and the conditioning scheme.

have you done any ablation with different vocab sizes or quantization levels to quantify how much the discreteness itself, beyond model size, drives performance?

overall it's a neat step toward truly generative, perturbation-aware cell modeling, and i’m curious how you see this scaling to longitudinal trajectories or multi-omics in the future.

Paper author

Thanks for the thoughtful question.

Just to clarify one point: in our setup, each cell is represented over a fixed list of ~18k genes, while the vocab is used for discretized expression values rather than gene identities.

The quantization is also not uniform: low counts are kept at fine resolution, while higher counts are compressed more coarsely, though roughly two significant digits are still preserved. The design is therefore meant to retain low-abundance signals while handling the heavy-tailed count range efficiently.
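
A minimal sketch of what such a non-uniform binning could look like, to make the idea concrete. The `fine_max` cutoff and the two-significant-digit rounding rule here are assumptions for illustration, not the paper's exact scheme.

```python
import math

def quantize_count(c: int, fine_max: int = 99) -> int:
    """Map a raw expression count to a discrete bin value (sketch).

    Counts up to `fine_max` keep exact resolution, so low-abundance
    signal is untouched; larger counts are rounded to two significant
    digits, compressing the heavy tail into far fewer bins. The set of
    distinct bin values would then form the expression-value vocabulary.
    """
    if c <= fine_max:
        return c
    ndigits = 1 - int(math.floor(math.log10(c)))  # keep 2 sig. digits
    return int(round(c, ndigits))
```

For example, a count of 7 stays 7, while 1234 collapses to the 1200 bin, so rare transcripts and the long tail share one vocabulary without the tail dominating it.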

So for skewed expression distributions, and low-abundance signals in particular, we do not expect the current discretization to wash them out in practice. Empirically, it also works well across the single-cell transcriptomic datasets in the paper, including good recovery of marker-gene patterns and subtype proportions. A more dedicated ablation over different quantization schemes or adaptive binning would definitely be worth exploring next.

We also agree that longitudinal trajectories and multi-omics are meaningful directions, and both feel like very natural next steps for this line of work.


In the paper, you write:

“Given the strong performance of Lingshu-Cell in modeling cellular gene expression distributions in the unconditional setting, we next asked whether the same framework could support conditional generation.”

I would like to ask for a clarification on this point. This sentence may lead readers to interpret the subsequent conditional generation / perturbation prediction results as being built on top of the previously trained unconditional model. Could you please clarify:

  1. For the conditional training used in Figure 3 and Figure 4, was the model initialized from the unconditional model and then further fine-tuned?
  2. If not, and the conditional models were trained independently, did you perform any ablation study comparing “fine-tuning from the unconditional model” versus “training the conditional model directly from scratch”?
  3. If such a comparison was not performed, would you say that the benefit of unconditional pretraining for downstream conditional perturbation prediction is empirically established in this work, or is the connection mainly that they share the same overall framework rather than a sequential training pipeline?

The reason I ask is that this point affects how readers should interpret the relationship between Figure 2 and Figures 3/4: whether they represent a progressive pipeline of “foundation model → conditional fine-tuning,” or rather different training settings under the same methodological framework.

Paper author

Thanks for raising this. To clarify, the conditional models in Figures 3 and 4 were trained independently, rather than initialized from the unconditional model in Figure 2 and then fine-tuned. So in this paper, Figure 2 and Figures 3/4 should be understood as different training settings under the same framework, not a sequential “pretraining → fine-tuning” pipeline.

We did not run an ablation comparing fine-tuning from the unconditional model versus training the conditional model from scratch, mainly because of computational constraints. Still, the conditional models achieve strong results without relying on unconditional pretraining. We think this suggests that the framework itself already provides a strong basis for conditional perturbation modeling. Using unconditional pretraining as an initialization step is something we plan to explore in the next iteration.


