ObjEmbed: Towards Universal Multimodal Object Embeddings

ObjEmbed is a multimodal embedding model designed to align specific image regions (objects) with textual descriptions. Unlike global embedding models, ObjEmbed decomposes an image into multiple regional embeddings along with global embeddings, supporting tasks such as visual grounding, local image retrieval, and global image retrieval.
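As a rough illustration of the difference between local and global image retrieval, the sketch below ranks two toy images against a text query, once by their global embedding and once by their best-matching regional embedding. It does not use the ObjEmbed API: the 2-D vectors, the `images` dictionary layout, and the helper names `cosine`, `global_score`, `local_score`, and `rank` are all hypothetical, and taking the maximum over regions is an assumed aggregation rule, not one stated in this card.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def global_score(query, image):
    # Image-level matching: compare the query with the whole-image embedding.
    return cosine(query, image["global"])

def local_score(query, image):
    # Region-level matching (assumed rule): score an image by its best region.
    return max(cosine(query, region) for region in image["regions"])

def rank(query, images, score_fn):
    """Return image names sorted from best to worst match."""
    return sorted(images, key=lambda name: score_fn(query, images[name]), reverse=True)

# Toy embeddings standing in for model outputs (hypothetical values).
images = {
    "street": {"global": [1.0, 0.1], "regions": [[1.0, 0.0], [0.0, 1.0]]},
    "beach": {"global": [0.6, 0.8], "regions": [[0.9, 0.4], [0.8, 0.6]]},
}
query = [0.0, 1.0]  # stands in for a text embedding of a small object

global_ranking = rank(query, images, global_score)
local_ranking = rank(query, images, local_score)
print(global_ranking, local_ranking)
```

In this toy setup the "street" image contains one region that matches the query exactly, so it ranks first under region-level scoring even though its global embedding is a poor match.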

Key Features

Object-Oriented Representation: It captures both semantic and spatial aspects of objects by generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU embedding that predicts localization quality.
Versatility: A single model handles both region-level tasks (visual grounding, local image retrieval) and image-level tasks (global image retrieval).
Efficient Encoding: All objects in an image, along with the full image, are encoded in a single forward pass.
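To make the role of the two per-region embeddings concrete, here is a minimal sketch of visual grounding in which each candidate region is scored by its semantic similarity to the text, weighted by a predicted localization quality. Several things here are assumptions rather than details from this card: the IoU embedding is reduced to a scalar `pred_iou` in [0, 1], the two signals are combined by a simple product, and the field names (`box`, `object_emb`, `pred_iou`) and the function `ground` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def region_score(text_emb, region):
    # Assumed combination: semantic match weighted by predicted box quality.
    return cosine(text_emb, region["object_emb"]) * region["pred_iou"]

def ground(text_emb, regions):
    """Return the box of the highest-scoring region for the given text."""
    best = max(regions, key=lambda region: region_score(text_emb, region))
    return best["box"]

# Toy candidates (hypothetical values): two overlapping boxes on the same
# object with different localization quality, plus one unrelated object.
regions = [
    {"box": (10, 10, 50, 50), "object_emb": [1.0, 0.0], "pred_iou": 0.5},
    {"box": (12, 8, 60, 58), "object_emb": [0.9, 0.1], "pred_iou": 0.9},
    {"box": (100, 100, 150, 150), "object_emb": [0.0, 1.0], "pred_iou": 0.95},
]
text_emb = [1.0, 0.0]

best_box = ground(text_emb, regions)
print(best_box)
```

The first box has the closest semantic match but a low predicted quality, so the second, better-localized box is selected. The third box is well localized but semantically unrelated and scores zero.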

Resources

Paper: ObjEmbed: Towards Universal Multimodal Object Embeddings (arXiv:2602.01753)
Code: Official GitHub Repository

Citation

If you find ObjEmbed helpful for your research, please consider citing:

@article{fu2026objembed,
  title={ObjEmbed: Towards Universal Multimodal Object Embeddings},
  author={Fu, Shenghao and Su, Yukun and Rao, Fengyun and LYU, Jing and Xie, Xiaohua and Zheng, Wei-Shi},
  journal={arXiv preprint arXiv:2602.01753},
  year={2026}
}

Model Details

Format: Safetensors
Model size: 2B params
Tensor type: BF16

Model ID: fushh7/ObjEmbed-2B
Collection: ObjEmbed (4 items)
Paper ID: arXiv 2602.01753, published Feb 2