---
language:
- en
- es
- fr
- de
- it
- hi
- mr
- sa
- kn
- te
- ta
- ml
- zh
- ja
- ko
- ar
- bn
- gu
- or
- pa
- ru
- th
license: gemma
library_name: transformers
tags:
- vision-language
- retrieval
- colbert
- late-interaction
- multimodal
- multilingual
- document-retrieval
- 22-languages
pipeline_tag: visual-document-retrieval
base_model:
- google/gemma-3-4b-it
datasets:
- Cognitive-Lab/nayanair-bench
model-index:
- name: ColNetraEmbed
  results:
  - task:
      type: image-text-retrieval
      name: Cross-Lingual Document Retrieval
    dataset:
      type: Cognitive-Lab/nayanair-bench
      name: Nayana-IR Cross-Lingual
      split: test
    metrics:
    - type: ndcg_at_5
      value: 0.637
      name: NDCG@5
    - type: recall_at_10
      value: 0.700
      name: Recall@10
    - type: map_at_10
      value: 0.610
      name: MAP@10
    - type: mrr_at_10
      value: 0.610
      name: MRR@10
  - task:
      type: image-text-retrieval
      name: Monolingual Document Retrieval
    dataset:
      type: Cognitive-Lab/nayanair-bench
      name: Nayana-IR Monolingual
      split: test
    metrics:
    - type: ndcg_at_5
      value: 0.670
      name: NDCG@5
    - type: recall_at_10
      value: 0.764
      name: Recall@10
    - type: map_at_10
      value: 0.645
      name: MAP@10
    - type: mrr_at_10
      value: 0.686
      name: MRR@10
  - task:
      type: image-text-retrieval
      name: English Document Retrieval
    dataset:
      type: vidore/vidore-benchmark
      name: ViDoRe v2
      split: test
    metrics:
    - type: ndcg_at_5
      value: 0.551
      name: NDCG@5
    - type: recall_at_10
      value: 0.664
      name: Recall@10
    - type: map_at_10
      value: 0.445
      name: MAP@10
    - type: mrr_at_10
      value: 0.445
      name: MRR@10
---
# ColNetraEmbed

[Paper](https://arxiv.org/abs/2512.03514) · [GitHub](https://github.com/adithya-s-k/colpali) · [Model](https://huggingface.co/Cognitive-Lab/ColNetraEmbed) · [Blog](https://www.cognitivelab.in/blog/introducing-netraembed) · [Demo Space](https://huggingface.co/spaces/AdithyaSK/NetraEmbed) · [Inference Notebook](https://huggingface.co/Cognitive-Lab/ColNetraEmbed/blob/main/ColNetraEmbed_InferenceDemo.ipynb) · [Gradio Demo Notebook](https://huggingface.co/Cognitive-Lab/NetraEmbed/blob/main/NetraEmbed_Gradio_Demo_final.ipynb)

**ColNetraEmbed** is a state-of-the-art multilingual multimodal embedding model for visual document retrieval, built on a Gemma3 backbone with ColBERT-style multi-vector representations.

## Model Description

ColNetraEmbed encodes documents as multi-vector representations using the ColPali architecture. Each image patch is mapped to a contextualized embedding, enabling fine-grained matching between visual content and text queries through late interaction (MaxSim); a minimal scoring sketch follows the list below.

- **Model Type:** Multilingual multimodal embedding model with ColPali-style multi-vector representations
- **Architecture:** ColPali with Gemma3-4B backbone
- **Embedding Dimension:** 128 per token
- **Capabilities:** Multilingual, multimodal (vision + text), multi-vector late interaction
- **Use Case:** Visual document retrieval, multilingual document understanding, fine-grained visual search
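
To make the late-interaction step concrete, here is a minimal MaxSim scoring sketch: each query-token embedding is matched against its most similar document patch, and the per-token maxima are summed. This is an illustrative standalone implementation, not the library's internal code; `processor.score_multi_vector` in the Quick Start below performs the same computation.

```python
import torch

def maxsim_scores(q: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) scoring.

    q: (num_queries, num_query_tokens, 128) query token embeddings
    p: (num_docs, num_patches, 128) document patch embeddings
    returns: (num_queries, num_docs) relevance scores
    """
    # Pairwise token-patch similarities: (queries, docs, query_tokens, patches)
    sim = torch.einsum("qtd,psd->qpts", q, p)
    # Best-matching patch for each query token, summed over query tokens.
    return sim.max(dim=-1).values.sum(dim=-1)

# Tiny smoke test with random embeddings.
scores = maxsim_scores(torch.randn(2, 16, 128), torch.randn(3, 256, 128))
print(scores.shape)  # torch.Size([2, 3])
```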

## Paper

📄 **[M3DR: Towards Universal Multilingual Multimodal Document Retrieval](https://arxiv.org/abs/2512.03514)**

## Installation

```bash
pip install git+https://github.com/adithya-s-k/colpali.git
```

## Quick Start

```python
import torch
from PIL import Image
from colpali_engine.models import ColGemma3, ColGemmaProcessor3

# Load model and processor
model_name = "Cognitive-Lab/ColNetraEmbed"
model = ColGemma3.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
processor = ColGemmaProcessor3.from_pretrained(model_name)

# Load your images
images = [
    Image.open("document1.jpg"),
    Image.open("document2.jpg"),
]

# Define queries
queries = [
    "What is the total revenue?",
    "Show me the organizational chart",
]

# Process and encode
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)   # Shape: (num_images, num_patches, 128)
    query_embeddings = model(**batch_queries)  # Shape: (num_queries, num_tokens, 128)

# Compute similarity scores using MaxSim
scores = processor.score_multi_vector(
    qs=query_embeddings,
    ps=image_embeddings,
)  # Shape: (num_queries, num_images)

# Get best matches
for i, query in enumerate(queries):
    best_idx = scores[i].argmax().item()
    print(f"Query: '{query}' -> Best match: Image {best_idx + 1} (score: {scores[i, best_idx]:.2f})")
```
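
For larger collections, a common pattern is to embed pages once in small batches, keep the resulting multi-vector embeddings on the CPU, and score new queries against the stored set. The sketch below reuses `model` and `processor` from the Quick Start; the batch size and file names are illustrative:

```python
import torch
from PIL import Image

corpus_paths = ["page_001.jpg", "page_002.jpg", "page_003.jpg"]  # illustrative files
batch_size = 2  # tune to available GPU memory
doc_embeddings = []

# Offline indexing: embed the corpus once, batch by batch.
for start in range(0, len(corpus_paths), batch_size):
    paths = corpus_paths[start:start + batch_size]
    batch = processor.process_images([Image.open(p) for p in paths]).to(model.device)
    with torch.no_grad():
        embeddings = model(**batch)
    # Keep only the current batch on the GPU; store the rest on the CPU.
    doc_embeddings.extend(emb.cpu().to(torch.float32) for emb in embeddings)

# Online search: embed the query and score it against the stored corpus.
batch_query = processor.process_queries(["Which page shows quarterly revenue?"]).to(model.device)
with torch.no_grad():
    query_embedding = model(**batch_query)

scores = processor.score_multi_vector(
    qs=query_embedding.cpu().to(torch.float32),
    ps=doc_embeddings,  # list of (num_patches, 128) tensors
)
print(corpus_paths[scores[0].argmax().item()])  # best-matching page
```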

## Use Cases

- **Document Retrieval:** Search through large collections of visual documents
- **Visual Question Answering:** Answer questions about document content
- **Document Understanding:** Extract and match information from scanned documents
- **Cross-lingual Document Search:** Query in one language, retrieve documents in another (see the sketch below)
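
Because queries and documents share one embedding space across all 22 languages, cross-lingual search uses exactly the same API as the Quick Start. A minimal sketch reusing `model`, `processor`, and `image_embeddings` from above; the Hindi query string is an illustrative example:

```python
import torch

# Hindi query against the English document images embedded in the Quick Start.
queries_hi = ["कुल राजस्व क्या है?"]  # "What is the total revenue?"

batch_queries_hi = processor.process_queries(queries_hi).to(model.device)
with torch.no_grad():
    query_embeddings_hi = model(**batch_queries_hi)

# Same MaxSim scoring as before: (1, num_images) relevance scores.
scores_hi = processor.score_multi_vector(qs=query_embeddings_hi, ps=image_embeddings)
print(scores_hi.argmax(dim=1))  # index of the best-matching page per query
```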

## Model Details

- **Base Model:** [Gemma3-4B-IT](https://huggingface.co/google/gemma-3-4b-it)
- **Vision Encoder:** SigLIP
- **Training Data:** Multilingual document datasets
- **Embedding Strategy:** Multi-vector (late interaction)
- **Similarity Function:** MaxSim (maximum similarity)

## Performance

ColNetraEmbed achieves strong performance on multilingual document retrieval benchmarks. It is evaluated on [Nayana-IR Bench](https://huggingface.co/collections/Cognitive-Lab/nayanair-bench) (22 languages) and on ViDoRe v2.

### Benchmark Results

**Nayana-IR Cross-Lingual**

| | Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 | |
| |-------|:------:|:---------:|:------:|:------:| |
| | **ColNetraEmbed** | **0.637** | **0.700** | **0.610** | **0.610** | |
| | Jina-Embeddings-v4 | 0.435 | 0.435 | 0.390 | 0.548 | |
| | ColNomic-Embed-3B | 0.315 | 0.320 | 0.267 | 0.444 | |
| | ColPali-v1.3 | 0.284 | 0.347 | 0.249 | 0.403 | |
| | GME-Qwen2-VL-2B | 0.235 | 0.308 | 0.209 | 0.314 | |
| | ColQwen2.5-v0.2 | 0.143 | 0.160 | 0.127 | 0.220 | |
| | ColQwen2-v1.0 | 0.050 | 0.065 | 0.038 | 0.109 | |

**Nayana-IR Monolingual**

| | Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 | |
| |-------|:------:|:---------:|:------:|:------:| |
| | **ColNetraEmbed** | **0.670** | **0.764** | **0.645** | **0.686** | |
| | ColNomic-Embed-3B | 0.534 | 0.603 | 0.515 | 0.546 | |
| | ColQwen2.5-v0.2 | 0.453 | 0.513 | 0.437 | 0.464 | |
| | GME-Qwen2-VL-2B | 0.444 | 0.525 | 0.426 | 0.452 | |
| | ColQwen2-v1.0 | 0.413 | 0.466 | 0.398 | 0.422 | |
| | ColPali-v1.3 | 0.410 | 0.484 | 0.393 | 0.422 | |
|

**ViDoRe v2**

| | Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 | |
| |-------|:------:|:---------:|:------:|:------:| |
| | ColQwen2.5-v0.2 | 0.592 | 0.664 | 0.484 | 0.711 | |
| | Jina-Embeddings-v4 | 0.576 | 0.686 | - | - | |
| | GME-Qwen2-VL-2B | 0.574 | 0.630 | 0.466 | 0.690 | |
| | ColNomic-Embed-3B | 0.556 | 0.633 | 0.451 | 0.672 | |
| **ColNetraEmbed** | 0.551 | 0.664 | 0.445 | 0.445 |
| | ColQwen2-v1.0 | 0.545 | 0.640 | 0.438 | 0.653 | |
| | ColPali-v1.3 | 0.538 | 0.627 | 0.436 | 0.644 | |
|

**Key Results:**
- 🏆 **Strong multilingual performance** with ColBERT-style late interaction
- 📈 **124% improvement** over ColPali-v1.3 on cross-lingual retrieval (NDCG@5 0.637 vs. 0.284)
- 🌍 Supports **22 languages** across diverse script families
- 🔍 **Fine-grained matching** through token-level MaxSim scoring

**Comparison: Multi-vector vs. Single-vector**
- ColNetraEmbed (multi-vector): more interpretable, with token-level attribution
- NetraEmbed (single-vector): higher accuracy (0.716 vs. 0.637 NDCG@5) and roughly 250x smaller storage footprint (see the sketch below)
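
For intuition on where that storage gap comes from: a multi-vector index stores one 128-dim embedding per token or patch for every page, while a single-vector index stores one pooled embedding per page. The numbers below are illustrative assumptions chosen to reproduce the reported order of magnitude, not measured values from the paper:

```python
# Back-of-the-envelope storage comparison (illustrative assumptions only).
tokens_per_page = 2000   # hypothetical patch/token count per page
multi_dim = 128          # per-token dimension of ColNetraEmbed
single_dim = 1024        # hypothetical pooled-vector dimension
bytes_per_value = 2      # e.g. bfloat16 storage

multi_vector_bytes = tokens_per_page * multi_dim * bytes_per_value  # 512,000 B/page
single_vector_bytes = single_dim * bytes_per_value                  # 2,048 B/page
print(f"ratio: {multi_vector_bytes / single_vector_bytes:.0f}x")    # -> 250x
```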

See our [paper](https://arxiv.org/abs/2512.03514) for the full evaluation and architectural comparisons.

## Citation

```bibtex
@misc{kolavi2025m3druniversalmultilingualmultimodal,
  title={M3DR: Towards Universal Multilingual Multimodal Document Retrieval},
  author={Adithya S Kolavi and Vyoman Jain},
  year={2025},
  eprint={2512.03514},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2512.03514}
}
```

## License

This model is released under the same license as the base Gemma3 model.

## Acknowledgments

Training, inference, and evaluation were made possible by compute credits from [Modal](https://modal.com), our compute sponsor. Dataset curation and synthesis were supported by the [Meta LLaMA Impact Grant](https://about.fb.com/news/2025/04/llama-impact-grant-recipients/?utm_source=AIatMeta&utm_medium=organic_social&utm_content=image&utm_campaign=llamacon) through our [Nayana initiative](https://www.cognitivelab.in/nayana). We thank Meta for its continued support of our research at [CognitiveLab](https://www.cognitivelab.in).

Built on top of the ColPali framework and the Gemma3 architecture.