# Geo-R1
```python
# Load the model and processor directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("miniHui/Geo-R1")
model = AutoModelForImageTextToText.from_pretrained("miniHui/Geo-R1")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

This repository contains the Geo-R1 model, a reasoning-centric post-training framework that unlocks geospatial reasoning in vision-language models, as introduced in the paper:
**Geo-R1: Unlocking VLM Geospatial Reasoning with Cross-View Reinforcement Learning**

Geo-R1 combines a "thinking scaffolding" stage (supervised fine-tuning on synthetic chain-of-thought exemplars) with an "elevating" stage that applies GRPO-based reinforcement learning to a weakly supervised cross-view pairing proxy task. This approach enables models to connect visual cues with geographic priors and harness reasoning for accurate prediction, achieving state-of-the-art performance across a range of geospatial reasoning benchmarks.
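To make the "elevating" stage concrete, the core of GRPO is a group-relative advantage: several completions are sampled per prompt, scored by a reward (here, the cross-view pairing proxy), and each completion's advantage is its reward normalized against the group's mean and standard deviation. The sketch below illustrates only that normalization step with hypothetical reward values; it is not Geo-R1's actual training code, and the function name is illustrative.

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: normalize each reward against the
    mean and standard deviation of its sampled group (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical: a group of 4 sampled completions, two of which the
# cross-view pairing proxy scores as correct.
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Completions scoring above their group's mean receive positive advantages and are reinforced; those below are suppressed, which is how the proxy reward elevates reasoning without per-example labels.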
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="miniHui/Geo-R1")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
pipe(text=messages)
```