Green-VLA: Staged Vision-Language-Action Model for Generalist Robots
Abstract
Green-VLA is a five-stage vision-language-action framework for real-world robot deployment that achieves generalization across different robot embodiments through multimodal training and reinforcement learning.
We introduce Green-VLA, a staged Vision-Language-Action (VLA) framework for real-world deployment on the Green humanoid robot while maintaining generalization across diverse embodiments. Green-VLA follows a five-stage curriculum: (L0) foundational VLMs, (L1) multimodal grounding, (R0) multi-embodiment pretraining, (R1) embodiment-specific adaptation, and (R2) reinforcement-learning (RL) policy alignment. We couple a scalable data-processing pipeline (3,000 hours of demonstrations) with temporal alignment and quality filtering, and use a unified, embodiment-aware action interface that lets a single policy control humanoids, mobile manipulators, and fixed-base arms. At inference, the VLA controller is enhanced with episode-progress prediction, out-of-distribution detection, and joint-prediction-based guidance to improve safety and the precision of target selection. Experiments on Simpler BRIDGE WidowX and CALVIN ABC-D, as well as real-robot evaluations, show strong generalization and performance gains from RL alignment in success rate, robustness, and long-horizon efficiency.
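The abstract describes the unified, embodiment-aware action interface only at a high level. Below is a minimal sketch of one way such an interface could be laid out, assuming the fixed 64-dimensional ("R64") layout mentioned in the community TL;DR, with hypothetical per-embodiment slot assignments and a validity mask; it is an illustration, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the paper's code) of a unified,
# embodiment-aware action interface: each embodiment writes its commands
# into a fixed-size action vector with a fixed semantic layout, plus a
# {0, 1} validity mask, so one policy head can serve humanoids, mobile
# manipulators, and fixed-base arms.

import numpy as np

ACTION_DIM = 64  # size of the unified ("R64") action vector

# Hypothetical slot layout: which indices of the unified vector each
# embodiment actually writes. The real semantic layout is fixed by the
# authors; these assignments are illustrative only.
EMBODIMENT_SLOTS = {
    "fixed_arm":      list(range(0, 7)) + [14],             # 7-DoF arm + gripper
    "mobile_manip":   list(range(0, 7)) + [14] + [60, 61],  # arm + gripper + base (v, w)
    "bimanual_aloha": list(range(0, 14)) + [14, 15],        # two 7-DoF arms + two grippers
}

def to_unified_action(embodiment: str, native_action: np.ndarray):
    """Scatter a native action into the fixed 64-dim layout.

    Returns the padded action and a validity mask; masking the
    behavior-cloning loss with it keeps unused DoFs from contributing
    spurious gradients when training one policy across embodiments.
    """
    slots = EMBODIMENT_SLOTS[embodiment]
    assert native_action.shape[-1] == len(slots), "native DoF count mismatch"
    action = np.zeros(ACTION_DIM, dtype=np.float32)
    mask = np.zeros(ACTION_DIM, dtype=np.float32)
    action[slots] = native_action
    mask[slots] = 1.0
    return action, mask

# Example: a 10-DoF mobile-manipulator command (7 arm + 1 gripper + 2 base).
act, msk = to_unified_action("mobile_manip", np.random.randn(10).astype(np.float32))
```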
Community
TL;DR: Scaling VLA isn't enough: you need quality-aligned trajectories + a unified action interface + staged RL refinement to get reliable cross-robot generalization. This work:
- introduces a unified R64 action space with a fixed semantic layout plus embodiment/control-type prompts, and a masked BC loss so unused DoFs don't inject spurious gradients;
- normalizes heterogeneous demonstration speeds via optical-flow-based temporal resampling to align motion statistics across datasets (roughly sketched below);
- follows a staged recipe R0 → R1 → R2, where R2 RL alignment explicitly targets long-horizon consistency and error recovery.

On real bimanual table cleaning (ALOHA), it reaches 69.5% first-item success vs. 35.6% for the baseline and is ~2× faster (1m35s vs. 2m59s). On Simpler (Google Robot), performance improves from 60.2 (R0) to 71.8 (R2). A nice practical touch: an episode-end prediction head reduces "post-success fidgeting" that can flip successes into failures.
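For the optical-flow-based temporal resampling, here is a rough, self-contained sketch of my reading of the idea (not the authors' pipeline): compute the mean flow magnitude between consecutive frames, then keep frames spaced by roughly equal cumulative motion, so demonstrations recorded at different speeds end up with comparable motion statistics. The `motion_per_step` target and the Farneback flow settings are assumptions.

```python
# Sketch of optical-flow-based temporal resampling for demonstrations.
# Assumes `frames` is a list of RGB uint8 arrays; actions/states would be
# subsampled with the same kept indices.

import cv2
import numpy as np

def flow_magnitudes(frames):
    """Per-transition mean optical-flow magnitude between consecutive frames."""
    mags = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_RGB2GRAY)
    for frame in frames[1:]:
        nxt = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(float(np.linalg.norm(flow, axis=-1).mean()))
        prev = nxt
    return np.asarray(mags)

def resample_indices(frames, motion_per_step=1.5):
    """Pick frame indices spaced by approximately equal cumulative motion.

    `motion_per_step` is a hypothetical target (mean flow pixels per kept
    transition); it sets the common "visual speed" all datasets map to.
    """
    cum = np.concatenate([[0.0], np.cumsum(flow_magnitudes(frames))])
    n_keep = max(2, int(round(cum[-1] / motion_per_step)) + 1)
    targets = np.linspace(0.0, cum[-1], n_keep)
    # For each motion target, keep the earliest frame that reaches it.
    idx = np.searchsorted(cum, targets)
    return np.unique(np.clip(idx, 0, len(frames) - 1))
```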
Project Page: https://greenvla.github.io/
Code: https://github.com/greenvla/GreenVLA
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization (2026)
- Clutter-Resistant Vision-Language-Action Models through Object-Centric and Geometry Grounding (2025)
- PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation (2026)
- GR-Dexter Technical Report (2025)
- Mind to Hand: Purposeful Robotic Control via Embodied Reasoning (2025)
- CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos (2026)
- Robotic VLA Benefits from Joint Learning with Motion Image Diffusion (2025)
Hello, colleagues! Great work! A question: can your approach be deployed as-is on the Unitree G1, or would that require changes to the codebase/pipeline logic?