Green-VLA: Staged Vision-Language-Action Model for Generalist Robots
Abstract
Green-VLA is a five-stage vision-language-action framework for real-world robot deployment that achieves generalization across different robot embodiments through multimodal training and reinforcement learning.
We introduce Green-VLA, a staged Vision-Language-Action (VLA) framework for real-world deployment on the Green humanoid robot while maintaining generalization across diverse embodiments. Green-VLA follows a five-stage curriculum: (L0) foundational VLMs, (L1) multimodal grounding, (R0) multi-embodiment pretraining, (R1) embodiment-specific adaptation, and (R2) reinforcement-learning (RL) policy alignment. We couple a scalable data-processing pipeline (3,000 hours of demonstrations) with temporal alignment and quality filtering, and use a unified, embodiment-aware action interface that lets a single policy control humanoids, mobile manipulators, and fixed-base arms. At inference, the VLA controller is enhanced with episode-progress prediction, out-of-distribution detection, and joint-prediction-based guidance to improve safety and the precision of target selection. Experiments on Simpler BRIDGE WidowX and CALVIN ABC-D, as well as real-robot evaluations, show strong generalization and performance gains from RL alignment in success rate, robustness, and long-horizon efficiency.
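The abstract describes the unified, embodiment-aware action interface only at a high level. Below is a minimal sketch of one way such an interface could be laid out, assuming the fixed 64-dimensional ("R64") layout mentioned in the community TL;DR, with hypothetical per-embodiment slot assignments and a validity mask; it is an illustration, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the paper's code) of a unified,
# embodiment-aware action interface: each embodiment writes its commands
# into a fixed-size action vector with a fixed semantic layout, plus a
# {0, 1} validity mask, so one policy head can serve humanoids, mobile
# manipulators, and fixed-base arms.

import numpy as np

ACTION_DIM = 64  # size of the unified ("R64") action vector

# Hypothetical slot layout: which indices of the unified vector each
# embodiment actually writes. The real semantic layout is fixed by the
# authors; these assignments are illustrative only.
EMBODIMENT_SLOTS = {
    "fixed_arm":      list(range(0, 7)) + [14],             # 7-DoF arm + gripper
    "mobile_manip":   list(range(0, 7)) + [14] + [60, 61],  # arm + gripper + base (v, w)
    "bimanual_aloha": list(range(0, 14)) + [14, 15],        # two 7-DoF arms + two grippers
}

def to_unified_action(embodiment: str, native_action: np.ndarray):
    """Scatter a native action into the fixed 64-dim layout.

    Returns the padded action and a validity mask; masking the
    behavior-cloning loss with it keeps unused DoFs from contributing
    spurious gradients when training one policy across embodiments.
    """
    slots = EMBODIMENT_SLOTS[embodiment]
    assert native_action.shape[-1] == len(slots), "native DoF count mismatch"
    action = np.zeros(ACTION_DIM, dtype=np.float32)
    mask = np.zeros(ACTION_DIM, dtype=np.float32)
    action[slots] = native_action
    mask[slots] = 1.0
    return action, mask

# Example: a 10-DoF mobile-manipulator command (7 arm + 1 gripper + 2 base).
act, msk = to_unified_action("mobile_manip", np.random.randn(10).astype(np.float32))
```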
Community
TL;DR: Scaling VLA isn't enough: you need quality-aligned trajectories + a unified action interface + staged RL refinement to get reliable cross-robot generalization. This work:
- introduces a unified R64 action space with a fixed semantic layout plus embodiment/control-type prompts, and a masked BC loss so unused DoFs don't inject spurious gradients;
- normalizes heterogeneous demonstration speeds via optical-flow-based temporal resampling to align motion statistics across datasets (roughly sketched below);
- follows a staged recipe R0 → R1 → R2, where R2 RL alignment explicitly targets long-horizon consistency and error recovery.

On real bimanual table cleaning (ALOHA), it reaches 69.5% first-item success vs. 35.6% for the baseline and is ~2× faster (1m35s vs. 2m59s). On Simpler (Google Robot), performance improves from 60.2 (R0) to 71.8 (R2). A nice practical touch: an episode-end prediction head reduces "post-success fidgeting" that can flip successes into failures.
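For the optical-flow-based temporal resampling, here is a rough, self-contained sketch of my reading of the idea (not the authors' pipeline): compute the mean flow magnitude between consecutive frames, then keep frames spaced by roughly equal cumulative motion, so demonstrations recorded at different speeds end up with comparable motion statistics. The `motion_per_step` target and the Farneback flow settings are assumptions.

```python
# Sketch of optical-flow-based temporal resampling for demonstrations.
# Assumes `frames` is a list of RGB uint8 arrays; actions/states would be
# subsampled with the same kept indices.

import cv2
import numpy as np

def flow_magnitudes(frames):
    """Per-transition mean optical-flow magnitude between consecutive frames."""
    mags = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_RGB2GRAY)
    for frame in frames[1:]:
        nxt = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(float(np.linalg.norm(flow, axis=-1).mean()))
        prev = nxt
    return np.asarray(mags)

def resample_indices(frames, motion_per_step=1.5):
    """Pick frame indices spaced by approximately equal cumulative motion.

    `motion_per_step` is a hypothetical target (mean flow pixels per kept
    transition); it sets the common "visual speed" all datasets map to.
    """
    cum = np.concatenate([[0.0], np.cumsum(flow_magnitudes(frames))])
    n_keep = max(2, int(round(cum[-1] / motion_per_step)) + 1)
    targets = np.linspace(0.0, cum[-1], n_keep)
    # For each motion target, keep the earliest frame that reaches it.
    idx = np.searchsorted(cum, targets)
    return np.unique(np.clip(idx, 0, len(frames) - 1))
```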
Project Page: https://greenvla.github.io/
Code: https://github.com/greenvla/GreenVLA
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization (2026)
- Clutter-Resistant Vision-Language-Action Models through Object-Centric and Geometry Grounding (2025)
- PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation (2026)
- GR-Dexter Technical Report (2025)
- Mind to Hand: Purposeful Robotic Control via Embodied Reasoning (2025)
- CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos (2026)
- Robotic VLA Benefits from Joint Learning with Motion Image Diffusion (2025)
Hello, colleagues! Great work! A question: can your approach be deployed as-is on the Unitree G1, or would that require changes to the codebase/pipeline logic?