Robo-Imagine: An Image-Text Conditioned, Generalized Robotic Video Generation Model Across Embodiments and Tasks

📁 GitHub Repository 📄 Paper 🚀 Code (Coming Soon)

Abstract

Robot learning aims to complete diverse tasks. End-to-end VLA models achieve significant performance but struggle with data dependency. Recently, video generation models (VGMs) used as world models have offered a new perspective, enabling robots to generalize across tasks by 'imagining' future states. However, computational bottlenecks limit such models to short videos, making them unsuitable for long-term tasks. In this paper, we train an image-text conditioned robotic video generation model, named Robo-Imagine, that aims to generate long-term robotic manipulation videos with visual-semantic-dynamic conformity. We build an autoregressive long-term video generation pipeline that uses a VLM as a task-completion verifier, in which Robo-Imagine is designed with dynamic and geometric consistency augmentation to produce continuous and smooth motion between clips. Systematic experiments show that we can generate long-term robotic manipulation videos with continuous motion, achieving an average success rate improvement of 150% over the method without the augmentation. Our method also generalizes effectively to unseen cases. Finally, the generated video is mapped into end-effector actions through a visual inverse dynamics model.

Robo-Imagine working principle


Principle Flow Diagram

Robo-Imagine is an image-text conditioned, generalized robotic video generation model across different embodiments and tasks.

Framework of our Robo-Imagine model


Framework of our Robo-Imagine model

Robo-Imagine is a U-Net-based diffusion video generation model. Its inputs are a text instruction, an embodiment specification, and three condition images \(O_{i}, O_{j}, O_{k}\), where \((i, j, k) \sim \mathrm{Uniform}\{(a, b, c) \mid 0 \leq a \leq b \leq c \leq n\}\) and \(n\) is the video length. The tuple \((O_{i}, O_{j}, O_{k}, i, j, k)\) encodes the visual and dynamic (motion direction and speed) condition of the task, allowing coherent and smooth long-term videos to be generated when a VLM is used as the task-completion evaluator for autoregressive generation. The output is the generated robotic manipulation video. To produce long videos, we employ the VLM as a task-completion evaluator: if the VLM reports that the task is incomplete, Robo-Imagine keeps generating video until the task is fulfilled. Finally, a visual inverse dynamics model maps the generated guiding video into 7-DoF robot-arm actions for execution in simulation or the real world.
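A minimal sketch of this pipeline is shown below. The interfaces (`generate_clip`, `task_complete`, `predict_action`) and the inference-time choice of condition frames are illustrative assumptions, not the released code; only the index-sampling rule mirrors the conditioning scheme described above.

```python
import random


def sample_condition_indices(n: int) -> tuple[int, int, int]:
    """Sample (i, j, k) uniformly with 0 <= i <= j <= k <= n."""
    i, j, k = sorted(random.choices(range(n + 1), k=3))
    return i, j, k


def generate_long_video(model, vlm, idm, instruction, embodiment, init_frames, max_clips=10):
    """Autoregressive long-horizon generation: extend the video clip by clip until the
    VLM judges the task complete, then map frame pairs to 7-DoF actions."""
    video = list(init_frames)
    for _ in range(max_clips):
        # Three condition frames (with their indices) carry the visual state and the
        # motion direction/speed of the rollout so far into the next clip.
        i, j, k = sample_condition_indices(len(video) - 1)
        conditions = [(video[i], i), (video[j], j), (video[k], k)]
        clip = model.generate_clip(instruction, embodiment, conditions)  # hypothetical API
        video.extend(clip)
        # The VLM acts as the task-completion verifier on the latest generated frame.
        if vlm.task_complete(instruction, video[-1]):                    # hypothetical API
            break
    # A visual inverse dynamics model maps consecutive frames to end-effector actions.
    actions = [idm.predict_action(video[t], video[t + 1]) for t in range(len(video) - 1)]
    return video, actions
```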

Results


Robo-Imagine on RT-1 dataset

We evaluate our model on the RT-1 dataset, using a test set of 599 robotic manipulation videos spanning diverse tasks.

Seen Tasks Results on RT-1 dataset

pick rxbar chocolate

close top drawer

knock water bottle over

move sponge near apple

open middle drawer

Unseen Tasks Results on RT-1 dataset

place green rice chip bag into top drawer

move 7up can near orange can

move sponge near pepsi can

pick water bottle from middle shelf of fridge

place green jalapeno chip bag into bottom drawer

Robo-Imagine on Bridge dataset

We evaluate our model on the Bridge dataset, using a test set of 513 robotic manipulation videos spanning diverse tasks.

Seen Tasks Results on Bridge dataset

close fridge

flip pot upright in sink (distractors)

open microwave

put knife on cutting board

take sushi off plate

Unseen Tasks Results on Bridge dataset

move yellow cloth to left of fork

pick up mushroom and put it inside of the pot

place the vessel near banana

put lid on stove

take the sushi out of the pot and move it to the back left corner

Experimental results of Robo-Imagine generalization ability



(a) The distribution of 599 tasks across 6 major categories in the RT-1 test dataset.
(b) The success rates for the 6 major task categories in the RT-1 test dataset.
(c) The distribution of 513 tasks across 10 major categories in the Bridge test dataset.
(d) The success rates for the 10 major task categories in the Bridge test dataset.

VLM-Enabled-Autoregressive Long-Term Video Generation


In our model, the Vision-Language Model (VLM) serves as a task-completion evaluator, improving the quality of generated videos by keeping them aligned with the task requirements over extended temporal horizons. Comparing the w/ VLM and w/o VLM configurations shows that incorporating the VLM module significantly improves video generation quality and adherence to task goals.
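A hedged sketch of how this completion check could be posed to the VLM follows; the prompt wording and the `vlm.query` call are illustrative assumptions rather than the actual interface.

```python
def build_completion_prompt(instruction: str) -> str:
    """Phrase the task-completion check as a yes/no question for the VLM."""
    return (
        f"The robot was instructed to: '{instruction}'.\n"
        "Based on the image, has this task been fully completed? "
        "Answer with exactly 'yes' or 'no'."
    )


def task_complete(vlm, instruction: str, last_frame) -> bool:
    """Return True if the VLM judges the latest generated frame to show a completed task."""
    answer = vlm.query(image=last_frame, text=build_completion_prompt(instruction))  # hypothetical API
    return answer.strip().lower().startswith("yes")
```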



w/o VLM

place the blue cup over the black cup

place the cloth from the red bowl in the grey bowl

move the cloth on the table

pick up the bottle and put it in the pot



w/ VLM

Generalization in unseen simulation environment


To further assess video generation capability, we first evaluated our model on RT-1 tasks in a simulation environment. The environment was customized to match the RT-1 dataset, with items randomly placed on the drawers. With task prompts different from those used in the RT-1 dataset, the model demonstrated robust generalization across tasks and environments, accurately picking up the apple and the lying coke can, as shown in the video below. To test the model's generalization further, we created four distinct scenarios in the simulation environment, varying the background, desktop category, desktop texture, coke can pose, and spatial configuration. As shown in the video, the robot successfully completed the assigned task in each case, picking up the coke can stably, further affirming the model's adaptability and robustness.

Real-world sampled experiment


Our model was subsequently deployed on robotic arms, and the corresponding experiments were conducted in real-world settings. These trials yielded performance that closely matched expectations, validating the model's practical applicability. The video results highlight the model's effective transferability across different real-world scenarios, underscoring its versatility and robustness in practical applications.

move orange near apple

move carrot to drying rack

pick orange

put orange on plate

pick apple

move banana to drying

pick banana

pick carrot

pick up chip

pick orange

Real-world experimental results demonstrate that the Robo-Imagine model exhibits remarkable generalization capabilities in practical applications. Across diverse scenarios, different robotic arms, and varying task instructions, the model consistently completes tasks effectively, showcasing robust adaptability across environments and embodiments.

Paper