Robot learning aims to complete diverse tasks. End-to-end vision-language-action (VLA) models achieve significant performance but struggle with data dependency. Recently, video generation models (VGMs) used as world models have provided a new perspective, enabling robots to generalize across tasks by 'imagining' future states. However, a computing bottleneck limits generated videos to short lengths, which is not applicable to long-term tasks. In this paper, we train an image-text conditioned robotic video generation model, named RoboImagine, aiming to generate long-term robotic manipulation videos with visual-semantic-dynamic conformity. We build an autoregressive long-term video generation pipeline based on a VLM acting as a task-completion verifier, in which RoboImagine is designed with dynamic and geometric consistency augmentation to obtain continuous, smooth motion between clips. Systematic experiments show that we are able to generate long-term robotic manipulation videos with continuous motion, achieving an average success rate improvement of 150% over the variant without the augmentation, and that our method generalizes effectively to unseen cases. The generated video is mapped into end-effector actions through a visual inverse dynamics model.
RoboImagine is an image-text conditioned robotic video generation model that generalizes across different embodiments and tasks.
RoboImagine is a U-Net-based diffusion video generation model. Inputs: a text instruction, an embodiment specification, and three condition images \(O_{i}, O_{j}, O_{k}\), where \((i, j, k) \sim \mathrm{Uniform}\{(a, b, c) \mid 0 \leq a \leq b \leq c \leq n\}\) and \(n\) is the video length. The tuple \((O_{i}, O_{j}, O_{k}, i, j, k)\) encodes the visual and dynamic (motion direction and speed) conditions of the task, allowing coherent, smooth long-term videos to be generated when the VLM is used as a task-completion evaluator for autoregressive generation. Output: the generated robotic manipulation video. We employ a VLM as the task-completion evaluator to generate long videos: if the VLM reports that the task is incomplete, RoboImagine continues generating video until the task is fulfilled. A visual inverse dynamics model then maps the generated guiding video into 7-DoF robot arm actions for execution in simulation or the real world.
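To make the pipeline concrete, the sketch below illustrates the autoregressive generation loop described above. It is a minimal illustration, not the released implementation: `generate_clip`, `vlm_is_task_complete`, and `inverse_dynamics` are hypothetical callables standing in for RoboImagine, the VLM verifier, and the visual inverse dynamics model, and sampling the condition frames from all frames generated so far is an assumption of this sketch. The index sampler draws \((i, j, k)\) uniformly over \(\{(a, b, c) \mid 0 \leq a \leq b \leq c \leq n\}\) via a bijection with 3-element subsets of \(\{0, \dots, n+2\}\).

```python
# Minimal sketch of the autoregressive long-term generation loop, assuming
# hypothetical callables for the three learned components (RoboImagine,
# the VLM verifier, and the visual inverse dynamics model).
import random
from typing import Callable, List, Tuple


def sample_condition_indices(n: int) -> Tuple[int, int, int]:
    """Draw (i, j, k) uniformly from {(a, b, c) | 0 <= a <= b <= c <= n}.

    Ordered triples with repetition correspond one-to-one to 3-element
    subsets of {0, ..., n + 2} via (a, b, c) -> (a, b + 1, c + 2).
    """
    a, b, c = sorted(random.sample(range(n + 3), 3))
    return a, b - 1, c - 2


def imagine_and_act(
    instruction: str,
    embodiment: str,
    observed_frames: List,            # frames observed so far (length n + 1)
    generate_clip: Callable,          # (text, embodiment, cond) -> new frames
    vlm_is_task_complete: Callable,   # (text, frames) -> bool
    inverse_dynamics: Callable,       # (frame_t, frame_t+1) -> 7-DoF action
    max_clips: int = 10,
):
    frames = list(observed_frames)
    for _ in range(max_clips):
        n = len(frames) - 1
        # NOTE: conditioning on frames sampled from the whole history is an
        # assumption made for this sketch.
        i, j, k = sample_condition_indices(n)
        # The condition images together with their indices carry the visual
        # state and the motion direction/speed context for the next clip.
        cond = [(frames[i], i), (frames[j], j), (frames[k], k)]
        frames.extend(generate_clip(instruction, embodiment, cond))
        if vlm_is_task_complete(instruction, frames):
            break
    # Map consecutive generated frames to 7-DoF end-effector actions.
    actions = [inverse_dynamics(f0, f1) for f0, f1 in zip(frames, frames[1:])]
    return frames, actions
```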
We evaluate our model on the RT-1 test set, which contains 599 robotic manipulation videos covering diverse tasks.
pick rxbar chocolate
close top drawer
knock water bottle over
move sponge near apple
open middle drawer
place green rice chip bag into top drawer
move 7up can near orange can
move sponge near pepsi can
pick water bottle from middle shelf of fridge
place green jalapeno chip bag into bottom drawer
We evaluate our model on the Bridge test set, which contains 513 robotic manipulation videos covering diverse tasks.
close fridge
flip pot upright in sink (distractors)
open microwave
put knife on cutting board
take sushi off plate
move yellow cloth to left of fork
pick up mushroom and put it inside of the pot
place the vessel near banana
put lid on stove
take the sushi out of the pot and move it to the back left corner
(a) The distribution of 599 tasks across 6 major categories in the RT-1 test set.
(b) The success rates of the 6 major task categories in the RT-1 test set.
(c) The distribution of 513 tasks across 10 major categories in the Bridge test set.
(d) The success rates of the 10 major task categories in the Bridge test set.
In our pipeline, the Vision-Language Model (VLM) serves as a task-completion evaluator, improving the quality of generated videos by ensuring alignment with task requirements over extended temporal horizons. Comparing the VLM and No-VLM configurations shows that incorporating the VLM module significantly improves video generation quality and yields closer adherence to task goals than the model without it.
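As a rough illustration of how such a check can be posed, the sketch below frames task completion as a yes/no question to a generic multimodal model over the last few generated frames. The exact prompt and VLM interface are not specified here, so `query_vlm` is a hypothetical callable and the prompt wording is an assumption.

```python
# Minimal sketch of a VLM-based task-completion check, assuming a generic
# multimodal interface; `query_vlm` is a hypothetical callable that takes a
# text prompt and a list of images and returns the model's text answer.
from typing import Callable, List

COMPLETION_PROMPT = (
    "The robot was instructed to: '{instruction}'.\n"
    "Looking at the final frames of the video, has the task been completed? "
    "Answer with a single word: yes or no."
)


def vlm_is_task_complete(
    instruction: str,
    frames: List,
    query_vlm: Callable[[str, List], str],
    num_tail_frames: int = 4,
) -> bool:
    """Return True if the VLM judges the task in the generated video complete."""
    prompt = COMPLETION_PROMPT.format(instruction=instruction)
    answer = query_vlm(prompt, frames[-num_tail_frames:])
    return answer.strip().lower().startswith("yes")
```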
w/o VLM
place the blue cup over the black cup
place the cloth from the red bowl in the grey bowl
move the cloth on the table
pick up the bottle and put it in the pot
w/ VLM
To further assess video generation capability, we first evaluated our model on RT-1 tasks in a simulation environment. The environment was customized to match the RT-1 dataset, with items randomly placed on the drawers. Using task prompts different from those in the RT-1 dataset, the model demonstrated robust generalization across tasks and environments, accurately picking up the apple and the lying coke can, as shown in the video below. To test generalization further, we created four distinct scenarios in the simulation environment, varying the backgrounds, desktop categories, desktop textures, random coke can poses, and spatial configurations. As the video shows, the robot successfully completed the assigned task in each case, stably picking up the coke can, further affirming the model's adaptability and robustness.
Our model was subsequently deployed on robotic arms, and the corresponding experiments were conducted in real-world settings. These trials yielded performance closely matching expectations, providing strong validation of the model's practical applicability. The video results highlight the model's effective transferability across different real-world scenarios, underscoring its versatility and robustness in practical applications.
move orange near apple
move carrot to drying rack
pick orange
put orange on plate
pick apple
move banana to drying
pick banana
pick carrot
pick up chip
pick orange
Real-world experimental results demonstrate that the RoboImagine model exhibits remarkable generalization capabilities. Under diverse scenarios, across different robotic arms, and with varying task instructions, the model consistently performs tasks effectively, showcasing robust adaptability across environments and embodiments.