Solving Robotics Problems in Zero-Shot with Vision-Language Models

Anonymous Authors
In Submission to ICLR 2025


Abstract

  • We introduce Wonderful Team, a multi-agent visual LLM (VLLM) framework for solving robotics problems in the zero-shot regime.
  • With careful engineering, we can prompt a single off-the-shelf VLLM to handle all aspects of a robotics task, from high-level planning to low-level location extraction and action execution.
  • Wonderful Team builds on recent advances in multi-agent LLMs to partition tasks across an agent hierarchy, making it self-corrective and able to effectively solve even long-horizon tasks. The ability of agent teams to self-correct is striking, raising success rates in many environments by over 90%.
  • Extensive experiments on both real and simulated environments demonstrate the system's capability to handle a variety of robotic tasks, including manipulation, visual goal-reaching, and visual reasoning, all in a zero-shot manner.
  • These results underscore a key point: vision-language models have progressed rapidly in the past year and should be strongly considered as a backbone for robotics problems going forward.

Method

We propose Wonderful Team, a zero-shot, single-model, multi-agent system for solving visual robotics tasks. Taking inspiration from recent advances in the multi-agent LLM literature, our system employs specialized agents that collaboratively manage different aspects of a task, from high-level planning to low-level execution, within a single integrated system. In particular, each agent is responsible for a separate component of task execution: planning, object identification and localization, action proposal, memory, or self-correction.
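The division of labor described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Agent`, `Team`, `stub_model`, the role strings, and the semicolon-separated plan format are all hypothetical, and the model call is replaced by a stub so the example runs without any VLLM. The point it shows is that one shared model can serve several role-specific agents, with a checker agent triggering retries and a memory log recording each attempt.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for a single shared vision-language model:
# it receives a role name and a query and returns text.
ModelFn = Callable[[str, str], str]


@dataclass
class Agent:
    role: str
    model: ModelFn

    def ask(self, query: str) -> str:
        # Every agent calls the same underlying model, differing only by role.
        return self.model(self.role, query)


@dataclass
class Team:
    model: ModelFn
    max_retries: int = 2
    memory: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        self.planner = Agent("planner", self.model)    # high-level planning
        self.grounder = Agent("grounder", self.model)  # object identification and location
        self.actor = Agent("actor", self.model)        # action proposal
        self.checker = Agent("checker", self.model)    # self-correction

    def run(self, task: str) -> List[str]:
        # Assumed plan format: steps separated by semicolons.
        plan = [s.strip() for s in self.planner.ask(task).split(";") if s.strip()]
        accepted: List[str] = []
        for step in plan:
            for _ in range(self.max_retries + 1):
                location = self.grounder.ask(step)
                action = self.actor.ask(f"{step} @ {location}")
                verdict = self.checker.ask(action)
                self.memory.append(f"{step} | {action} | {verdict}")
                if verdict == "ok":
                    accepted.append(action)
                    break  # move on to the next step; otherwise retry
        return accepted


def stub_model(role: str, query: str) -> str:
    # Canned answers that stand in for real model output (illustrative only).
    if role == "planner":
        return "pick apple; place apple in red area"
    if role == "grounder":
        return "(0.4, 0.6)"
    if role == "actor":
        return f"move_to {query}"
    if role == "checker":
        return "ok"
    raise ValueError(f"unknown role: {role}")


if __name__ == "__main__":
    team = Team(stub_model)
    for action in team.run("Place the apple in the matching area"):
        print(action)
```

In a real system the stub would be replaced by calls to an actual VLLM with role-specific prompts and image inputs, and the checker's verdict would be based on fresh observations rather than on the proposed action text alone.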


Results Overview

Note that the results presented here are based on a selection of VIMABench tasks. For more information, please refer to the paper.

Execution Examples

[Real] Spatial Planning

Shake the yogurt

Shake the yogurt

Draw a star

Wipe the plate

[Real] Fruit Placement

“Place each fruit in the area that matches its color, if such an area exists.”

[Four examples, each shown as an initial frame and an execution video]

[Real] Price Ranking

“Based on the price tags and any discounts on the fruits, rank them from the most expensive to the cheapest and place them in the corresponding bowl.”

[Four examples, each shown as an initial frame and an execution video]

[Real] Superhero Companions

“Fruits and snacks of similar color make perfect companions. Distribute the unmatched items from the top left corner to the superheroes to help each of them have companion pairs.”

[Four examples, each shown as an initial frame and an execution video]

[Sim] Same Texture

“Put all objects with the same texture as {Object} into it”

[Two examples, each shown as an initial frame and an execution video]

[Sim] Same Shape

“Put all objects with the same profile as {Object} into it”

[Two examples, each shown as an initial frame and an execution video]

[Sim] Pick & Restore

“Put {Object_1} into {Object_2} then {Object_3}. Finally restore it into its original container”

[Two examples, each shown as an initial frame and an execution video]