Wonderful Team: Zero-Shot Physical Task Planning with Visual LLMs

Abstract

We introduce Wonderful Team, a multi-agent Vision Large Language Model (VLLM) framework designed for zero-shot, high-level physical task planning.
Our approach integrates VLLMs into perception, control, and planning, enabling action sequence generation for novel environments from just an input image and a task description.
Wonderful Team employs a multi-agent architecture to break down complex tasks, effectively handling longer sequences and ensuring reliability through self-correction.
In real and simulated experiments, Wonderful Team demonstrates notable improvements over existing methods:

- +40% success rate on VimaBench compared to NLaP
- +30% improvement on standard tasks and +70% improvement on semantic reasoning tasks over Trajectory Generators

These results highlight the growing potential of VLLMs in high-level robotic planning, reflecting their rapid advancement over the past year.
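
The idea of breaking a task into steps and repairing the plan through self-correction can be illustrated with a minimal sketch. This is not the authors' implementation: `Step`, `plan_with_self_correction`, `propose_plan`, and `check_plan` are hypothetical names, and the two agents are stubbed with deterministic logic in place of VLLM calls so that the loop runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Step:
    action: str  # "pick" or "place"
    target: str  # object or location name


def plan_with_self_correction(
    task: str,
    propose: Callable[[str, Optional[str]], List[Step]],
    check: Callable[[List[Step]], Optional[str]],
    max_rounds: int = 3,
) -> List[Step]:
    """Ask a planner for a plan, let a checker critique it, and retry with feedback."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        plan = propose(task, feedback)
        feedback = check(plan)
        if feedback is None:  # the checker found no problem
            return plan
    raise RuntimeError(f"no valid plan after {max_rounds} rounds: {feedback}")


# Hypothetical stand-in for a planning agent (a VLLM call in a real system).
def propose_plan(task: str, feedback: Optional[str]) -> List[Step]:
    if feedback is None:
        # Deliberately flawed first draft: it places without picking.
        return [Step("place", "bowl")]
    return [Step("pick", "apple"), Step("place", "bowl")]


# Hypothetical stand-in for a verification agent.
def check_plan(plan: List[Step]) -> Optional[str]:
    holding = False
    for step in plan:
        if step.action == "pick":
            holding = True
        elif step.action == "place":
            if not holding:
                return "place issued before any pick"
            holding = False
    return None


plan = plan_with_self_correction("put the apple in the bowl", propose_plan, check_plan)
```

In this sketch the first draft is rejected and the second is accepted, so `plan` holds a pick step followed by a place step. Bounding the number of rounds is a design choice of the sketch, not something stated in the abstract.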

Execution Examples

[Real] Spatial Planning

Shake the yogurt

Draw a star

Wipe the plate

[Real] Fruit Placement

“Place each fruit in the area that matches its color, if such an area exists.”

[Four examples, each shown as an initial frame and an execution video.]
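
The matching rule in this instruction, including its "if such an area exists" condition, can be sketched as follows. The sketch assumes the colors of fruits and areas have already been extracted from the image; the function name `match_by_color` and the example objects are hypothetical and not taken from the experiments.

```python
from typing import Dict, List, Tuple


def match_by_color(fruits: Dict[str, str], areas: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pair each fruit with the area of the same color; skip fruits with no match."""
    area_by_color = {color: name for name, color in areas.items()}
    return [
        (fruit, area_by_color[color])
        for fruit, color in fruits.items()
        if color in area_by_color  # "if such an area exists"
    ]


# Hypothetical scene: no purple area exists, so the grape stays where it is.
fruits = {"banana": "yellow", "apple": "red", "grape": "purple"}
areas = {"area_1": "red", "area_2": "yellow"}
placements = match_by_color(fruits, areas)
```

Here `placements` pairs the banana with `area_2` and the apple with `area_1`, and the grape is left out because no area matches its color.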

[Real] Price Ranking

“Based on the price tags and any discounts on the fruits, rank them from the most expensive to the cheapest and place them in the corresponding bowl.”

[Four examples, each shown as an initial frame and an execution video.]
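
The arithmetic behind this task is to apply each discount, sort by the resulting price, and map ranks to bowls. A minimal sketch follows, assuming prices and discounts have already been read from the tags. The function name `rank_by_price`, the example prices, and the convention that the first bowl receives the most expensive fruit are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def rank_by_price(items: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return item names sorted from most to least expensive after discounts.

    Each value is (listed_price, discount_fraction), e.g. (2.00, 0.50) for 50% off.
    """
    effective = {
        name: price * (1.0 - discount)
        for name, (price, discount) in items.items()
    }
    return sorted(effective, key=effective.get, reverse=True)


# Hypothetical price tags: the apple is listed at 2.00 but is 50% off.
items = {"apple": (2.00, 0.50), "mango": (3.00, 0.0), "kiwi": (1.50, 0.0)}
ranking = rank_by_price(items)

# Assumed convention: bowl_1 receives the most expensive fruit.
bowls = {f"bowl_{i + 1}": name for i, name in enumerate(ranking)}
```

The discount changes the order: the apple has the second-highest listed price but the lowest effective price (1.00), so it ends up in the last bowl.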

[Real] Superhero Companions

“Fruits and snacks of similar color make perfect companions. Distribute the unmatched items from the top left corner to the superheroes to help each of them have companion pairs.”

[Four examples, each shown as an initial frame and an execution video.]

[Sim] Same Texture

“Put all objects with the same texture as {Object} into it”

[Two examples, each shown as an initial frame and an execution video.]

[Sim] Same Shape

“Put all objects with the same profile as {Object} into it”

[Two examples, each shown as an initial frame and an execution video.]

[Sim] Pick & Restore

“Put {Object_1} into {Object_2} then {Object_3}. Finally restore it into its original container”

[Two examples, each shown as an initial frame and an execution video.]
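
This instruction expands into a fixed pick-and-place sequence, with one detail the prompt leaves implicit: the original container must be remembered before the object is first moved. A minimal sketch of that expansion follows. The function name `pick_and_restore`, the step format, and the object names are hypothetical.

```python
from typing import List, Tuple


def pick_and_restore(obj: str, containers: List[str], origin: str) -> List[Tuple[str, str]]:
    """Move obj through each container in order, then back to where it started."""
    steps: List[Tuple[str, str]] = []
    for destination in [*containers, origin]:
        steps.append(("pick", obj))
        steps.append(("place", destination))
    return steps


# Hypothetical instance: the block starts in the tray.
steps = pick_and_restore("block", ["bowl", "pan"], origin="tray")
```

With two intermediate containers the sequence has six steps, and the final step places the block back in the tray.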

Citation

If you find Wonderful Team useful in your research or applications, please consider citing it using the following BibTeX entry:

    @misc{wang2024wonderfulteam,
          title={Wonderful Team: Zero-Shot Physical Task Planning with Visual LLMs}, 
          author={Zidan Wang and Rui Shen and Bradly Stadie},
          year={2024},
          eprint={2407.19094},
          archivePrefix={arXiv},
          primaryClass={cs.AI},
          url={https://arxiv.org/abs/2407.19094}, 
    }
