Wonderful Team: Zero-Shot Physical Task Planning with Visual LLMs

Anonymous Authors
In Submission to TMLR


Abstract

  • We introduce Wonderful Team, a multi-agent Vision Large Language Model (VLLM) framework designed for zero-shot, high-level physical task planning.
  • Our approach integrates VLLMs into perception, control, and planning, enabling action sequence generation for novel environments from just an input image and a task description.
  • Wonderful Team employs a multi-agent architecture to break down complex tasks, effectively handling longer sequences and ensuring reliability through self-correction.
  • In real and simulated experiments, Wonderful Team demonstrates notable improvements over existing methods:
    • +40% success rate on VIMABench compared to NLaP
    • +30% improvement on standard tasks and +70% improvement on semantic reasoning tasks over Trajectory Generators
  • These results highlight the growing potential of VLLMs in high-level robotic planning, reflecting their rapid advancement over the past year.

Method

We propose Wonderful Team, a zero-shot, single-model, multi-agent system for solving visual robotics tasks. Taking inspiration from recent advances in the multi-agent LLM literature, the system assigns each agent a separate component of task execution: planning, object identification and localization, action proposal, memory, and self-correction. Together, these specialized agents cover the full path from high-level planning to low-level execution within a single integrated system.
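To make the division of labor concrete, here is a minimal Python sketch of how role-specialized agents sharing one model might be wired together. It is an illustration, not the authors' implementation: `make_agent`, `run_team`, the role names, the prompt format, and `toy_model` (a canned stand-in for a real VLLM call) are all hypothetical, and the sketch reads "single-model" as meaning every agent queries the same underlying model.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: (prompt, image bytes) -> text reply.
Model = Callable[[str, bytes], str]


def make_agent(model: Model, role: str) -> Model:
    """Wrap the shared model with a role tag, giving one 'agent' per role."""
    def agent(message: str, image: bytes) -> str:
        return model(f"[{role}] {message}", image)
    return agent


def run_team(model: Model, task: str, image: bytes,
             max_retries: int = 2) -> Tuple[List[str], List[str]]:
    """Plan, locate, propose, and check each step; retry rejected actions."""
    planner = make_agent(model, "planner")
    locator = make_agent(model, "locator")
    proposer = make_agent(model, "action")
    checker = make_agent(model, "checker")

    memory: List[str] = []   # running log shared across steps
    actions: List[str] = []  # accepted actions, in execution order

    for step in planner(task, image).split(";"):
        step = step.strip()
        where = locator(step, image)
        feedback = ""
        for _ in range(max_retries + 1):
            action = proposer(f"{step} at {where} {feedback}".strip(), image)
            verdict = checker(action, image)
            memory.append(f"{action} -> {verdict}")
            if verdict == "ok":
                actions.append(action)
                break
            # Self-correction: feed the rejection back into the next proposal.
            feedback = f"(previous attempt rejected: {verdict})"
        else:
            memory.append(f"gave up on: {step}")
    return actions, memory


def toy_model(prompt: str, image: bytes) -> str:
    """Canned replies standing in for a VLLM; all values are made up."""
    role, _, message = prompt.partition("] ")
    role = role.lstrip("[")
    if role == "planner":
        return "pick banana; place banana in yellow area"
    if role == "locator":
        return "(0.4, 0.6)"
    if role == "action":
        return f"do: {message}"
    if role == "checker":
        # Reject a first 'place' attempt to exercise the retry path.
        accepted = "pick" in message or "rejected" in message
        return "ok" if accepted else "missed target"
    raise ValueError(f"unknown role: {role}")


if __name__ == "__main__":
    plan, log = run_team(toy_model, "Place the banana in the yellow area", b"")
    for line in log:
        print(line)
```

In this toy run the checker rejects the first placement attempt, so the log records one failed proposal before the corrected one is accepted. A real system would replace `toy_model` with calls to an actual VLLM and parse structured outputs rather than plain strings.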


Results Overview

Note that the results presented here are based on a selection of VIMABench tasks. For more information, please refer to the paper.

Execution Examples

[Real] Spatial Planning

Shake the yogurt

Shake the yogurt

Draw a star

Wipe the plate

[Real] Fruit Placement

“Place each fruit in the area that matches its color, if such an area exists.”

[4 examples, each shown as an initial frame and an execution video]

[Real] Price Ranking

“Based on the price tags and any discounts on the fruits, rank them from the most expensive to the cheapest and place them in the corresponding bowl.”

[4 examples, each shown as an initial frame and an execution video]
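The price-ranking instruction implies a small computation once the tags have been read from the image: apply any discount, sort by final price, and map ranks to bowls. The Python sketch below illustrates that reasoning only. The item names, prices, discounts, and bowl labels are invented for the example, and the helper names (`Item`, `final_price`, `rank_to_bowls`) are hypothetical. It also assumes that "the corresponding bowl" means bowls ordered from most to least expensive.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Item:
    name: str
    tag_price: float
    discount: float = 0.0  # fraction off the tag price, e.g. 0.5 for "50% off"


def final_price(item: Item) -> float:
    """Price after applying the discount shown next to the tag."""
    return item.tag_price * (1.0 - item.discount)


def rank_to_bowls(items: List[Item], bowls: List[str]) -> Dict[str, str]:
    """Assign the most expensive item to the first bowl, and so on."""
    ranked = sorted(items, key=final_price, reverse=True)
    return {bowl: item.name for bowl, item in zip(bowls, ranked)}


if __name__ == "__main__":
    # Made-up prices: the apple has the highest tag but a 50% discount.
    fruits = [
        Item("apple", 3.00, discount=0.5),
        Item("banana", 2.00),
        Item("orange", 1.00),
    ]
    print(rank_to_bowls(fruits, ["bowl_1", "bowl_2", "bowl_3"]))
```

With these invented numbers the discounted apple drops below the banana, which is the kind of case where ranking by tag price alone would give the wrong placement.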

[Real] Superhero Companions

“Fruits and snacks of similar color make perfect companions. Distribute the unmatched items from the top left corner to the superheroes to help each of them have companion pairs.”

[4 examples, each shown as an initial frame and an execution video]

[Sim] Same Texture

“Put all objects with the same texture as {Object} into it”

[2 examples, each shown as an initial frame and an execution video]

[Sim] Same Shape

“Put all objects with the same profile as {Object} into it”

[2 examples, each shown as an initial frame and an execution video]

[Sim] Pick & Restore

“Put {Object_1} into {Object_2} then {Object_3}. Finally restore it into its original container”

[2 examples, each shown as an initial frame and an execution video]