The challenge is structured in two rounds. In Phase 1, all teams are evaluated on a standardized real-robot bimanual manipulation benchmark using a shared dataset. In Phase 2, the top teams are selected and supported in iterative rollout collection to further improve their policies via post-training.
Phase 1 — Evaluation Tracks

Phase 1 evaluates submitted policies on 3–5 standardized real-robot tasks across two ranking tracks. Teams perform offline training on the released dataset (expert data + baseline success and failure rollouts) and submit policies for benchmark evaluation.
- Per-task ranking: policies are evaluated and ranked on each task independently. Best suited to specialists optimizing for individual task performance.
- Joint ranking: a single policy is evaluated jointly across all tasks. Best suited to generalist policies that share representations across skills.
Phase 1 includes three real-robot bimanual manipulation tasks. Select a task below to view a sample rollout video and its scoring criteria.
Tasks: insert-mouse-battery · tower-of-hanoi-game · seal-water-bottle-cap

For Phase 1, we release a standardized real-robot bimanual dataset designed for offline training. The dataset contains four complementary components:
- Expert demonstrations: high-quality human teleoperation demonstrations on the benchmark tasks.
- Success rollouts: trajectories where a baseline policy successfully completed the task.
- Failure rollouts: trajectories where the baseline policy failed, a useful negative signal for post-training.
- Human feedback: human interventions, corrections, and preference labels collected during baseline rollouts.
All four components are intended for offline training in Phase 1. Hosted on Hugging Face Datasets.
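To illustrate how the success and failure rollouts can be combined for offline post-training, here is a minimal sketch that separates baseline rollouts into positive and negative pools. The rollout dicts and the `success` flag are illustrative assumptions, not the released dataset's actual schema.

```python
def split_rollouts(rollouts):
    """Separate baseline rollouts into success (positive) and failure
    (negative) pools. The "success" key is an assumed field name; the
    real dataset schema may differ."""
    positives = [r for r in rollouts if r["success"]]
    negatives = [r for r in rollouts if not r["success"]]
    return positives, negatives

# Toy example with the three Phase 1 task names:
rollouts = [
    {"task": "insert-mouse-battery", "success": True},
    {"task": "tower-of-hanoi-game", "success": False},
    {"task": "seal-water-bottle-cap", "success": True},
]
pos, neg = split_rollouts(rollouts)
print(len(pos), len(neg))  # 2 1
```

Failure rollouts can then serve as negative examples (e.g., for filtering or advantage weighting), while successes and expert demonstrations form the positive training pool.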
Download on Hugging Face

Step-by-step instructions for joining the challenge — including environment setup, dataset access, baseline reproduction, evaluation protocol, and submission format — are hosted in our GitHub repositories. Reference code and starter scripts are provided so teams can get up and running quickly.
Based on Phase 1 results, the top 3 teams advance to Phase 2. During this round, selected teams collect rollouts by deploying their own policies and use those rollouts to further improve performance via post-training.
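The Phase 2 deploy-collect-post-train iteration can be sketched as follows. All function names (`deploy_and_collect`, `post_train`, `evaluate`) are placeholders for team-specific components, not part of the challenge codebase; this only illustrates the loop structure.

```python
def phase2_loop(policy, n_rounds, deploy_and_collect, post_train, evaluate):
    """Iteratively deploy a policy, collect rollouts on the robot, and
    post-train on the collected data. Returns the final policy and the
    per-round evaluation history."""
    history = []
    for _ in range(n_rounds):
        rollouts = deploy_and_collect(policy)  # run the policy on the robot
        policy = post_train(policy, rollouts)  # improve with the new data
        history.append(evaluate(policy))       # track benchmark performance
    return policy, history
```

In practice each round would interleave real-robot deployment with offline updates; the sketch abstracts both behind the placeholder callables.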
| # | Team / Model | insert-mouse-battery | | tower-of-hanoi-game | | seal-water-bottle-cap | | Average | |
|---|---|---|---|---|---|---|---|---|---|
|   |   | Score | SR | Score | SR | Score | SR | Score | SR |
| — | Baseline (pi05) | 87 | 80% | 47 | 45% | 47 | 35% | 60.3 | 53.3% |
| # | Team / Model | insert-mouse-battery | | tower-of-hanoi-game | | seal-water-bottle-cap | | Average | |
|---|---|---|---|---|---|---|---|---|---|
|   |   | Score | SR | Score | SR | Score | SR | Score | SR |
| — | Baseline (pi05) | — | — | — | — | — | — | — | — |
Score: progress score per task (higher is better). SR: success rate over rollouts. The final ranking uses the Average column.
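The Average column is the unweighted mean of the per-task values, which can be checked against the baseline's leaderboard numbers. A minimal sketch (the function name is illustrative, not the official evaluation code):

```python
def average_row(scores, success_rates):
    """Average per-task progress scores and success rates, rounded to
    one decimal place as shown on the leaderboard."""
    avg_score = sum(scores) / len(scores)
    avg_sr = sum(success_rates) / len(success_rates)
    return round(avg_score, 1), round(avg_sr, 1)

# Baseline pi05 on the three Phase 1 tasks (from the table above):
scores = [87, 47, 47]     # progress scores
srs = [80.0, 45.0, 35.0]  # success rates in %

print(average_row(scores, srs))  # (60.3, 53.3)
```

This reproduces the baseline's Average entries (60.3 and 53.3%), which determine the final ranking.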
Registration

Team registration is now open. Please fill in the Google Form below with your team information and track preference. Submission instructions will be shared with registered teams.
Register your team

| Time | Event |
|---|---|
| 9:00 - 10:30 | Invited Talks Karl Pertsch · Abhishek Gupta · more speakers TBA |
| 10:30 - 11:00 | Coffee Break & Participating Teams' Policy Deployment Demonstrations |
| 11:00 - 12:00 | Participating Team Reports |
| 12:00 - 13:00 | Panel Discussion & Audience Q&A |
Discussion topics for this workshop include, but are not limited to:
For any questions about the workshop or the challenge, please reach out to:
Shiduo Zhang · sdzhang23@m.fudan.edu.cn