The challenge is structured in two rounds. In Phase 1, all teams are evaluated on a standardized real-robot bimanual manipulation benchmark using a shared dataset. In Phase 2, the top teams are selected and supported in iterative rollout collection to further improve their policies via post-training.
Phase 1 — Evaluation Tracks

Phase 1 evaluates submitted policies on 3–5 standardized real-robot tasks across two ranking tracks. Teams perform offline training on the released dataset (expert data + baseline success and failure rollouts) and submit policies for benchmark evaluation.
- Per-task ranking: policies are evaluated and ranked on each task independently. Best suited to specialists optimizing for individual task performance.
- Joint ranking: a single policy is evaluated jointly across all tasks. Best suited to generalist policies that share representations across skills.
Phase 1 includes three real-robot bimanual manipulation tasks. Select a task below to view a sample rollout video and its scoring criteria.
Tasks: insert-mouse-battery · tower-of-hanoi-game · seal-water-bottle-cap

For Phase 1, we release a standardized real-robot bimanual dataset designed for offline training. The dataset contains four complementary components:
- Expert demonstrations: high-quality human teleoperation demonstrations on the benchmark tasks.
- Success rollouts: trajectories where a baseline policy successfully completed the task.
- Failure rollouts: trajectories where the baseline policy failed, a useful negative signal for post-training.
- Human feedback: human interventions, corrections, and preference labels collected during baseline rollouts.
All four components are intended for offline training in Phase 1. Hosted on Hugging Face Datasets.
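To illustrate how the success and failure rollouts can be combined for offline post-training, here is a minimal sketch that separates baseline rollouts into positive and negative pools. The rollout dicts and the `success` flag are illustrative assumptions, not the released dataset's actual schema.

```python
def split_rollouts(rollouts):
    """Separate baseline rollouts into success (positive) and failure
    (negative) pools. The "success" key is an assumed field name; the
    real dataset schema may differ."""
    positives = [r for r in rollouts if r["success"]]
    negatives = [r for r in rollouts if not r["success"]]
    return positives, negatives

# Toy example with the three Phase 1 task names:
rollouts = [
    {"task": "insert-mouse-battery", "success": True},
    {"task": "tower-of-hanoi-game", "success": False},
    {"task": "seal-water-bottle-cap", "success": True},
]
pos, neg = split_rollouts(rollouts)
print(len(pos), len(neg))  # 2 1
```

Failure rollouts can then serve as negative examples (e.g., for filtering or advantage weighting), while successes and expert demonstrations form the positive training pool.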
Download on Hugging Face

Step-by-step instructions for joining the challenge — including environment setup, dataset access, baseline reproduction, evaluation protocol, and submission format — are hosted in our GitHub repositories. Reference code and starter scripts are provided so teams can get up and running quickly.
Based on Phase 1 results, the top 3 teams advance to Phase 2. During this round, selected teams collect rollouts by deploying their own policies and use those rollouts to further improve performance via post-training.
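The Phase 2 deploy-collect-post-train iteration can be sketched as follows. All function names (`deploy_and_collect`, `post_train`, `evaluate`) are placeholders for team-specific components, not part of the challenge codebase; this only illustrates the loop structure.

```python
def phase2_loop(policy, n_rounds, deploy_and_collect, post_train, evaluate):
    """Iteratively deploy a policy, collect rollouts on the robot, and
    post-train on the collected data. Returns the final policy and the
    per-round evaluation history."""
    history = []
    for _ in range(n_rounds):
        rollouts = deploy_and_collect(policy)  # run the policy on the robot
        policy = post_train(policy, rollouts)  # improve with the new data
        history.append(evaluate(policy))       # track benchmark performance
    return policy, history
```

In practice each round would interleave real-robot deployment with offline updates; the sketch abstracts both behind the placeholder callables.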
| # | Team / Model | insert-mouse-battery | | tower-of-hanoi-game | | seal-water-bottle-cap | | Average | |
|---|---|---|---|---|---|---|---|---|---|
|   |   | Score | SR | Score | SR | Score | SR | Score | SR |
| — | Baseline (pi05) | 87 | 80% | 47 | 45% | 47 | 35% | 60.3 | 53.3% |
| # | Team / Model | insert-mouse-battery | | tower-of-hanoi-game | | seal-water-bottle-cap | | Average | |
|---|---|---|---|---|---|---|---|---|---|
|   |   | Score | SR | Score | SR | Score | SR | Score | SR |
| — | Baseline (pi05) | — | — | — | — | — | — | — | — |
Score: progress score per task (higher is better). SR: success rate over rollouts. The final ranking uses the Average column.
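The Average column is the unweighted mean of the per-task values, which can be checked against the baseline's leaderboard numbers. A minimal sketch (the function name is illustrative, not the official evaluation code):

```python
def average_row(scores, success_rates):
    """Average per-task progress scores and success rates, rounded to
    one decimal place as shown on the leaderboard."""
    avg_score = sum(scores) / len(scores)
    avg_sr = sum(success_rates) / len(success_rates)
    return round(avg_score, 1), round(avg_sr, 1)

# Baseline pi05 on the three Phase 1 tasks (from the table above):
scores = [87, 47, 47]     # progress scores
srs = [80.0, 45.0, 35.0]  # success rates in %

print(average_row(scores, srs))  # (60.3, 53.3)
```

This reproduces the baseline's Average entries (60.3 and 53.3%), which determine the final ranking.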
Registration

Team registration is now open. Please fill in the Google Form below with your team information and track preference. Submission instructions will be shared with registered teams.
Register your team

| Time | Event |
|---|---|
| 9:00 - 10:30 | Invited Talks Karl Pertsch · Abhishek Gupta · more speakers TBA |
| 10:30 - 11:00 | Coffee Break & Participating Teams' Policy Deployment Demonstrations |
| 11:00 - 12:00 | Participating Team Reports |
| 12:00 - 13:00 | Panel Discussion & Audience Q&A |
Discussion topics for this workshop include, but are not limited to:
For any questions about the workshop or the challenge, please reach out to:
Shiduo Zhang · sdzhang23@m.fudan.edu.cn