ATOM-Bench: A Real-World Benchmark for Atomic Skills and Compositional Generalization in Manipulation Policies

Generalist manipulation policies are increasingly presented as foundation models for robotic control, but their real-world generalization remains difficult to diagnose. A policy may succeed on demonstrated tasks while still failing to execute fine-grained atomic skills or to recombine learned skills in new task structures. We introduce \textbf{ATOM-Bench}, a real-world benchmark for evaluating both atomic skills and compositional generalization in manipulation policies. ATOM-Bench factorizes tabletop manipulation into motor atoms and instruction atoms, and contains 30 atomic tasks and 24 held-out compositional tasks across paired single-arm and dual-arm robot tracks. We collect 3,000 human demonstrations for atomic fine-tuning and release both the demonstration data and the evaluation rollout data to support reproducible real-world evaluation. Policies are fine-tuned on atomic tasks and evaluated on both atomic skill acquisition and held-out compositional tasks. We further introduce Atomic Score (AS) and Compositional Failure Share (CFS) to distinguish failures caused by weak atomic skills from failures caused by limited compositional reuse. Through 2,700 physical rollouts across five representative manipulation policies, we find that current policies can acquire simple instruction-grounding skills but still struggle with fine-grained motor atoms, counting, and logical filtering. More importantly, strong atomic performance does not reliably transfer to held-out compositional tasks. ATOM-Bench provides a diagnostic testbed for studying whether failures arise from weak motor execution, poor instruction grounding, or limited compositional reuse.
