Many everyday robot manipulation skills are affordance-dependent: success is determined by whether the robot contacts the functional object region required by the subsequent action. Current simulation data generators obtain contacts from generic grasp estimators or from per-object manual contact annotations. Generic estimators rank stable grasps without task semantics and often select contacts that are misaligned with the downstream action, while manual contact annotations must be rewritten for each new object and task. To address these limitations, we introduce AffordSim, a scalable data generator and benchmark that integrates open-vocabulary 3D affordance prediction into simulation-based trajectory generation. Given a natural-language task description, AffordSim synthesizes a task-relevant scene, emits affordance queries, grounds them on object surfaces, samples region-conditioned grasps, and selects executable candidates with motion planning. It further randomizes object pose, texture, lighting, image noise, and cross-viewpoint backgrounds to support sim-to-real transfer. We instantiate AffordSim as a 50-task benchmark across diverse manipulation skills, five robot embodiments, and 500+ rigid and articulated objects. AffordSim reaches 93% of the trajectory-collection success rate obtained with manual contact annotations on affordance-critical tasks and 89% on hard composite tasks. Vision-language-action policies trained on AffordSim data transfer zero-shot to a real Franka FR3, reaching 24% average success.
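To illustrate the last two stages of the pipeline described above, the following minimal Python sketch shows one way region-conditioned grasp selection could be organized. It is not AffordSim's implementation. The names `Grasp` and `select_grasp`, the per-point affordance score list, the 0.5 threshold, and the `is_reachable` callback (a stand-in for motion planning) are all hypothetical and introduced only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Grasp:
    """Hypothetical grasp candidate: contacted surface point and a stability score."""
    contact_index: int
    stability: float


def select_grasp(
    candidates: Sequence[Grasp],
    affordance: Sequence[float],
    is_reachable: Callable[[Grasp], bool],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> Optional[Grasp]:
    """Return the most stable reachable grasp whose contact lies in the affordance region."""
    # Region conditioning: keep grasps whose contact point scores at or above the threshold.
    in_region = [g for g in candidates if affordance[g.contact_index] >= threshold]
    # Rank the surviving candidates by stability, most stable first.
    ranked = sorted(in_region, key=lambda g: g.stability, reverse=True)
    # Stand-in for motion planning: accept the first candidate the check allows.
    for grasp in ranked:
        if is_reachable(grasp):
            return grasp
    return None


# Toy mug with three surface points: 0 = body, 1 = handle, 2 = rim.
affordance_scores = [0.1, 0.9, 0.2]  # made-up scores for a "grasp the handle" query
candidates = [Grasp(0, 0.95), Grasp(1, 0.7), Grasp(1, 0.8)]
chosen = select_grasp(candidates, affordance_scores, lambda g: True)
```

In this toy case a purely stability-ranked estimator would pick the body grasp (stability 0.95), whereas the region-conditioned selection returns the handle grasp with stability 0.8, which mirrors the misalignment between stable grasps and task-relevant contacts that motivates the approach.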