Many essential manipulation tasks - such as food preparation, surgery, and craftsmanship - remain intractable for autonomous robots. These tasks are characterized not only by contact-rich, force-sensitive dynamics, but also by their "implicit" success criteria: unlike pick-and-place, task quality in these domains is continuous and subjective (e.g. how well a potato is peeled), making quantitative evaluation and reward engineering difficult. We present a learning framework for such tasks, using peeling with a knife as a representative example. Our approach follows a two-stage pipeline: first, we learn a robust initial policy via force-aware data collection and imitation learning, enabling generalization across object variations; second, we refine the policy through preference-based finetuning using a learned reward model that combines quantitative task metrics with qualitative human feedback, aligning policy behavior with human notions of task quality. Using only 50-200 peeling trajectories, our system achieves over 90% average success rates on challenging produce including cucumbers, apples, and potatoes, with performance improving by up to 40% through preference-based finetuning. Remarkably, policies trained on a single produce category exhibit strong zero-shot generalization to unseen in-category instances and to out-of-distribution produce from different categories while maintaining over 90% success rates.