Robot Foundation Models (RFMs) represent a promising approach to developing general-purpose home robots. Given the broad capabilities of RFMs, users will inevitably ask an RFM-based robot to perform tasks that the RFM was not trained or evaluated on. Because failure in these cases can be relatively costly, it is crucial that users understand the risks associated with attempting novel tasks. Furthermore, an informed user who understands an RFM's capabilities will know what situations and tasks the robot can handle. In this paper, we study how non-roboticists interpret performance information from RFM evaluations. These evaluations typically report task success rate (TSR) as the primary performance metric. While TSR is intuitive to experts, it is necessary to validate whether novices also use this information as intended. Toward this end, we conducted a study in which users saw real evaluation data, including TSR, failure case descriptions, and videos, from multiple published RFM research projects. The results highlight that non-experts not only use TSR in a manner consistent with expert expectations but also highly value other information types, such as failure cases, which are not often reported in RFM evaluations. Furthermore, we find that users want access to both real data from previous evaluations of the RFM and estimates from the robot about how well it will do on a novel task.