Robot decision-making increasingly relies on data-driven human prediction models when operating around people. While these models are known to mispredict in out-of-distribution interactions, only a subset of prediction errors impact downstream robot performance. We propose characterizing such "system-level" prediction failures via the mathematical notion of regret: high-regret interactions are precisely those in which mispredictions degraded closed-loop robot performance. We further introduce a probabilistic generalization of regret that calibrates failure detection across disparate deployment contexts and renders regret compatible with both reward-based and reward-free (e.g., generative) planners. In simulated autonomous driving interactions and hardware-deployed social navigation interactions, we show that our system-level failure metric can be used offline to automatically extract closed-loop human-robot interactions that state-of-the-art generative human predictors and robot planners previously struggled with. We further find that the very presence of high-regret data during human predictor fine-tuning is highly predictive of robot re-deployment performance improvements. Fine-tuning with the informative but significantly smaller high-regret dataset (23% of deployment data) is competitive with fine-tuning on the full deployment dataset, indicating a promising avenue for efficiently mitigating system-level human-robot interaction failures. Project website: https://cmu-intentlab.github.io/not-all-errors/
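The core idea, regret as the performance gap between the robot's actual closed-loop behavior and what it could have achieved with perfect human prediction, can be sketched as follows. This is a minimal illustration, not the paper's implementation; all function and field names (`regret`, `extract_high_regret`, `achieved`, `oracle`) are hypothetical, and the scalar threshold stands in for the paper's calibrated probabilistic generalization.

```python
# Illustrative sketch (hypothetical names): regret-based filtering of
# logged human-robot interactions into a small fine-tuning dataset.

def regret(achieved_reward, oracle_reward):
    """Regret = reward achievable with perfect human prediction
    minus the reward actually earned in closed loop."""
    return oracle_reward - achieved_reward

def extract_high_regret(interactions, threshold):
    """Keep only interactions whose mispredictions actually degraded
    closed-loop robot performance (system-level failures)."""
    return [x for x in interactions
            if regret(x["achieved"], x["oracle"]) > threshold]

logs = [
    {"id": 0, "achieved": 9.5, "oracle": 10.0},  # benign misprediction
    {"id": 1, "achieved": 2.0, "oracle": 10.0},  # system-level failure
]
print([x["id"] for x in extract_high_regret(logs, threshold=1.0)])  # → [1]
```

Only interaction 1 is retained: its misprediction cost the robot substantial reward, whereas interaction 0's prediction error was inconsequential downstream, mirroring the abstract's point that not all prediction errors matter.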