Large language models (LLMs) showcase many desirable traits for intelligent and helpful robots. However, they are also known to hallucinate predictions. This issue is exacerbated in consumer robotics, where LLM hallucinations may cause robots to confidently execute plans that are contrary to user goals, to rely more frequently on human assistance, or to fail to ask for help at all. In this work, we present LAP, a novel approach for utilizing off-the-shelf LLMs, alongside scene and object Affordances, in robotic Planners that minimize harmful hallucinations and know when to ask for help. Our key finding is that calculating and leveraging a scene affordance score, a measure of whether a given action is possible in the provided scene, helps to mitigate hallucinations in LLM predictions and better aligns the LLM's confidence measure with the probability of success. We specifically propose and test three different affordance scores, which can be used independently or in tandem to improve performance across different use cases. The most successful of these individual scores is obtained by prompting an LLM to determine whether a given action is possible and safe in the given scene and computing the score from the LLM's response. Through experiments in both simulation and the real world, on tasks with a variety of ambiguities, we show that LAP significantly increases success rate and decreases the amount of human intervention required relative to prior art. For example, in our real-world testing paradigm, LAP decreases the human help rate of previous methods by over 33% at a success rate of 70%.
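To make the role of the scene affordance score concrete, the following is a minimal sketch of one way such a score could be combined with an LLM's confidence over candidate actions. The abstract does not specify the combination rule, so everything here is an assumption for illustration: the function names `combine_scores` and `plan_or_ask`, the multiplicative weighting, the renormalization, and the fixed `threshold` are hypothetical and are not taken from LAP itself.

```python
from typing import Dict, List, Tuple


def combine_scores(llm_conf: Dict[str, float],
                   affordance: Dict[str, float]) -> Dict[str, float]:
    """Weight each action's LLM confidence by its affordance score, then renormalize.

    Assumption: a simple product is used here; the paper may combine scores differently.
    Actions with no affordance entry are treated as not possible (score 0.0).
    """
    weighted = {a: llm_conf[a] * affordance.get(a, 0.0) for a in llm_conf}
    total = sum(weighted.values())
    if total == 0.0:
        # No action is considered possible in the scene.
        return {a: 0.0 for a in llm_conf}
    return {a: w / total for a, w in weighted.items()}


def plan_or_ask(llm_conf: Dict[str, float],
                affordance: Dict[str, float],
                threshold: float = 0.3) -> Tuple[str, List[str]]:
    """Execute when exactly one action clears the threshold; otherwise ask for help.

    Assumption: a fixed threshold stands in for whatever calibration the method uses.
    """
    combined = combine_scores(llm_conf, affordance)
    candidates = sorted(a for a, s in combined.items() if s >= threshold)
    if len(candidates) == 1:
        return "execute", candidates
    return "ask_for_help", candidates


if __name__ == "__main__":
    # Hypothetical scene: there is no orange, so the LLM's confidence in
    # "pick up orange" is a hallucination that the affordance score removes.
    llm_conf = {"pick up apple": 0.5, "pick up orange": 0.4, "pick up banana": 0.1}
    affordance = {"pick up apple": 1.0, "pick up orange": 0.0, "pick up banana": 1.0}
    print(plan_or_ask(llm_conf, affordance))
```

In this toy scene, the unweighted LLM confidences leave two plausible actions and would trigger a request for help, while the affordance weighting removes the impossible action and leaves a single candidate. This is the kind of reduction in unnecessary human intervention that the abstract describes, shown here under the stated assumptions only.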