Consider a robot tasked with tidying a desk that holds a meticulously constructed Lego sports car. A human may recognize that it is not socially appropriate to disassemble the sports car and put it away as part of "tidying". How can a robot reach that conclusion? Although large language models (LLMs) have recently been used to enable social reasoning, grounding this reasoning in the real world remains challenging. To reason in the real world, robots must go beyond passively querying LLMs and *actively gather information from the environment* that is required to make the right decision. For instance, after detecting an occluded car, the robot may need to actively perceive the car to determine whether it is an advanced model car made out of Lego bricks or a toy car built by a toddler. We propose an approach that leverages an LLM and a vision-language model (VLM) to help a robot actively perceive its environment so that it can perform grounded social reasoning. To evaluate our framework at scale, we release the MessySurfaces dataset, which contains images of 70 real-world surfaces that need to be cleaned. We additionally illustrate our approach with a robot on two carefully designed surfaces. We find an average 12.9% improvement on the MessySurfaces benchmark and an average 15% improvement on the robot experiments over baselines that do not use active perception. The dataset, code, and videos of our approach can be found at https://minaek.github.io/groundedsocialreasoning.