S-EQA: Tackling Situational Queries in Embodied Question Answering

We present and tackle the problem of Embodied Question Answering (EQA) with Situational Queries (S-EQA) in a household environment. Unlike prior EQA work, which tackles simple queries that directly reference target objects and their quantifiable properties, EQA with situational queries (such as "Is the bathroom clean and dry?") is more challenging: the agent must not only determine which target objects pertain to the query, but also needs a consensus on their states for the query to be answerable. Towards this objective, we first introduce a novel Prompt-Generate-Evaluate (PGE) scheme that wraps around an LLM's output to create a dataset of unique situational queries, corresponding consensus object information, and predicted answers. PGE maintains uniqueness among the generated queries using multiple forms of semantic similarity. We validate the generated dataset via a large-scale user study conducted on M-Turk, and introduce it as S-EQA, the first dataset tackling EQA with situational queries. Our user study establishes the authenticity of S-EQA, with a high 97.26% of the generated queries being deemed answerable given the consensus object data. Conversely, we observe a low correlation of 46.2% between the LLM-predicted answers and the human-evaluated ones, indicating the LLM's poor capability in directly answering situational queries, while establishing S-EQA's usability in providing a human-validated consensus for an indirect solution. We evaluate S-EQA via Visual Question Answering (VQA) on VirtualHome, which, unlike other simulators, contains several objects with modifiable states that also appear visually different upon modification, enabling us to set a quantitative benchmark for S-EQA. To the best of our knowledge, this is the first work to introduce EQA with situational queries, and also the first to use a generative approach for query creation.
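As a rough illustration of the uniqueness step in PGE, the sketch below keeps a generated query only if it is not too similar to any query already accepted. It is a minimal sketch under stated assumptions, not the procedure used to build S-EQA: the two similarity measures (token-level Jaccard overlap and a character-level ratio from Python's `difflib`) are lexical stand-ins for the "multiple forms of semantic similarity" mentioned above, and the function names and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard overlap between two queries (stand-in measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def char_ratio(a: str, b: str) -> float:
    """Character-level similarity ratio from difflib (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_unique(candidate: str, accepted: list[str], threshold: float = 0.8) -> bool:
    """Return True if no accepted query is too similar under either measure.

    The threshold is an assumed value chosen for illustration only.
    """
    for prior in accepted:
        score = max(jaccard(candidate, prior), char_ratio(candidate, prior))
        if score >= threshold:
            return False
    return True


def filter_unique(candidates: list[str], threshold: float = 0.8) -> list[str]:
    """Keep candidates in order, dropping any that duplicate an earlier one."""
    accepted: list[str] = []
    for query in candidates:
        if is_unique(query, accepted, threshold):
            accepted.append(query)
    return accepted


if __name__ == "__main__":
    generated = [
        "Is the bathroom clean and dry?",
        "Is the bathroom clean and dry?",
        "Is the kitchen ready for cooking dinner?",
    ]
    for query in filter_unique(generated):
        print(query)
```

Taking the maximum over several measures means a candidate is rejected as soon as any one of them flags it as a near-duplicate, which errs on the side of a smaller but more diverse query set. An embedding-based similarity could be added as another measure inside `is_unique` without changing the rest of the loop.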
