The rapid advancement of Vision-Language Models (VLMs) has driven significant progress in Embodied Question Answering (EQA), enhancing agents' abilities in language understanding and reasoning within complex and realistic scenarios. However, EQA in real-world settings remains challenging, as human-posed questions often contain noise that can mislead an agent's exploration and responses, a problem especially acute for language beginners and non-expert users. To address this, we introduce NoisyEQA, a benchmark designed to evaluate an agent's ability to recognize and correct noisy questions. The benchmark covers four common types of noise found in real-world applications, generated through an automated dataset-creation framework: Latent Hallucination Noise, Memory Noise, Perception Noise, and Semantic Noise. We further propose a Self-Correction prompting mechanism and a new evaluation metric to improve and measure both noise-detection capability and answer quality. Our comprehensive evaluation reveals that current EQA agents often fail to detect noise in questions, producing responses that frequently contain erroneous information. With our Self-Correction prompting mechanism, the accuracy of agent answers improves substantially.