Central Answer Modeling for an Embodied Multi-LLM System

Embodied Question Answering (EQA) is an important problem in which an agent explores an environment to answer user queries. In the existing literature, EQA has been studied exclusively in single-agent scenarios, where exploration can be time-consuming and costly. In this work, we consider EQA in a multi-agent framework in which multiple large language model (LLM)-based agents independently answer queries about a household environment. To generate one answer for each query, we use the individual responses to train a Central Answer Model (CAM) that aggregates them into a single robust answer. While prior Question Answering (QA) work has used a central module that builds on answers from multiple LLM-based experts, we specifically apply this framework to embodied LLM-based agents, which must first physically explore their environment to become experts on it before answering questions. Our work is the first to use a central answer model framework with embodied agents that must rely on exploring an unknown environment. We set up a variation of EQA in which, instead of exploring the environment after the question is asked, the agents first explore the environment for a set amount of time and then answer a set of queries. Using CAM, we observe a $46\%$ higher EQA accuracy compared with aggregation methods for LLM ensembles, such as voting schemes and debates. CAM does not require any form of agent communication and therefore avoids the associated costs. We ablate CAM with various nonlinear (neural network, random forest, decision tree, XGBoost) and linear (logistic regression classifier, SVM) algorithms. We experiment in various topological graph environments and examine the case where one of the agents is malicious and purposely contributes responses it believes to be wrong.
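To make the aggregation idea concrete, the following is a minimal sketch of a central answer model trained on independent agent responses and compared against majority voting. Everything specific in it is an assumption made for illustration: the binary yes/no answer encoding, the three simulated agents and their accuracies (one of which is "malicious" and usually answers incorrectly), and the hand-written logistic regression trained by gradient descent. It does not reproduce the paper's environments, features, or reported numbers.

```python
import math
import random

def simulate(n, accuracies, rng):
    """Simulate n yes/no queries. responses[i][j] is agent j's answer (+1/-1)."""
    responses, truths = [], []
    for _ in range(n):
        truth = rng.choice([1, -1])
        # Each agent independently answers correctly with its own probability.
        row = [truth if rng.random() < acc else -truth for acc in accuracies]
        responses.append(row)
        truths.append(truth)
    return responses, truths

def majority_vote(row):
    """Baseline aggregation: unweighted vote (odd agent count, so no ties)."""
    return 1 if sum(row) > 0 else -1

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logreg(X, y, epochs=300, lr=0.5):
    """Fit logistic regression with full-batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, truth in zip(X, y):
            target = 1.0 if truth == 1 else 0.0
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) - target
            for j in range(d):
                grad_w[j] += err * row[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def cam_predict(row, w, b):
    """Central answer: sign of the learned weighted sum of agent answers."""
    return 1 if sum(wi * xi for wi, xi in zip(w, row)) + b > 0 else -1

def accuracy(preds, truths):
    return sum(p == t for p, t in zip(preds, truths)) / len(truths)

rng = random.Random(0)
# Hypothetical agents: reliable, mediocre, and malicious (mostly wrong).
agent_accuracies = [0.9, 0.6, 0.1]
X_train, y_train = simulate(1000, agent_accuracies, rng)
X_test, y_test = simulate(500, agent_accuracies, rng)

weights, bias = train_logreg(X_train, y_train)
vote_acc = accuracy([majority_vote(r) for r in X_test], y_test)
cam_acc = accuracy([cam_predict(r, weights, bias) for r in X_test], y_test)

print(f"majority vote accuracy: {vote_acc:.3f}")
print(f"central model accuracy: {cam_acc:.3f}")
print("learned agent weights:", [round(wi, 2) for wi in weights])
```

In this toy setting the learned model can assign a negative weight to the agent that is usually wrong, which an unweighted vote cannot do, and it needs no communication between agents: it only consumes their final answers.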
