To enhance Large Language Model (LLM) capabilities, multi-agent debates have been introduced, in which multiple LLMs discuss solutions to a problem over several rounds of debate. However, LLMs often produce incorrect responses that appear deceptively confident, which can mislead other agents. This is partly because agents do not express their confidence levels during standard debates. To address this, we introduce DebUnc, a multi-agent debate framework that uses uncertainty metrics to assess agent confidence levels. We adapt the LLM attention mechanism to adjust token weights based on confidence levels, and also explore using textual prompts to convey confidence. Our evaluations across various benchmarks show that attention-based methods are particularly effective, and suggest that performance will continue to improve as uncertainty metrics evolve. The code is available at https://github.com/lukeyoffe/debunc.