Large Language Models (LLMs) show significant potential in multi-agent negotiation tasks, yet evaluation in this domain remains challenging due to a lack of robust, generalizable benchmarks. Abdelnabi et al. (2024) introduce a negotiation benchmark based on Scorable Games, aiming to provide a highly complex and realistic evaluation framework for LLMs. Our work investigates the reproducibility of the claims made for this benchmark and provides a deeper understanding of its usability and generalizability. We replicate the original experiments, extend them to additional models, and introduce new metrics to assess negotiation quality and the fairness of the evaluation. Our findings reveal that while the benchmark is indeed complex, model comparison on it is ambiguous, raising questions about its objectivity. We further identify limitations in the experimental setup, particularly in information-leakage detection and in the thoroughness of the ablation study. By analyzing the behavior of a wider range of models on an extended version of the benchmark, we surface insights that give potential users additional context. Our results highlight the importance of context in model-comparative evaluations.