This paper introduces RuleArena, a novel and challenging benchmark designed to evaluate the ability of large language models (LLMs) to follow complex, real-world rules in reasoning. Covering three practical domains -- airline baggage fees, NBA transactions, and tax regulations -- RuleArena assesses LLMs' proficiency in handling intricate natural language instructions that demand long-context understanding, logical reasoning, and accurate mathematical computation. Two key attributes distinguish RuleArena from traditional rule-based reasoning benchmarks: (1) it extends beyond standard first-order logic representations, and (2) it is grounded in authentic, practical scenarios, providing insights into the suitability and reliability of LLMs for real-world applications. Our findings reveal several notable limitations in LLMs: (1) they struggle to identify and apply the appropriate rules, frequently confusing similar but distinct regulations; (2) they cannot consistently perform accurate mathematical computations, even when they correctly identify the relevant rules; and (3) their overall performance on the benchmark is poor. These results highlight significant challenges in advancing LLMs' rule-guided reasoning capabilities in real-life applications.
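To make the task concrete, the sketch below illustrates the kind of rule-guided computation RuleArena targets, using a deliberately simplified airline baggage-fee scenario. The `BagFeeRule` structure, the `compute_fee` helper, and all fee values are illustrative assumptions, not the benchmark's actual rules or task format; the real rules are stated in natural language and are considerably more intricate.

```python
from dataclasses import dataclass

# Hypothetical, simplified baggage-fee tiers for illustration only.
@dataclass
class BagFeeRule:
    max_weight_kg: float  # rule applies to bags up to this weight
    fee_usd: float        # flat fee charged under this rule

# Tiers are checked in order; a solver that confuses the 23 kg and
# 32 kg tiers (similar but distinct rules) computes the wrong fee.
RULES = [
    BagFeeRule(max_weight_kg=23.0, fee_usd=35.0),
    BagFeeRule(max_weight_kg=32.0, fee_usd=100.0),
]

def compute_fee(weights_kg: list[float]) -> float:
    """Sum the fee for each checked bag using the first applicable rule."""
    total = 0.0
    for w in weights_kg:
        for rule in RULES:
            if w <= rule.max_weight_kg:
                total += rule.fee_usd
                break
        else:
            # No tier covers this bag: it exceeds the allowed weight.
            raise ValueError(f"No rule covers a {w} kg bag")
    return total

if __name__ == "__main__":
    # Two bags: one standard, one heavy but still within the top tier.
    print(compute_fee([20.0, 28.0]))  # 35.0 + 100.0 = 135.0
```

Where this deterministic sketch encodes rule selection and arithmetic explicitly, an LLM must recover both steps from long natural-language regulations, which is precisely where the two dominant failure modes above (rule confusion and computation errors) arise.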