Large Language Models (LLMs) have demonstrated impressive performance across a wide range of applications; however, assessing their reasoning capabilities remains a significant challenge. In this paper, we introduce an evaluation framework grounded in group and symmetry principles, which have played a crucial role in fields such as physics and mathematics and which offer another way to evaluate LLM capabilities. While the proposed framework is general, to showcase the benefits of employing these properties we focus on arithmetic reasoning and investigate the performance of these models on four group properties: closure, identity, inverse, and associativity. Our findings reveal that the LLMs studied in this work struggle to preserve group properties across different test regimes. In the closure test, we observe biases towards specific outputs and an abrupt degradation in performance from 100% to 0% beyond a specific sequence length. The models also perform poorly in the identity test, which corresponds to adding irrelevant information to the context, and show sensitivity in the inverse test, which examines the robustness of the model with respect to negation. In addition, we demonstrate that breaking problems down into smaller steps helps LLMs in the associativity test. To support these tests, we have developed a synthetic dataset, which will be released.
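For reference, the four group properties named in the abstract can be stated formally as below. This is the standard textbook definition of a group, not a formulation taken from the paper; the symbols $G$, $\circ$, $e$, and $a^{-1}$ are notation introduced here, and the instance noted in the comments (integers under addition) is an illustrative assumption about the arithmetic setting.

```latex
% Standard group axioms for a set G with a binary operation \circ.
% Illustrative instance (an assumption, not stated in the abstract):
%   G = \mathbb{Z}, \circ = +, e = 0, a^{-1} = -a.
\begin{align*}
\text{Closure:} \quad & \forall a, b \in G:\; a \circ b \in G \\
\text{Identity:} \quad & \exists e \in G \;\; \forall a \in G:\; a \circ e = e \circ a = a \\
\text{Inverse:} \quad & \forall a \in G \;\; \exists a^{-1} \in G:\; a \circ a^{-1} = a^{-1} \circ a = e \\
\text{Associativity:} \quad & \forall a, b, c \in G:\; (a \circ b) \circ c = a \circ (b \circ c)
\end{align*}
```

Under the assumed instance, the identity axiom reads $a + 0 = a$, which is consistent with the abstract's description of the identity test as adding irrelevant information, and the inverse axiom reads $a + (-a) = 0$, which is consistent with its description of the inverse test as probing negation. Associativity, $(a + b) + c = a + (b + c)$, is what permits a long sum to be evaluated in smaller grouped steps.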