This work compares large language models (LLMs) and neuro-symbolic approaches in solving Raven's progressive matrices (RPM), a visual abstract reasoning test that involves the understanding of mathematical rules such as progression or arithmetic addition. Providing the visual attributes directly as textual prompts, which assumes an oracle visual perception module, allows us to measure the models' abstract reasoning capability in isolation. Despite being given such compositionally structured representations from the oracle visual perception and advanced prompting techniques, both GPT-4 and Llama-3 70B cannot achieve perfect accuracy on the center constellation of the I-RAVEN dataset. Our analysis reveals that the root cause lies in the LLMs' weakness in understanding and executing arithmetic rules. As a potential remedy, we analyze the Abductive Rule Learner with Context-awareness (ARLC), a neuro-symbolic approach that learns to reason with vector-symbolic architectures (VSAs). Here, concepts are represented with distributed vectors such that dot products between encoded vectors define a similarity kernel, and simple element-wise operations on the vectors perform addition/subtraction on the encoded values. We find that ARLC achieves almost perfect accuracy on the center constellation of I-RAVEN, demonstrating high fidelity in executing arithmetic rules. To stress the length generalization capabilities of the models, we extend the RPM tests to larger matrices (3×10 instead of the typical 3×3) and larger dynamic ranges of the attribute values (from 10 up to 1000). We find that the LLMs' accuracy on arithmetic rules drops below 10%, especially as the dynamic range expands, while ARLC maintains high accuracy thanks to emulating symbolic computations on top of properly distributed representations. Our code is available at https://github.com/IBM/raven-large-language-models.
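The VSA property described above (dot products as a similarity kernel, element-wise operations acting as addition on the encoded values) can be illustrated with fractional power encoding, a standard VSA technique. The sketch below is illustrative only and assumes nothing about ARLC's actual implementation: a value x is encoded as an element-wise power of a random complex phasor vector, so that binding two encodings by element-wise multiplication yields the encoding of the sum of their values.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 1024  # vector dimensionality (illustrative choice)

# Random base phases define one fixed unit-magnitude phasor vector.
phases = rng.uniform(-np.pi, np.pi, D)

def encode(x):
    # Fractional power encoding: value x -> element-wise x-th power
    # of the base phasor vector, i.e. exp(i * phase * x).
    return np.exp(1j * phases * x)

def sim(a, b):
    # Normalized dot product (real part) acts as a similarity kernel:
    # close to 1 for equal encoded values, near 0 for distant ones.
    return np.real(np.vdot(a, b)) / len(a)

# Element-wise multiplication of encodings adds the encoded values:
# encode(2) * encode(3) == encode(5) exactly, by the law of exponents.
bound = encode(2.0) * encode(3.0)
print(sim(bound, encode(5.0)))  # ~1.0: bound vector matches encode(2+3)
print(sim(bound, encode(4.0)))  # ~0.0: quasi-orthogonal to other values
```

Because addition of values reduces to an element-wise product of high-dimensional vectors, arithmetic rules can be checked by similarity search rather than by symbolic token manipulation, which is one intuition for why such representations generalize over larger dynamic ranges.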