Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions

Algebraic reasoning remains one of the most informative stress tests for large language models, yet current benchmarks provide no mechanism for attributing failure to a specific cause. When a model fails an algebraic problem, a single accuracy score cannot reveal whether the expression was too deeply nested, the operator too uncommon, the intermediate state count too high, or the dependency chain too long. Prior work has studied individual failure modes in isolation, but no framework has varied each complexity factor independently under strict experimental control, nor has any prior system automatically generated and verified problems of increasing complexity to track model progress over time. We introduce a nine-dimension algebraic complexity framework in which each factor is varied independently while all others are held fixed, with problem generation and verification handled by a parametric pipeline that requires no human annotation. Each dimension is grounded in a documented LLM failure mode and captures a structurally distinct aspect of algebraic difficulty, including expression nesting depth, simultaneous intermediate result count, sub-expression complexity, operator hardness, and dependent reasoning chain length. We evaluate seven instruction-tuned models spanning 8B to 235B parameters across all nine dimensions and find that working memory is the dominant scale-invariant bottleneck: every model collapses between 20 and 30 parallel branches regardless of parameter count, pointing to a hard architectural constraint rather than a solvable capacity limitation. Our analysis further identifies a minimal yet diagnostically sufficient subset of five dimensions that together span the full space of documented algebraic failure modes, providing a complete complexity profile of a model's algebraic reasoning capacity.
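To make the idea of varying one factor while holding the others fixed more concrete, the following is a minimal sketch of a parametric generator with automatic verification. It is not the authors' pipeline: the names `Config`, `nested`, `generate`, `verify`, and `sweep` are hypothetical, only two of the nine dimensions (nesting depth and parallel branch count) are illustrated, and checking answers by re-evaluating the generated string is an assumption made for illustration.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Config:
    """Hypothetical complexity settings; each field is one dimension."""
    depth: int = 1        # parenthesis nesting depth of each branch
    branches: int = 2     # parallel sub-expressions summed at the top level
    max_operand: int = 9  # largest integer literal that may appear


def nested(rng, depth, max_operand):
    """Return (text, value) for an expression with exactly `depth` nested parentheses."""
    value = rng.randint(1, max_operand)
    text = str(value)
    for _ in range(depth):
        operand = rng.randint(1, max_operand)
        op = rng.choice("+-*")
        if op == "+":
            value = value + operand
        elif op == "-":
            value = value - operand
        else:
            value = value * operand
        # Wrap the running expression so each iteration adds one nesting level.
        text = f"({text} {op} {operand})"
    return text, value


def generate(cfg, seed):
    """Build one problem: `cfg.branches` independent branches joined by addition."""
    rng = random.Random(seed)
    parts = [nested(rng, cfg.depth, cfg.max_operand) for _ in range(cfg.branches)]
    text = " + ".join(t for t, _ in parts)
    answer = sum(v for _, v in parts)
    return text, answer


def verify(text, answer):
    """Check the stored answer by evaluating the generated arithmetic string.

    eval is applied only to strings produced by `generate`, with builtins disabled.
    """
    return eval(text, {"__builtins__": {}}) == answer


def sweep(dimension, values, base=Config(), seed=0):
    """Vary a single dimension over `values`; all other fields stay at `base`."""
    return [(v, *generate(replace(base, **{dimension: v}), seed)) for v in values]


if __name__ == "__main__":
    for v, text, answer in sweep("branches", [1, 2, 4]):
        print(v, text, "=", answer, verify(text, answer))
```

Because every problem is produced from a seed and a configuration, the ground-truth answer comes for free and no human annotation is needed. A sweep such as `sweep("branches", range(5, 35, 5))` would probe a range like the 20 to 30 branches mentioned above while nesting depth stays constant.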
