The reasoning capabilities of Large Language Models (LLMs) are increasingly attributed to training data quality rather than mere parameter scaling. However, existing data-centric paradigms often equate quality with factuality or diversity, ignoring the internal logical complexity of training samples. In this work, we propose that natural language harbors Structured Logical Knowledge, manifested through entailment relationships and logical topologies. To quantify this, we introduce Structured Logical Knowledge Density (SLKD), a novel metric that measures logical information content by decomposing natural language into executable predicates and logical primitives. Our analysis reveals a significant logical imbalance in current datasets, where logically sparse samples predominate. Consequently, we propose a density-aware re-cognition optimization strategy that prioritizes high-density logical samples to enhance the LLM's reasoning ability. Extensive experiments demonstrate that our approach improves reasoning performance and generalization without increasing total data volume. These results, further validated within a reinforcement learning framework, suggest that elevating logical density is more critical than expanding data scale for realizing the full cognitive potential of LLMs. The released code is available in Appendix C.