Language models are typically trained with a left-to-right (L2R) autoregressive factorization, but L2R is not necessarily the best inductive bias for every task. We therefore investigate whether alternative factorizations of the text distribution can be beneficial, studying right-to-left (R2L) training as a compelling alternative and using multiple-choice questions (MCQs) as a test bed for knowledge extraction and reasoning. Through extensive experiments across model sizes (2B-8B parameters) and training datasets, we find that R2L models can significantly outperform L2R models on several MCQ benchmarks, including logical reasoning, commonsense understanding, and truthfulness assessment tasks. Our analysis reveals that this performance difference may be fundamentally linked to multiple factors, including calibration, computability, and directional conditional entropy. We ablate the impact of these factors through controlled simulation studies on arithmetic tasks, where the contributing factors can be more cleanly disentangled. Our work demonstrates that exploring alternative factorizations of the text distribution can improve LLM capabilities, and it provides theoretical insight into which factorization best approximates the human language distribution and when each reasoning order is more advantageous.
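As a minimal illustration of the factorization idea (a toy sketch, not code from the paper): by the chain rule, any joint distribution over token sequences can be factorized left-to-right or right-to-left, and both products recover the same joint probability. The hypothetical vocabulary and joint distribution below are invented for this demonstration; the point is that L2R and R2L models target the same distribution and differ only in their conditioning direction (and hence in learnability, calibration, and conditional entropy per step).

```python
# Toy demonstration: L2R and R2L chain-rule factorizations of the same
# joint distribution yield identical sequence probabilities.
# VOCAB and JOINT are hypothetical, chosen only for illustration.
from itertools import product

VOCAB = ["a", "b"]
# A hypothetical joint distribution over length-3 sequences (sums to 1).
JOINT = dict(zip(product(VOCAB, repeat=3),
                 [0.20, 0.05, 0.10, 0.15, 0.05, 0.20, 0.10, 0.15]))

def marginal(part, direction):
    """Probability that a sequence starts (L2R) or ends (R2L) with `part`."""
    total = 0.0
    for seq, p in JOINT.items():
        window = seq[:len(part)] if direction == "l2r" else seq[-len(part):]
        if window == part:
            total += p
    return total

def chain_rule(seq, direction):
    """Reconstruct the joint probability token by token in the given direction."""
    prob = 1.0
    for k in range(1, len(seq) + 1):
        # Conditional p(next token | tokens seen so far in this direction),
        # computed as a ratio of marginals; the product telescopes to the joint.
        sub = seq[:k] if direction == "l2r" else seq[-k:]
        if k == 1:
            denom = 1.0
        else:
            prev = seq[:k - 1] if direction == "l2r" else seq[-(k - 1):]
            denom = marginal(prev, direction)
        prob *= marginal(sub, direction) / denom
    return prob

for seq in JOINT:
    assert abs(chain_rule(seq, "l2r") - JOINT[seq]) < 1e-12
    assert abs(chain_rule(seq, "r2l") - JOINT[seq]) < 1e-12
```

In practice, R2L training is equivalent to running a standard autoregressive trainer on reversed token sequences; the distinction studied here is which direction makes the per-step conditionals easier to model.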