Language models are typically trained with a left-to-right (L2R) autoregressive factorization, but L2R is not necessarily the best inductive bias for every task. We therefore investigate whether alternative factorizations of the text distribution can be beneficial, studying right-to-left (R2L) training as a compelling alternative and using multiple-choice questions (MCQs) as a test bed for knowledge extraction and reasoning. Through extensive experiments across model sizes (2B-8B parameters) and training datasets, we find that R2L models can significantly outperform L2R models on several MCQ benchmarks, including logical reasoning, commonsense understanding, and truthfulness assessment tasks. Our analysis reveals that this performance difference may be fundamentally linked to multiple factors, including calibration, computability, and directional conditional entropy. We study the impact of these factors through controlled simulations on arithmetic tasks, where they can be more cleanly disentangled. Our work demonstrates that exploring alternative factorizations of the text distribution can improve LLM capabilities, and it provides theoretical insight into which factorization best approximates the human language distribution and when each reasoning order is likely to be more advantageous. Our code and checkpoints are released at https://github.com/apple/ml-reversal-blessing.
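For concreteness, the two factorizations compared above can be written as standard chain-rule decompositions of the same joint distribution over a token sequence $x_{1:T}$, differing only in the conditioning direction (notation is ours, introduced for illustration rather than taken from the paper body):
\[
p_{\mathrm{L2R}}(x_{1:T}) = \prod_{t=1}^{T} p\left(x_t \mid x_{<t}\right),
\qquad
p_{\mathrm{R2L}}(x_{1:T}) = \prod_{t=1}^{T} p\left(x_t \mid x_{>t}\right).
\]
Both decompositions are exact, so any difference between L2R and R2L models must come from how easily each set of conditionals can be learned and used, which is what the factors above (calibration, computability, directional conditional entropy) are meant to capture.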