Despite remarkable advances in coding capabilities, language models (LMs) still struggle with simple syntactic tasks such as generating balanced parentheses. In this study, we investigate why these errors persist across LMs of varying sizes (124M-7B), with the goal of both understanding and mitigating them. Our study reveals that LMs rely on a number of components (attention heads and FF neurons) that each make their own predictions independently. While some components reliably promote correct answers across a broad range of inputs (i.e., implementing "sound mechanisms"), others are less reliable and introduce noise by promoting incorrect tokens (i.e., implementing "faulty mechanisms"). Errors occur when the faulty mechanisms overshadow the sound ones and dominate the prediction. Motivated by this insight, we introduce RASteer, a steering method that systematically identifies reliable components and increases their contribution to improve model performance. RASteer substantially improves performance on balanced parentheses tasks, boosting the accuracy of some models from $0$% to around $100$% without impairing the models' general coding ability. We further demonstrate its broader applicability on arithmetic reasoning tasks, achieving performance gains of up to around $20$%.
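The interplay between sound and faulty components can be illustrated with a toy sketch. This is not the RASteer implementation: the two hand-written components, the reliability score (the fraction of examples on which a component's top token is correct), the scaling rule `1 + ALPHA * score`, and the value of `ALPHA` are all assumptions made for illustration. The toy also scores and evaluates on the same four examples.

```python
# Toy illustration only. The components, the numbers, the reliability score,
# and the scaling rule are hypothetical, not the paper's method.

def top_token(contrib):
    """Index of the token a contribution vector promotes most."""
    return max(range(len(contrib)), key=lambda t: contrib[t])

def predict(contribs, weights):
    """Sum the weighted per-component contributions and pick the top token."""
    n_tokens = len(contribs[0])
    logits = [sum(w * c[t] for w, c in zip(weights, contribs))
              for t in range(n_tokens)]
    return top_token(logits)

def accuracy(examples, weights):
    hits = sum(predict(contribs, weights) == correct
               for contribs, correct in examples)
    return hits / len(examples)

def reliability_scores(examples, n_components):
    """Per component: fraction of examples where its own top token is correct."""
    scores = []
    for i in range(n_components):
        hits = sum(top_token(contribs[i]) == correct
                   for contribs, correct in examples)
        scores.append(hits / len(examples))
    return scores

# Each example: ([contribution of component A, contribution of component B],
# index of the correct token). A is weak but always right ("sound");
# B is strong but usually wrong ("faulty").
EXAMPLES = [
    ([[1.0, 0.0], [0.0, 3.0]], 0),
    ([[0.0, 1.0], [3.0, 0.0]], 1),
    ([[1.0, 0.0], [2.0, 0.0]], 0),
    ([[0.0, 1.0], [3.0, 0.0]], 1),
]
ALPHA = 10.0  # hypothetical steering strength

scores = reliability_scores(EXAMPLES, n_components=2)
baseline_acc = accuracy(EXAMPLES, [1.0, 1.0])       # unsteered: B dominates
steer_weights = [1.0 + ALPHA * s for s in scores]   # boost reliable components
steered_acc = accuracy(EXAMPLES, steer_weights)
print(scores, baseline_acc, steered_acc)
```

In this toy, the unsteered sum is dominated by the strong but unreliable component B, so most predictions are wrong. Scaling each component by its reliability lets the weaker but consistent component A decide the outcome.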