Understanding the inner workings of machine learning models such as Transformers is vital for their safe and ethical use. This paper presents an in-depth analysis of a one-layer Transformer model trained to perform n-digit integer addition. We show that the model divides the task into parallel, digit-specific streams and applies distinct algorithms to different digit positions. We also find that the model starts its calculations late but executes them rapidly. We identify and explain a rare use case with high loss, and we describe the model's overall algorithm in detail. These findings are validated through rigorous testing and mathematical modeling, and they contribute to the broader body of work in Mechanistic Interpretability, AI safety, and alignment. Our approach opens the door to analyzing more complex tasks and multi-layer Transformer models.