Understanding the inner workings of machine learning models such as Transformers is vital for their safe and ethical use. This paper presents an in-depth analysis of a one-layer Transformer model trained to perform n-digit integer addition. We show that the model divides the task into parallel, digit-specific streams and employs distinct algorithms for different digit positions. We also find that the model starts its calculations late but executes them rapidly. We identify and explain a rare use case with high loss. Overall, we explain the model's algorithm in detail. These findings are validated through rigorous testing and mathematical modeling, and contribute to the broader body of work in mechanistic interpretability, AI safety, and alignment. Our approach opens the door to analyzing more complex tasks and multi-layer Transformer models.
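To make the idea of digit-specific streams concrete, the following is a minimal sketch of how n-digit addition can be decomposed into per-position quantities that are computable in parallel, followed by a carry step that combines them. It is the textbook decomposition, not the algorithm the trained model was found to implement; the function name `digitwise_add` and the helper names `base`, `makes_carry`, and `passes_carry` are hypothetical and chosen only for this illustration.

```python
def digitwise_add(a: int, b: int, n_digits: int) -> list[int]:
    """Return the n_digits + 1 answer digits of a + b, most significant first.

    Illustrative only: a textbook per-digit decomposition, not the model's
    learned algorithm. Assumes 0 <= a, b < 10 ** n_digits.
    """
    # Split both operands into digits, least significant first.
    a_digits = [(a // 10**i) % 10 for i in range(n_digits)]
    b_digits = [(b // 10**i) % 10 for i in range(n_digits)]
    pairs = list(zip(a_digits, b_digits))

    # Per-position quantities. Each depends only on one digit pair,
    # so all positions can be computed independently of each other.
    base = [(x + y) % 10 for x, y in pairs]        # digit sum, ignoring carry-in
    makes_carry = [x + y >= 10 for x, y in pairs]  # carries out regardless of carry-in
    passes_carry = [x + y == 9 for x, y in pairs]  # carries out only if carried into

    # Combine the per-position quantities, lowest digit first.
    answer = []
    carry_in = False
    for i in range(n_digits):
        answer.append((base[i] + carry_in) % 10)
        carry_in = makes_carry[i] or (passes_carry[i] and carry_in)
    answer.append(int(carry_in))  # leading digit of the answer is the final carry

    return answer[::-1]
```

For example, `digitwise_add(445, 555, 3)` returns `[1, 0, 0, 0]`: the lowest position creates a carry, and the two higher positions each sum to 9 and pass it along.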