We provide concrete evidence for memory management in a 4-layer transformer. Specifically, we identify clean-up behavior, in which model components consistently remove the output of preceding components during a forward pass. Our findings suggest that the interpretability technique Direct Logit Attribution provides misleading results. We show explicit examples where this technique is inaccurate because it does not account for clean-up behavior.
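To illustrate why clean-up behavior can mislead Direct Logit Attribution, the following is a minimal toy sketch, not the model or code studied here. It assumes a 3-dimensional residual stream, ignores the final LayerNorm, and uses made-up vectors; the names `early_head`, `cleanup_head`, and `late_mlp` are hypothetical. Direct Logit Attribution is taken to be the dot product of each component's output with the unembedding direction of a target token.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

# Hypothetical unembedding direction for one target token (illustrative values).
unembed_direction = [1.0, 0.0, 2.0]

# Hypothetical component outputs written to the residual stream.
components = {
    "early_head": [3.0, 1.0, 0.0],      # writes early in the forward pass
    "cleanup_head": [-3.0, -1.0, 0.0],  # later component that erases that write
    "late_mlp": [0.0, 0.0, 1.5],        # the only write that survives
}

# Direct Logit Attribution: project each component's output onto the unembedding.
dla = {name: dot(out, unembed_direction) for name, out in components.items()}

# The final residual stream is the sum of all component outputs (no LayerNorm here).
final_residual = [sum(vals) for vals in zip(*components.values())]
final_logit = dot(final_residual, unembed_direction)

for name, score in dla.items():
    print(f"{name}: {score:+.1f}")
print(f"final logit: {final_logit:+.1f}")
```

In this toy setting the attributions sum to the final logit, yet `early_head` is credited with as much as `late_mlp` even though its write is fully erased before the unembedding, and `cleanup_head` appears to suppress the token although it only removes an earlier output. Reading either score in isolation would misstate the component's net effect.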