A substantial gap persists in our understanding of why the Transformer architecture performs so well in NLP. One particularly underexplored area is the mechanistic description of how the distribution of model parameters evolves during training. In this work we argue that studying the time evolution of the statistical distribution of model parameters, and in particular bifurcation effects in that distribution, can help explain model quality, potentially reducing training costs and evaluation effort, and offers empirical insight into why weight sparsification is effective.
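As a minimal sketch of the kind of monitoring this analysis relies on, the snippet below records histograms of the weight matrices of a model at regular training steps; the names `model`, `train_step`, `loader`, and `snapshot_every` are hypothetical placeholders, and the bin count and value range are illustrative choices, not the settings used in this work.

```python
import torch

def snapshot_weight_histograms(model, num_bins=101, value_range=(-0.5, 0.5)):
    """Return a dict mapping parameter names to histograms of their values.

    num_bins and value_range are illustrative; in practice they should be
    adapted to the scale of the weights being tracked.
    """
    histograms = {}
    with torch.no_grad():
        for name, param in model.named_parameters():
            if param.dim() >= 2:  # track weight matrices, skip biases and norms
                histograms[name] = torch.histc(
                    param.detach().float().cpu(),
                    bins=num_bins,
                    min=value_range[0],
                    max=value_range[1],
                )
    return histograms

# Hypothetical usage inside a training loop: collect a snapshot every
# `snapshot_every` optimisation steps so the time evolution of the
# parameter distribution (e.g. a unimodal histogram splitting into two
# modes, i.e. a bifurcation) can be inspected after training.
# snapshots = []
# for step, batch in enumerate(loader):
#     train_step(model, batch)  # assumed training step
#     if step % snapshot_every == 0:
#         snapshots.append((step, snapshot_weight_histograms(model)))
```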