The Transformer architecture is central to many AI models, but it still struggles with long-range language modeling. Although several Transformer variants have been designed to handle long-range dependencies, existing methods such as Transformer-XL suffer from a high percentage of ineffective memories. In this study, we present a plug-and-play strategy, TRAining-free Memory Selection (TRAMS), that selects the tokens participating in the attention calculation based on one simple metric. This strategy keeps the tokens that are likely to have a high attention score with the current queries and ignores the rest. We test our approach on a word-level benchmark (WikiText-103) and a character-level benchmark (enwik8), and the results show an improvement without additional training or additional parameters.
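The selection step described above can be illustrated with a minimal sketch. The abstract does not specify the metric, so the key-vector L2 norm used here is only a placeholder assumption, and the names `select_memory` and `attention_weights` are hypothetical helpers, not part of any released TRAMS code. The sketch scores cached memory keys without looking at the query, keeps the top `m`, and computes attention over the survivors only.

```python
import math


def select_memory(keys, m, score=None):
    """Return indices of the m highest-scoring memory keys, in original order."""
    if score is None:
        # Placeholder metric (an assumption, not the paper's): key L2 norm.
        score = lambda k: math.sqrt(sum(x * x for x in k))
    ranked = sorted(range(len(keys)), key=lambda i: score(keys[i]), reverse=True)
    return sorted(ranked[:m])


def attention_weights(query, keys):
    """Scaled dot-product softmax weights of one query over the given keys."""
    d = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy cached memory of four 2-dimensional keys; keep only two of them.
memory_keys = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2], [1.5, -1.0]]
kept = select_memory(memory_keys, m=2)
weights = attention_weights([1.0, 0.5], [memory_keys[i] for i in kept])
```

Because the selection needs no gradient step and adds no parameters, a filter of this shape could in principle be dropped in front of an existing memory-augmented attention layer; whether a given metric actually tracks the attention scores is the empirical question the study addresses.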