Transformers achieve unrivalled performance in modelling language, but remain inefficient in terms of memory and time complexity. A possible remedy is to reduce the sequence length in the intermediate layers by pooling fixed-length segments of tokens. Nevertheless, natural units of meaning, such as words or phrases, display varying sizes. To address this mismatch, we equip language models with a dynamic-pooling mechanism, which predicts segment boundaries in an autoregressive fashion. We compare several methods to infer boundaries, including end-to-end learning through stochastic re-parameterisation, supervised learning (based on segmentations from subword tokenizers or spikes in conditional entropy), as well as linguistically motivated boundaries. We perform character-level evaluation on texts from multiple datasets and morphologically diverse languages. The results demonstrate that dynamic pooling, which jointly segments and models language, is both faster and more accurate than vanilla Transformers and fixed-length pooling within the same computational budget.
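To make the contrast between fixed-length and dynamic pooling concrete, the following is a minimal sketch of the pooling step only, not the authors' implementation. It assumes that a boundary indicator of 1 at position i marks the last token of a segment and that each segment is summarised by mean-pooling; both conventions, and the names `pool_segments` and `fixed_boundaries`, are hypothetical choices made for illustration. The boundary predictor itself (learned end-to-end, supervised, or linguistically motivated) is not modelled here; the boundaries are simply given as input.

```python
from typing import List, Sequence


def _mean(rows: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def pool_segments(
    vectors: Sequence[Sequence[float]],
    boundaries: Sequence[int],
) -> List[List[float]]:
    """Mean-pool consecutive token vectors into one vector per segment.

    Assumption (illustrative): boundaries[i] == 1 means position i is the
    last token of its segment. A trailing unfinished segment is also pooled.
    """
    if len(vectors) != len(boundaries):
        raise ValueError("vectors and boundaries must have the same length")
    pooled: List[List[float]] = []
    current: List[Sequence[float]] = []
    for vec, is_boundary in zip(vectors, boundaries):
        current.append(vec)
        if is_boundary == 1:
            pooled.append(_mean(current))
            current = []
    if current:  # leftover tokens after the final boundary
        pooled.append(_mean(current))
    return pooled


def fixed_boundaries(length: int, segment_size: int) -> List[int]:
    """Boundaries for fixed-length pooling: one every `segment_size` tokens."""
    return [1 if (i + 1) % segment_size == 0 else 0 for i in range(length)]


# Five toy 2-dimensional token vectors.
tokens = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0], [7.0, 7.0], [9.0, 9.0]]

# Fixed-length pooling with segments of 2 tokens: 3 pooled vectors.
fixed = pool_segments(tokens, fixed_boundaries(len(tokens), 2))

# Variable-length segments (sizes 2 and 3): 2 pooled vectors.
dynamic = pool_segments(tokens, [0, 1, 0, 0, 1])
```

In this toy case the fixed scheme shortens five tokens to three vectors regardless of content, while the variable-length boundaries shorten them to two; in the setting described above, those boundaries would come from a predictor rather than being hand-specified.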