Many machine learning models manipulate dimensions to help them identify and learn important features in data. For sequential data, this manipulation usually happens at the level of the token dimension. Although many tasks require a change in the sequence length itself, sequence length reduction usually happens out of necessity and in a single step. As far as we are aware, no model uses the sequence length reduction step as an additional opportunity to tune the model's performance. In fact, sequence length manipulation as a whole seems to be an overlooked direction. In this study, we introduce a novel attention-based method that allows for the direct manipulation of sequence lengths. To explore the method's capabilities, we employ it in an autoencoder model. The autoencoder reduces the input sequence to a shorter sequence in latent space and then aims to reproduce the original sequence from this reduced form. In this setting, we explore the method's reduction performance for different input and latent sequence lengths. We show that the autoencoder retains all the significant information when the original sequence is reduced to half its original length. When the sequence is reduced to as little as a quarter of its original length, the autoencoder is still able to reproduce the original sequence with an accuracy of around 90%.
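
To make the idea of changing sequence length with attention concrete, the following is a minimal sketch of one generic way to do it, not necessarily the mechanism proposed in this study: a set of query vectors attends over the input tokens, so the output has as many positions as there are queries. Everything here is an assumption made for illustration: the function name `resize_sequence`, the use of the tokens themselves as keys and values (no projection matrices), and the random stand-ins for what would be learned queries in a trained model. Because nothing is trained, the sketch only demonstrates the change in length and says nothing about reconstruction accuracy.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def resize_sequence(tokens, queries):
    """Scaled dot-product attention with one output position per query.

    Hypothetical helper: keys and values are the input tokens themselves,
    so the output length equals len(queries), whatever len(tokens) is.
    """
    d_model = len(tokens[0])
    scale = 1.0 / math.sqrt(d_model)
    output = []
    for q in queries:
        # Attention weights of this query over all input positions.
        weights = softmax([dot(q, t) * scale for t in tokens])
        # Each output vector is a weighted average of the input tokens.
        output.append(
            [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(d_model)]
        )
    return output


def random_vectors(count, dim):
    """Random stand-ins for token embeddings or learned queries."""
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


random.seed(0)
d_model, input_len, latent_len = 4, 8, 4

tokens = random_vectors(input_len, d_model)
reduce_queries = random_vectors(latent_len, d_model)   # encoder side (assumed)
restore_queries = random_vectors(input_len, d_model)   # decoder side (assumed)

latent = resize_sequence(tokens, reduce_queries)       # 8 positions -> 4
restored = resize_sequence(latent, restore_queries)    # 4 positions -> 8

print(len(tokens), len(latent), len(restored))
```

In this sketch the input of length 8 is mapped to a latent sequence of length 4 and back to length 8, mirroring the reduce-then-reproduce structure of the autoencoder setting described above. The target length is set purely by the number of queries, which is what makes the sequence length a free design parameter.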