Long sequences occur in abundance in real-world scenarios, so modelling them properly unlocks numerous downstream use cases. Deep neural networks, however, have often struggled with long sequences for a variety of reasons. Recent advances, in both systems engineering and model design, have enabled the scaling up of models that purport to support extended context lengths. In particular, the state-space and linear recurrent neural network families of models can hypothetically extend to infinite sequence length. However, is this too good to be true? We conduct an evaluation showing that while such claims may be sound in theory, large gaps remain in practice that are empirically observable. In particular, recurrent models still suffer in the same settings as long-context LLMs with attention. We further show that different inductive biases exhibit inconsistent extrapolation capabilities, highlighting the need to study such paradigms further and to investigate why long-context models seemingly fail to behave as one might expect.