Large Language Models (LLMs) have ushered in a new wave of artificial intelligence advancements impacting every scientific field and discipline. They are trained on a simple objective: to predict the next token given the previous context. Much of the data around us, e.g., text, audio, and music, has an inherent multi-scale structure. This paper infuses LLMs with traditional signal processing ideas, namely wavelets, during pre-training to take advantage of that structure. Without adding \textbf{any extra parameters} to a GPT-style LLM architecture, we achieve the same pre-training performance almost twice as fast on text, raw audio, and symbolic music. This is achieved by imposing a multi-scale structure on the intermediate embeddings. When trained for the same number of steps, we achieve significant gains in performance, comparable to pre-training a larger neural architecture. Our architecture gives every next-token prediction access to intermediate embeddings at multiple temporal resolutions in every Transformer decoder block. We hope this work paves the way for incorporating multi-rate signal processing ideas into traditional LLM pre-training. Further, we showcase pushing model performance by improving internal structure instead of simply scaling up.
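To make the core idea concrete, below is a minimal sketch of one way multi-resolution views of intermediate embeddings can be formed: a causal, Haar-style moving average over the token axis, with window length doubling at each scale so no future token leaks into any view. The function name, the number of scales, and the dyadic window scheme are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def causal_multiscale_views(E, num_scales=3):
    """Sketch: causal Haar-style approximation coefficients at several
    temporal resolutions (illustrative, not the paper's exact scheme).

    E: (T, d) array of token embeddings for a sequence of length T.
    Returns a list of num_scales arrays, each (T, d); scale s averages
    the last 2**s positions (clipped at the sequence start), so every
    view at position t depends only on tokens <= t.
    """
    T, d = E.shape
    csum = np.cumsum(E, axis=0)  # running sums along the time axis
    views = []
    for s in range(num_scales):
        w = 2 ** s  # causal window length at this scale
        out = np.empty_like(E)
        for t in range(T):
            lo = max(0, t - w + 1)
            # average of positions lo..t (window of length <= w)
            total = csum[t] - (csum[lo - 1] if lo > 0 else 0)
            out[t] = total / (t - lo + 1)
        views.append(out)
    return views
```

Scale 0 reproduces the original embeddings, while higher scales expose progressively smoother, coarser temporal summaries; in the architecture described above, such views would be made available to the next-token prediction inside each decoder block without adding parameters.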