Large language models (LLMs) face a daunting challenge due to the excessive computational and memory requirements of the commonly used Transformer architecture. While state space models (SSMs) are a newer type of foundational network architecture offering lower computational complexity, their performance has yet to fully rival that of Transformers. This paper introduces DenseSSM, a novel approach to enhance the flow of hidden information between layers in SSMs. By selectively integrating shallow-layer hidden states into deeper layers, DenseSSM retains fine-grained information crucial for the final output. Despite the added dense connections, DenseSSM maintains training parallelizability and inference efficiency. The proposed method is widely applicable to various SSM types such as RetNet and Mamba. At similar model sizes, DenseSSM achieves significant improvements; for example, DenseRetNet outperforms the original RetNet by up to 5% accuracy on public benchmarks. Code is available at https://github.com/WailordHe/DenseSSM
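The core idea above, injecting shallow-layer hidden states into a deeper layer, can be illustrated with a minimal NumPy sketch. The additive linear-projection fusion and all names here are illustrative assumptions, not the paper's exact selective mechanism:

```python
import numpy as np

def dense_fuse(h_current, shallow_states, proj_weights):
    """Fuse shallow-layer hidden states into the current layer's state.

    h_current:      (seq, dim) hidden state of the current (deep) layer
    shallow_states: list of (seq, dim) hidden states from shallower layers
    proj_weights:   list of (dim, dim) projection matrices, one per shallow state

    The projection-and-sum rule is an assumption for illustration only.
    """
    fused = h_current.copy()
    for W, h in zip(proj_weights, shallow_states):
        fused += h @ W  # project each shallow state and add it to the deep state
    return fused

# toy example: two shallow layers feeding into one deeper layer
rng = np.random.default_rng(0)
dim = 4
h = rng.standard_normal((3, dim))
shallow = [rng.standard_normal((3, dim)) for _ in range(2)]
weights = [np.eye(dim) for _ in range(2)]  # identity projections for clarity
out = dense_fuse(h, shallow, weights)
```

With identity projections the fused state is simply the sum of the current and shallow hidden states; in practice the projections would be learned, letting the deep layer select which fine-grained shallow information to retain.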