The Transformer architecture has achieved great success across multiple applied machine learning communities, including natural language processing (NLP), computer vision (CV), and information retrieval (IR). Its core mechanism, attention, requires $O(n^2)$ time complexity in training and $O(n)$ time complexity in inference, where $n$ is the sequence length. Many works have been proposed to improve the scalability of attention, such as Flash Attention and Multi-query Attention. A different line of work aims to design new mechanisms that replace attention. Recently, Mamba, a notable model structure based on state space models, has achieved Transformer-equivalent performance in multiple sequence modeling tasks. In this work, we examine Mamba's efficacy through the lens of a classical IR task: document ranking. A reranker model takes a query and a document as input and predicts a scalar relevance score. This task demands that the language model comprehend lengthy contextual inputs and capture the interaction between query and document tokens. We find that (1) Mamba models achieve competitive performance compared to Transformer-based models trained with the same recipe, but (2) they have lower training throughput than efficient Transformer implementations such as Flash Attention. We hope this study can serve as a starting point for exploring Mamba models in other classical IR tasks. Our code implementation and trained checkpoints are made public to facilitate reproducibility (https://github.com/zhichaoxu-shufe/RankMamba).
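As a rough illustration of the reranking interface described above (a query and a document go in, a scalar relevance score comes out, and candidates are sorted by that score), a minimal sketch might look like the following. The names `overlap_score` and `rerank` are hypothetical, and the token-overlap scorer is only a toy stand-in for a fine-tuned language model; it is not the scoring method used in this work.

```python
from typing import Callable, List, Tuple


def overlap_score(query: str, document: str) -> float:
    """Toy stand-in for a learned reranker (assumption, not the paper's model).

    Returns the fraction of query tokens that also appear in the document.
    """
    q_tokens = query.lower().split()
    d_tokens = set(document.lower().split())
    if not q_tokens:
        return 0.0
    return sum(t in d_tokens for t in q_tokens) / len(q_tokens)


def rerank(
    query: str,
    documents: List[str],
    score_fn: Callable[[str, str], float] = overlap_score,
) -> List[Tuple[str, float]]:
    """Score each (query, document) pair and sort by descending relevance."""
    scored = [(doc, score_fn(query, doc)) for doc in documents]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    candidates = [
        "transformers use attention",
        "mamba is based on state space models",
    ]
    for doc, score in rerank("state space models", candidates):
        print(f"{score:.2f}  {doc}")
```

In a real system, `score_fn` would wrap a trained model that reads the concatenated query and document tokens; the sorting step stays the same regardless of which backbone produces the score.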