The Transformer architecture has achieved great success in multiple applied machine learning communities, such as natural language processing (NLP), computer vision (CV), and information retrieval (IR). Attention, the core mechanism of the Transformer architecture, requires $O(n^2)$ time complexity in training and $O(n)$ time complexity per step in inference. Many works have been proposed to improve the scalability of the attention mechanism, such as FlashAttention and multi-query attention. A different line of work aims to design new mechanisms to replace attention altogether. Recently, Mamba, a notable model architecture based on state space models, has achieved Transformer-equivalent performance on multiple sequence modeling tasks. In this work, we examine Mamba's efficacy through the lens of a classical IR task: document ranking. A reranker model takes a query and a document as input and predicts a scalar relevance score. This task demands that the language model comprehend lengthy contextual inputs and capture the interaction between query and document tokens. We find that \textbf{(1) Mamba models achieve competitive performance compared to transformer-based models trained with the same recipe; (2) however, they have lower training throughput than efficient transformer implementations such as FlashAttention.} We hope this study can serve as a starting point for exploring Mamba models in other classical IR tasks. Our \href{https://github.com/zhichaoxu-shufe/RankMamba}{code implementation} is made public to facilitate reproducibility. Refer to~\cite{xu-etal-2025-state} for more comprehensive experiments and results, including passage ranking.
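To make the reranking setup concrete, the following is a minimal sketch of how a cross-encoder reranker scores a (query, document) pair with a Hugging Face sequence-classification head; the checkpoint name below is an illustrative placeholder, not the model trained in this work, and the actual training recipe is described in the paper and repository.

\begin{verbatim}
# Minimal sketch of cross-encoder reranking: the model reads the
# concatenated query and document and outputs one scalar relevance score.
# The checkpoint name is a placeholder for illustration only.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def rerank_score(query: str, document: str) -> float:
    """Return a scalar relevance score for a (query, document) pair."""
    inputs = tokenizer(query, document, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape (1, 1): single score head
    return logits.squeeze().item()

print(rerank_score("what is mamba",
                   "Mamba is a sequence model based on state spaces."))
\end{verbatim}

In practice, candidate documents retrieved by a first-stage ranker are scored with such a function and reordered by descending score; the same interface applies whether the backbone is a Transformer or a state space model such as Mamba.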