We introduce BM25S, an efficient Python-based implementation of BM25 that depends only on NumPy and SciPy. BM25S achieves up to a 500x speedup compared to the most popular Python-based framework by eagerly computing BM25 scores during indexing and storing them in sparse matrices. It also achieves considerable speedups compared to highly optimized Java-based implementations, which are used by popular commercial products. Finally, BM25S reproduces the exact implementation of five BM25 variants based on Kamphuis et al. (2020) by extending eager scoring to non-sparse variants using a novel score shifting method. The code can be found at https://github.com/xhluca/bm25s
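To make the idea of eager scoring concrete, the following is a minimal sketch, not the BM25S library's actual API or code. It assumes a pre-tokenized corpus and a Lucene-style BM25 formula (an assumption chosen because a term absent from a document scores zero, so the score matrix stays sparse). The names `build_index` and `search`, and the defaults `k1=1.5` and `b=0.75`, are hypothetical and used only for illustration.

```python
import numpy as np
from scipy.sparse import csc_matrix


def build_index(corpus_tokens, k1=1.5, b=0.75):
    """Eagerly compute a BM25 score for every (document, term) pair.

    Hypothetical helper: assumes a Lucene-style BM25 variant.
    """
    # Map each distinct token to a column index.
    vocab = {}
    for doc in corpus_tokens:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))

    n_docs = len(corpus_tokens)
    doc_lens = np.array([len(d) for d in corpus_tokens], dtype=np.float64)
    avg_len = doc_lens.mean()

    # Collect term frequencies in coordinate form: (doc, term, tf).
    rows, cols, tfs = [], [], []
    for doc_id, doc in enumerate(corpus_tokens):
        counts = {}
        for tok in doc:
            counts[vocab[tok]] = counts.get(vocab[tok], 0) + 1
        for term_id, tf in counts.items():
            rows.append(doc_id)
            cols.append(term_id)
            tfs.append(tf)
    rows = np.array(rows, dtype=np.int64)
    cols = np.array(cols, dtype=np.int64)
    tfs = np.array(tfs, dtype=np.float64)

    # Document frequency and IDF per term.
    df = np.bincount(cols, minlength=len(vocab))
    idf = np.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))

    # Full BM25 score per non-zero entry, computed once at indexing time.
    norm = k1 * (1.0 - b + b * doc_lens[rows] / avg_len)
    scores = idf[cols] * tfs / (tfs + norm)

    # Column-oriented storage makes selecting query-term columns cheap.
    matrix = csc_matrix((scores, (rows, cols)), shape=(n_docs, len(vocab)))
    return matrix, vocab


def search(matrix, vocab, query_tokens, k=3):
    """Score a query by summing the precomputed columns of its terms."""
    term_ids = [vocab[t] for t in query_tokens if t in vocab]
    if not term_ids:
        return np.array([], dtype=np.int64), np.array([], dtype=np.float64)
    scores = np.asarray(matrix[:, term_ids].sum(axis=1)).ravel()
    top = np.argsort(-scores)[:k]
    return top, scores[top]


corpus = [
    ["a", "cat", "sat"],
    ["a", "dog", "barked"],
    ["the", "cat", "and", "the", "cat"],
]
matrix, vocab = build_index(corpus)
doc_ids, doc_scores = search(matrix, vocab, ["cat"], k=2)
```

Because all per-term scores are computed during indexing, a query in this sketch reduces to selecting a few sparse columns and summing them. Variants that assign a non-zero score to absent terms would not fit this sparse layout directly, which is the case the score shifting method mentioned above addresses.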