We present a low-cost retrieval system for the WSDM Cup 2026 multilingual retrieval task, in which English queries are used to retrieve relevant documents from a collection of approximately ten million news articles in Chinese, Persian, and Russian, and the system must output the top-1000 ranked results for each query. Our four-stage pipeline consists of (1) LLM-based GRF-style query expansion, (2) BM25 candidate retrieval, (3) dense ranking using long-text representations from jina-embeddings-v4, and (4) pointwise re-ranking of the top-20 candidates with Qwen3-Reranker-4B, with the dense order preserved for the remaining results. On the official evaluation, the system achieves an nDCG@20 of 0.403 and a Judged@20 of 0.95. We further conduct extensive ablation experiments to quantify the contribution of each stage and to analyze the effectiveness of query expansion, dense ranking, and top-$k$ re-ranking under limited compute budgets.
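The final stage, re-ranking only the top-$k$ candidates while keeping the dense order for the rest, can be illustrated with a minimal sketch. This is not the system's actual code: the function name `rerank_top_k`, the `score_fn` callback (a stand-in for a pointwise reranker such as Qwen3-Reranker-4B), and the toy scores are assumptions made for illustration only.

```python
from typing import Callable, List, Sequence


def rerank_top_k(
    query: str,
    dense_ranked: Sequence[str],
    score_fn: Callable[[str, str], float],
    k: int = 20,
    depth: int = 1000,
) -> List[str]:
    """Re-rank the first k dense results; keep the dense order for the rest.

    dense_ranked: document ids already sorted by dense similarity (best first).
    score_fn:     hypothetical pointwise scorer, called once per (query, doc id).
    """
    head = list(dense_ranked[:k])
    tail = list(dense_ranked[k:depth])

    # Pointwise scoring: each candidate is scored independently of the others.
    scores = [score_fn(query, doc_id) for doc_id in head]

    # sorted() is stable, so ties keep their original dense order.
    order = sorted(range(len(head)), key=lambda i: -scores[i])
    return [head[i] for i in order] + tail


# Toy usage with made-up reranker scores (illustrative values only).
toy_scores = {"d1": 0.1, "d2": 0.9, "d3": 0.5}
dense_run = ["d1", "d2", "d3", "d4", "d5"]
final_run = rerank_top_k(
    "example query", dense_run, lambda q, d: toy_scores.get(d, 0.0), k=3
)
print(final_run)
```

Because the reranker is called only `k` times per query, its cost is independent of the output depth, which is what makes this design attractive under a limited compute budget.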