Effective medical text retrieval requires both high accuracy and low latency. While LLM-based embedding models possess powerful retrieval capabilities, their prohibitive latency and high computational cost limit their application in real-time scenarios. Furthermore, the lack of comprehensive, high-fidelity benchmarks hinders progress in Chinese medical text retrieval. In this work, we introduce the Chinese Medical Text Embedding Benchmark (CMedTEB), a benchmark spanning three practical embedding tasks: retrieval, reranking, and semantic textual similarity (STS). Unlike purely automated datasets, CMedTEB is curated via a rigorous multi-LLM voting pipeline validated by clinical experts, ensuring gold-standard label quality while mitigating annotation noise. On this foundation, we propose the Chinese Medical Asymmetric REtriever (CARE), an asymmetric architecture that pairs a lightweight BERT-style encoder for online query encoding with a powerful LLM-based encoder for offline document encoding. However, optimizing an asymmetric retriever whose two encoders are structurally different presents distinctive challenges. To address these challenges, we introduce a novel two-stage training strategy that progressively bridges the query and document representations. Extensive experiments demonstrate that CARE surpasses state-of-the-art symmetric models on CMedTEB, achieving superior retrieval performance without increasing inference latency.
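The serving pattern behind an asymmetric retriever can be illustrated with a minimal sketch. It is not CARE's implementation: `encode_document_heavy` and `encode_query_light` are hypothetical stand-ins built from toy n-gram counts rather than neural encoders, and the corpus, the function names, and the `top_k` parameter are assumptions made for illustration. The sketch shows only the division of labor described above: documents are embedded once offline by the expensive encoder, queries are embedded online by the cheap one, and both land in a shared space where a dot product ranks candidates.

```python
import math
from collections import Counter


def _normalise(counts):
    """Scale a sparse vector (dict: feature -> weight) to unit L2 norm."""
    norm = math.sqrt(sum(w * w for w in counts.values())) or 1.0
    return {feat: w / norm for feat, w in counts.items()}


def encode_document_heavy(text):
    """Hypothetical stand-in for the LLM-based document encoder (offline).

    Toy features: word unigrams plus word bigrams.
    """
    words = text.lower().split()
    feats = Counter(words)
    feats.update(" ".join(pair) for pair in zip(words, words[1:]))
    return _normalise(feats)


def encode_query_light(text):
    """Hypothetical stand-in for the lightweight query encoder (online).

    Toy features: word unigrams only, in the same feature space as documents.
    """
    return _normalise(Counter(text.lower().split()))


def build_index(corpus):
    """Offline step: embed every document once with the heavy encoder."""
    return {doc_id: encode_document_heavy(text) for doc_id, text in corpus.items()}


def search(index, query, top_k=2):
    """Online step: only the light encoder runs; scoring is a sparse dot product."""
    q = encode_query_light(query)
    scores = {
        doc_id: sum(w * d.get(feat, 0.0) for feat, w in q.items())
        for doc_id, d in index.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Illustrative corpus (invented for this sketch, not taken from CMedTEB).
corpus = {
    "d1": "hypertension symptoms include headache and dizziness",
    "d2": "diabetes patients should monitor blood glucose",
    "d3": "fractures require prompt immobilisation",
}
index = build_index(corpus)
results = search(index, "common symptoms of hypertension")
print(results)
```

In this toy setup the query shares two unigrams with `d1` and none with the other documents, so `d1` ranks first. Because the heavy encoder never runs at query time, online cost depends only on the light encoder and the scoring step, which is the property the asymmetric design relies on. Making two structurally different neural encoders agree on a shared space is the hard part, and the sketch sidesteps it by construction.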