In recent years, advances in pre-trained language models (PLMs) have drawn considerable research attention to dense passage retrieval, which aims to retrieve relevant passages from a massive corpus for a given question, and have yielded promising performance on the task. However, most existing datasets benchmark models on factoid queries about general commonsense knowledge, while specialised fields such as finance and economics remain unexplored owing to the lack of large-scale, high-quality datasets with expert annotations. In this work, we propose a new task, policy retrieval, and introduce the Chinese Stock Policy Retrieval Dataset (CSPRD), which provides 700+ prospectus passages labeled by experienced experts with relevant articles drawn from the 10k+ entries in our collected Chinese policy corpus. Experiments with lexical, embedding and fine-tuned bi-encoder models show the effectiveness of the proposed CSPRD, yet also suggest ample room for improvement. Our best-performing baseline achieves 56.1% MRR@10, 28.5% NDCG@10, 37.5% Recall@10 and 80.6% Precision@10 on the dev set.
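The four reported metrics can be illustrated with a minimal sketch. It assumes binary relevance (a policy article is either labeled relevant to a prospectus passage or not) and the common textbook definitions of each metric; the abstract does not spell out the exact formulas, so the definitions used in the paper's evaluation may differ. The function names and the article IDs in the usage lines are hypothetical.

```python
import math
from typing import Sequence, Set


def mrr_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Reciprocal rank of the first relevant article in the top k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of all relevant articles that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the k result slots filled by relevant articles (assumed: divide by k)."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Binary-gain NDCG: DCG of the ranking divided by DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: one query, a ranked list of article IDs, and its gold labels.
ranked = ["a3", "a7", "a1", "a9"]
relevant = {"a7", "a9", "a5"}
scores = {
    "MRR@10": mrr_at_k(ranked, relevant),
    "NDCG@10": ndcg_at_k(ranked, relevant),
    "Recall@10": recall_at_k(ranked, relevant),
    "Precision@10": precision_at_k(ranked, relevant),
}
```

Dataset-level scores would then be obtained by averaging each per-query value over all queries in the dev set, which is the usual convention and is assumed here rather than taken from the source.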