In legal information retrieval, effective embedding-based models are essential for accurate question-answering systems. However, the scarcity of large annotated datasets poses a significant challenge, particularly for Vietnamese legal texts. To address this issue, we propose a novel approach that leverages large language models to generate high-quality, diverse synthetic queries for Vietnamese legal passages. This synthetic data is then used to pre-train retrieval models, specifically a bi-encoder and ColBERT, which are further fine-tuned using a contrastive loss with mined hard negatives. Our experiments demonstrate that these enhancements lead to strong improvements in retrieval accuracy, validating the effectiveness of synthetic data and pre-training techniques in overcoming the limitations imposed by the lack of large labeled datasets in the Vietnamese legal domain.
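The fine-tuning objective mentioned above — contrastive loss with mined hard negatives — can be sketched as an InfoNCE-style loss over cosine similarities between a query embedding, its positive passage, and a set of hard-negative passages. This is a minimal illustrative sketch, not the paper's implementation; the function names, the temperature value, and the use of plain Python lists in place of real model embeddings are all assumptions for clarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors (plain lists here)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(query, positive, hard_negatives, temperature=0.05):
    """InfoNCE-style contrastive loss: the positive passage competes
    against mined hard negatives in a softmax over scaled similarities.
    (Temperature 0.05 is an illustrative choice, not from the paper.)"""
    # Logit 0 is the positive pair; the rest are hard negatives.
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in hard_negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-likelihood of ranking the positive first.
    return -(logits[0] - log_sum)
```

Minimizing this loss pulls the query embedding toward its positive passage while pushing it away from the hard negatives; in practice the loss is averaged over a batch and backpropagated through the encoder, and in-batch negatives are often added alongside the mined ones.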