WebFAQ 2.0: A Multilingual QA Dataset with Mined Hard Negatives for Dense Retrieval

We introduce WebFAQ 2.0, a new version of the WebFAQ dataset containing 198 million FAQ-based natural question-answer pairs across 108 languages. Compared to the previous version, it significantly expands multilingual coverage and increases the number of bilingually aligned QA pairs to over 14.3M, making it the largest FAQ-based resource. Unlike the original release, WebFAQ 2.0 uses a novel data collection strategy that directly crawls and extracts relevant web content, resulting in a substantially more diverse and multilingual dataset, with page titles and descriptions providing richer context. In response to community feedback, we also release a hard negatives dataset for training dense retrievers, with 1.25M queries across 20 languages. These hard negatives were mined using a two-stage retrieval pipeline and include cross-encoder scores for 200 negatives per query. We further show how this resource enables two primary fine-tuning strategies for dense retrievers: contrastive learning with MultipleNegativesRanking loss, and knowledge distillation with MarginMSE loss. WebFAQ 2.0 is not a static resource but part of a long-term effort. Since late 2025, structured FAQs have been released regularly through the Open Web Index, enabling continuous expansion and refinement. We publish the datasets and training scripts to facilitate further research in multilingual and cross-lingual IR. The dataset and all related resources are publicly available on GitHub and HuggingFace.
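
To make the two fine-tuning objectives named above more concrete, the following is a minimal sketch of how each loss is commonly computed. It is not the training code released with the dataset. The function names, the toy vectors, the use of cosine similarity, and the scale factor of 20 are all assumptions chosen for illustration. The sketch uses only the Python standard library.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def mnr_loss(queries, positives, hard_negatives=(), scale=20.0):
    """Illustrative multiple-negatives ranking loss (assumed form).

    For query i the target is positives[i]; all other positives in the
    batch and all mined hard negatives serve as negatives. The loss is
    the mean softmax cross-entropy over scaled cosine similarities.
    """
    candidates = list(positives) + list(hard_negatives)
    total = 0.0
    for i, q in enumerate(queries):
        logits = [scale * cosine(q, c) for c in candidates]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(queries)


def margin_mse_loss(student_pos, student_neg, teacher_pos, teacher_neg):
    """Illustrative MarginMSE loss (assumed form).

    The student is trained so that its score margin between the positive
    and a negative matches the teacher's margin, here taken to be the
    difference of cross-encoder scores.
    """
    diffs = [
        (sp - sn) - (tp - tn)
        for sp, sn, tp, tn in zip(student_pos, student_neg, teacher_pos, teacher_neg)
    ]
    return sum(d * d for d in diffs) / len(diffs)
```

In this sketch, the mined hard negatives enter the contrastive loss as extra candidates, while the cross-encoder scores shipped with each negative would play the role of `teacher_pos` and `teacher_neg` in the distillation loss.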
