Identifying near duplicates within large, noisy text corpora has a myriad of applications, ranging from de-duplicating training datasets, reducing privacy risk, and evaluating test set leakage, to identifying reproduced news articles and literature within large corpora. Across these diverse applications, the overwhelming majority of work relies on N-grams. Limited efforts have been made to evaluate how well N-gram methods perform, in part because it is unclear how one could create an unbiased evaluation dataset for a massive corpus. This study uses the unique timeliness of historical news wires to create a 27,210-document dataset, with 122,876 positive duplicate pairs, for studying noise-robust de-duplication. The time-sensitivity of news makes comprehensive hand labelling feasible, despite the massive overall size of the corpus, because duplicates occur within a narrow date range. The study then develops and evaluates a range of de-duplication methods: hashing and N-gram overlap (which predominate in the literature), a contrastively trained bi-encoder, and a re-ranking approach combining a bi- and cross-encoder. The neural approaches significantly outperform hashing and N-gram overlap. We show that the bi-encoder scales well, de-duplicating a 10 million article corpus on a single GPU card in a matter of hours. We also apply our pre-trained model to the RealNews and patent portions of C4 (Colossal Clean Crawled Corpus), illustrating that, in the presence of various types of noise, a neural approach can identify many near duplicates missed by hashing. The public release of our NEWS-COPY de-duplication dataset, codebase, and pre-trained models will facilitate further research and applications.
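To make the hashing and N-gram overlap baselines concrete, the sketch below applies MinHash over word n-gram shingles with locality-sensitive-hashing blocking, using the datasketch library. This is a minimal illustration, not the study's released code: the shingle size, number of permutations, and Jaccard threshold are placeholder choices rather than the paper's settings.

from datasketch import MinHash, MinHashLSH

def shingles(text, n=3):
    # Word n-gram shingles; near duplicates share most shingles.
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}

def minhash_duplicates(docs, threshold=0.5, num_perm=128, n=3):
    """Return sorted pairs of doc ids whose estimated Jaccard similarity exceeds `threshold`."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    sketches = {}
    for doc_id, text in docs.items():
        m = MinHash(num_perm=num_perm)
        for s in shingles(text, n):
            m.update(s.encode("utf8"))
        lsh.insert(doc_id, m)
        sketches[doc_id] = m
    pairs = set()
    for doc_id, m in sketches.items():
        for other in lsh.query(m):  # candidate matches from the LSH buckets
            if other != doc_id:
                pairs.add(tuple(sorted((doc_id, other))))
    return sorted(pairs)

if __name__ == "__main__":
    # Toy wire-style snippets; "a" and "b" are near duplicates, "c" is unrelated.
    docs = {
        "a": "the governor signed the relief bill on tuesday afternoon",
        "b": "the governor signed the relief bill on tuesday afternoon in the capital",
        "c": "local team wins the championship after extra innings",
    }
    print(minhash_duplicates(docs))

Because OCR noise and wire-service abridgement change many shingles, exact-overlap estimates of this kind can fall below any fixed threshold even when two articles are substantively identical, which is the failure mode the neural methods are designed to address.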
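The neural pipeline can be approximated as follows: embed each article with a bi-encoder, retrieve nearest neighbors in embedding space, and link pairs whose cosine similarity exceeds a threshold. The sketch below is a minimal illustration using an off-the-shelf sentence-transformers model and an exact FAISS inner-product index; the model name, similarity threshold, and neighbor count are placeholder assumptions, not the contrastively trained NEWS-COPY bi-encoder or the paper's settings.

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def biencoder_duplicates(texts, model_name="all-MiniLM-L6-v2", threshold=0.92, k=10):
    """Return sorted index pairs whose embedding cosine similarity exceeds `threshold`."""
    model = SentenceTransformer(model_name)  # placeholder model, not the paper's bi-encoder
    # Normalized embeddings make inner product equal to cosine similarity.
    emb = model.encode(texts, normalize_embeddings=True, convert_to_numpy=True).astype(np.float32)
    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(emb)
    sims, ids = index.search(emb, min(k, len(texts)))  # k nearest neighbors per article
    pairs = set()
    for i, (row_sims, row_ids) in enumerate(zip(sims, ids)):
        for sim, j in zip(row_sims, row_ids):
            if int(j) != i and sim >= threshold:
                pairs.add((min(i, int(j)), max(i, int(j))))
    return sorted(pairs)

if __name__ == "__main__":
    docs = [
        "Wire report: flooding closes the main highway north of town.",
        "Flooding closes main highway north of town, wire report says.",
        "City council approves the new library budget.",
    ]
    print(biencoder_duplicates(docs, threshold=0.8))

At corpus scale, the exact index would typically be swapped for an approximate nearest-neighbor index (for example, an IVF or HNSW variant), and connected components of the resulting pair graph treated as duplicate clusters; a re-ranking variant, as described above, would additionally pass candidate pairs through a cross-encoder before linking them.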