Understanding fine-grained links between documents is crucial for many applications, yet progress is limited by the lack of efficient methods for data curation. To address this limitation, we introduce a domain-agnostic framework for bootstrapping sentence-level cross-document links from scratch. Our approach (1) generates and validates semi-synthetic datasets of linked documents, (2) uses these datasets to benchmark and shortlist the best-performing linking approaches, and (3) applies the shortlisted methods in large-scale human-in-the-loop annotation of natural text pairs. We apply the framework in two distinct domains -- peer review and news -- and show that combining retrieval models with LLMs achieves a 73% human approval rate for suggested links, more than doubling the acceptance of strong retrievers alone. Our framework allows users to produce novel datasets that enable systematic study of cross-document understanding, supporting downstream tasks such as media framing analysis and peer review assessment. All code, data, and annotation protocols are released to facilitate future research.
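The retrieve-then-verify pipeline described above can be sketched as follows. This is a minimal illustrative stand-in, not the paper's actual models: the bag-of-words cosine retriever and the overlap-based `verify` function are hypothetical placeholders for the retrieval models and LLM verifier the abstract refers to.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(src: str, targets: list[str], k: int = 3) -> list[tuple[int, float]]:
    """Stand-in retriever: rank target sentences by lexical cosine similarity."""
    q = Counter(src.lower().split())
    scored = [(i, cosine(q, Counter(t.lower().split()))) for i, t in enumerate(targets)]
    return sorted(scored, key=lambda x: -x[1])[:k]

def verify(src: str, tgt: str) -> bool:
    """Stand-in for the LLM verification step; here, a crude token-overlap check."""
    overlap = set(src.lower().split()) & set(tgt.lower().split())
    return len(overlap) >= 2

def link(src: str, targets: list[str]) -> list[int]:
    """Suggest sentence-level links: shortlist candidates, keep verified ones."""
    return [i for i, _ in retrieve(src, targets) if verify(src, targets[i])]
```

In the actual framework, `retrieve` would be a dense retriever and `verify` an LLM judgment over the sentence pair; the point of the sketch is the two-stage structure, where verification filters the retriever's candidates before they reach human annotators.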