This paper introduces Fetch-A-Set (FAS), a comprehensive benchmark for systems that analyze historical legislative documents, addressing the challenges of large-scale document retrieval in historical contexts. The benchmark comprises a large repository of documents dating back to the 17th century and serves both as a training resource and as an evaluation benchmark for retrieval systems. It fills a gap in the literature by focusing on complex extractive tasks within the domain of cultural heritage. The benchmark tackles the multifaceted problem of historical document analysis, including text-to-image retrieval for queries and image-to-text topic extraction from document fragments, while accommodating varying levels of document legibility. By providing baselines and data, it aims to spur the development and evaluation of robust historical document retrieval systems, particularly in scenarios that span a wide historical period.