We introduce \textbf{ICE-ID}, a benchmark dataset comprising 984,028 records from 16 Icelandic census waves spanning 220 years (1703--1920), with 226,864 expert-curated person identifiers. ICE-ID combines hierarchical geography (farm$\to$parish$\to$district$\to$county), patronymic naming conventions, sparse kinship links (partner, father, mother), and multi-decadal temporal drift -- challenges not captured by standard product-matching or citation datasets. This paper presents an artifact-backed analysis of temporal coverage, missingness, identifier ambiguity, candidate-generation efficiency, and cluster distributions, and situates ICE-ID against classical ER benchmarks (Abt--Buy, Amazon--Google, DBLP--ACM, DBLP--Scholar, Walmart--Amazon, iTunes--Amazon, Beer, Fodors--Zagats). We also define a deployment-faithful temporal OOD protocol and release the dataset, splits, regeneration scripts, analysis artifacts, and a dashboard for interactive exploration. Baseline model comparisons and end-to-end ER results are reported in the companion methods paper.
翻译:本文介绍\textbf{ICE-ID}基准数据集,该数据集包含跨越220年(1703–1920年)的16次冰岛人口普查浪潮中的984,028条记录,并附有226,864个专家标注的个人标识符。ICE-ID融合了层级地理结构(农场→教区→区→县)、父名命名惯例、稀疏亲属关系链接(伴侣、父亲、母亲)以及跨数十年的时间漂移——这些挑战均未被标准产品匹配或引文数据集所涵盖。本文基于实际数据对时间覆盖度、缺失值、标识符歧义、候选生成效率及聚类分布进行了分析,并将ICE-ID与经典实体解析基准(Abt–Buy、Amazon–Google、DBLP–ACM、DBLP–Scholar、Walmart–Amazon、iTunes–Amazon、Beer、Fodors–Zagats)进行了对比。我们还定义了一个符合实际部署需求的时序分布外泛化评估协议,并公开了数据集、数据划分、再生脚本、分析工具及用于交互式探索的仪表板。基线模型对比与端到端实体解析结果已在配套方法论文中报告。