Named Entity Linking (NEL) is a core component of biomedical Information Extraction (IE) pipelines, yet assessing its quality at scale is challenging because expert annotation is costly and corpora are large. In this paper, we present a sampling-based framework that estimates the NEL accuracy of large-scale IE corpora with statistical guarantees under a constrained annotation budget. We frame NEL accuracy estimation as a constrained optimization problem whose objective is to minimize the expected annotation cost subject to a target Margin of Error (MoE) for the corpus-level accuracy estimate. Building on recent work on knowledge graph accuracy estimation, we adapt Stratified Two-Stage Cluster Sampling (STWCS) to the NEL setting, defining label-based strata and global surface-form clusters in a way that is independent of the NEL annotations. Applied to the 11,184 NEL annotations in GutBrainIE, a new biomedical corpus openly released in fall 2025, our framework reaches an MoE $\leq 0.05$ by manually annotating only 2,749 triples (24.6%), yielding an overall accuracy estimate of $0.915 \pm 0.0473$. A time-based cost model and simulations against a Simple Random Sampling (SRS) baseline show that our design reduces expert annotation time by about 29% at a fixed sample size. The framework is generic and can be applied to other NEL benchmarks and IE pipelines that require scalable, statistically robust accuracy assessment.
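
The optimization problem stated above can be sketched in symbols as follows. The notation is an assumption introduced here for illustration and is not taken from the paper: $\mathcal{S}$ denotes a sample drawn under a given sampling design, $\hat{\mu}(\mathcal{S})$ the accuracy estimate computed from it, $\mathrm{cost}(\mathcal{S})$ its annotation cost, $1-\alpha$ the confidence level, and $\varepsilon$ the target MoE ($\varepsilon = 0.05$ in the reported experiment).

$$
\min_{\mathcal{S}} \; \mathbb{E}\left[\mathrm{cost}(\mathcal{S})\right]
\quad \text{subject to} \quad
\mathrm{MoE}\left(\hat{\mu}(\mathcal{S}), \alpha\right) \leq \varepsilon
$$

Read against the reported figures, the constraint is met with $\mathrm{MoE} = 0.0473 \leq 0.05$, the estimate $0.915 \pm 0.0473$ corresponds to the interval $[0.8677, 0.9623]$, and the annotated fraction is $2{,}749 / 11{,}184 \approx 0.246$.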