Metadata quality is crucial for digital objects to be discovered through digital library interfaces. However, for various reasons, the metadata of digital objects is often incomplete, inconsistent, or incorrect. We investigate methods to automatically detect, correct, and canonicalize scholarly metadata, using seven key fields of electronic theses and dissertations (ETDs) as a case study. We propose MetaEnhance, a framework that uses state-of-the-art artificial intelligence methods to improve the quality of these fields. To evaluate MetaEnhance, we compiled a metadata quality evaluation benchmark of 500 ETDs by combining subsets sampled using multiple criteria. We tested MetaEnhance on this benchmark and found that the proposed methods achieved nearly perfect F1-scores in detecting errors and, for five of the seven fields, F1-scores in correcting errors ranging from 0.85 to 1.00.
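For readers unfamiliar with the reported metric, the following is a minimal sketch of how an F1-score for error detection on a single field could be computed. It uses the standard definition of F1 (the harmonic mean of precision and recall); the abstract does not state how MetaEnhance's scores were computed, and the function name `f1_score`, the record IDs, and the choice of the field are hypothetical and purely illustrative.

```python
def f1_score(predicted: set, actual: set) -> float:
    """Standard F1 between a set of flagged record IDs and the
    ground-truth set of records that really contain an error."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        # No correct detections: precision or recall is zero (or undefined).
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: ETDs whose field value is truly wrong,
# versus the ETDs a detector flagged for that field.
actual_errors = {"etd-01", "etd-02", "etd-03", "etd-04"}
flagged = {"etd-01", "etd-02", "etd-03", "etd-05"}

score = f1_score(flagged, actual_errors)
print(round(score, 2))  # 3 of 4 flagged are correct, 3 of 4 errors found -> 0.75
```

In this sketch, a score close to 1.00 would mean that nearly every flagged record is truly erroneous and nearly every erroneous record is flagged, which is how "nearly perfect" detection can be read.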