Distant viewing approaches have typically used image datasets close to the contemporary image data used to train machine learning models. Working with images from other historical periods requires expert-annotated data, and the quality of the labels is crucial for the quality of the results. Especially when working with cultural heritage collections that contain myriad uncertainties, annotating data, or re-annotating legacy data, is an arduous task. In this paper, we describe working with two pre-annotated sets of medieval manuscript images that exhibit conflicting and overlapping metadata. Since a manual reconciliation of the two legacy ontologies would be very expensive, we aim (1) to create a more uniform set of descriptive labels to serve as a "bridge" in the combined dataset, and (2) to establish a high-quality hierarchical classification that can serve as input for subsequent supervised machine learning. To achieve these goals, we developed visualization and interaction mechanisms that enable medievalists to combine, regularize, and extend the vocabulary used to describe these and other cognate image datasets. The visual interfaces give experts an overview of relationships in the data that goes beyond the sum total of the metadata. Word and image embeddings, as well as co-occurrences of labels across the datasets, enable batch re-annotation of images and recommendation of label candidates, and support composing a hierarchical classification of labels.
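To make the co-occurrence idea concrete, the following is a minimal sketch of how label candidates might be recommended from co-occurrences across two annotated sets. It is not the system described in this paper: the function names, the `A:`/`B:` prefixes marking the two legacy vocabularies, the toy labels, and the choice of Jaccard overlap as the ranking score are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(annotations):
    """Count how often each unordered pair of labels appears on the same image."""
    counts = Counter()
    for labels in annotations.values():
        # Sort so that each pair is always stored in the same order.
        for a, b in combinations(sorted(set(labels)), 2):
            counts[(a, b)] += 1
    return counts


def recommend_labels(label, annotations, top_k=3):
    """Rank other labels by the Jaccard overlap of the image sets they annotate."""
    images_by_label = {}
    for image_id, labels in annotations.items():
        for name in set(labels):
            images_by_label.setdefault(name, set()).add(image_id)

    target = images_by_label.get(label, set())
    scores = []
    for other, images in images_by_label.items():
        if other == label:
            continue
        union = target | images
        score = len(target & images) / len(union) if union else 0.0
        if score > 0:
            scores.append((other, score))

    # Highest score first; ties are broken alphabetically for stable output.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:top_k]


# Hypothetical toy data: image id -> labels drawn from two legacy vocabularies.
annotations = {
    "img1": ["A:knight", "B:horseman", "B:armor"],
    "img2": ["A:knight", "B:horseman"],
    "img3": ["A:king", "B:crown", "B:armor"],
    "img4": ["A:knight", "A:king", "B:horseman", "B:crown"],
}

for candidate, score in recommend_labels("A:knight", annotations):
    print(f"{candidate}\t{score:.2f}")
```

In this toy data, `B:horseman` annotates exactly the same images as `A:knight`, so it ranks first as a candidate "bridge" label. A fuller treatment would presumably combine such scores with word and image embedding similarity, which this sketch leaves out.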