Recent work in Machine Learning and Computer Vision has highlighted various types of systematic flaws in ground-truth object recognition benchmark datasets. Our basic tenet is that these flaws are rooted in the many-to-many mappings that exist between the visual information encoded in images and the intended semantics of the labels annotating them. As a consequence, the current annotation process is largely under-specified, leaving too much freedom to the subjective judgment of annotators. In this paper, we propose vTelos, an integrated Natural Language Processing, Knowledge Representation, and Computer Vision methodology whose main goal is to make explicit the (otherwise implicit) intended annotation semantics, thus minimizing the number and role of subjective choices. A key element of vTelos is the exploitation of the WordNet lexico-semantic hierarchy as the main means of providing the meaning of natural language labels and, as a consequence, of driving the annotation of images based on the objects and the visual properties they depict. The methodology is validated on images populating a subset of the ImageNet hierarchy.