Stitching Gaps: Fusing Situated Perceptual Knowledge with Vision Transformers for High-Level Image Classification

The increasing demand for automatic high-level image understanding, particularly for detecting abstract concepts (AC) within images, underscores the need for innovative and more interpretable approaches. These approaches need to reconcile traditional deep vision methods with the nuanced, context-dependent knowledge humans employ to interpret images at complex semantic levels. In this work, we leverage situated perceptual knowledge of cultural images to enhance performance and interpretability in AC image classification. We automatically extract perceptual semantic units from images, which we then model and integrate into the ARTstract Knowledge Graph (AKG). This resource captures situated perceptual semantics gleaned from over 14,000 cultural images labeled with ACs. Additionally, we enhance the AKG with high-level linguistic frames. We compute knowledge graph embeddings (KGE) and experiment with relative representations and hybrid approaches that fuse these embeddings with Vision Transformer (ViT) embeddings. Finally, for interpretability, we conduct post hoc qualitative analyses by examining model similarities with training instances. Our results show that our hybrid KGE-ViT methods outperform existing techniques in AC image classification. The post hoc interpretability analyses reveal the ViT's proficiency in capturing pixel-level visual attributes, in contrast with our method's efficacy in representing more abstract and semantic scene elements. We demonstrate the synergy and complementarity between the situated perceptual knowledge of KGE and the sensory-perceptual understanding of deep visual models for AC image classification. This work suggests a strong potential of neuro-symbolic methods for knowledge integration and robust image representation in downstream complex visual comprehension tasks. All the materials and code are available online.

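
To make the idea of relative representations and hybrid fusion more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a relative representation encodes each image as its cosine similarities to a fixed set of anchor images, and that the hybrid representation is a simple concatenation of the KGE-based and ViT-based relative vectors. The function names, embedding dimensions, anchor choice, and random data are all hypothetical and serve only as illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (eps guards against zero vectors)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def relative_representation(embeddings, anchors):
    """Cosine similarity of every embedding to every anchor.

    embeddings: (n, d) array, anchors: (k, d) array -> (n, k) array.
    """
    return l2_normalize(embeddings) @ l2_normalize(anchors).T

def fuse(kge_rel, vit_rel):
    """Assumed fusion strategy: concatenate the two relative vectors."""
    return np.concatenate([kge_rel, vit_rel], axis=1)

# Hypothetical stand-ins for real embeddings of the same 6 images.
rng = np.random.default_rng(0)
n_images = 6
kge = rng.normal(size=(n_images, 8))    # placeholder KG embeddings
vit = rng.normal(size=(n_images, 16))   # placeholder ViT embeddings

# The same images act as anchors in both embedding spaces.
anchor_idx = np.array([0, 1, 2])

kge_rel = relative_representation(kge, kge[anchor_idx])
vit_rel = relative_representation(vit, vit[anchor_idx])
hybrid = fuse(kge_rel, vit_rel)

print(hybrid.shape)
```

Because both relative vectors are indexed by the same anchors, the two embedding spaces become comparable even though their original dimensions differ. Under these assumptions, a downstream classifier would consume `hybrid` in place of either raw embedding.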