Seeing the Intangible: Surveying Automatic High-Level Visual Understanding from Still Images

The field of Computer Vision (CV) was born with a single grand goal: complete image understanding, that is, providing a complete semantic interpretation of an input image. What exactly this goal entails is not straightforward, but theoretical hierarchies of visual understanding point towards a top level of full semantics, within which sits the most complex and subjective information humans can detect from visual data. In particular, non-concrete concepts such as emotions, social values and ideologies seem to be protagonists of this "high-level" visual semantic understanding. While such "abstract concepts" are critical tools for image management and retrieval, their automatic recognition remains a challenge, precisely because they rest at the top of the "semantic pyramid": the well-known semantic gap problem is worsened by their lack of unique perceptual referents and their reliance on less specific features than concrete concepts. Given that there seems to be very little explicit work within CV on the task of abstract social concept (ASC) detection, and that many recent works seem to discuss similar non-concrete entities using different terminology, in this survey we provide a systematic review of CV work that explicitly or implicitly approaches the problem of abstract (specifically social) concept detection from still images. Specifically, this survey provides: (1) a study and clustering of high-level visual understanding semantic elements from a multidisciplinary perspective (computer science, visual studies, and cognitive perspectives); (2) a study and clustering of high-level visual understanding computer vision tasks dealing with the identified semantic elements, so as to identify current CV work that implicitly deals with abstract concept (AC) detection.
