We present P-Topics (Perception Topics) modeling, a novel problem for understanding how images are perceived affectively and across cultures. The goal is to (1) discover and model the different perception experiences in a dataset of images and captions, where each experience is defined by an objective, factual aspect and a subjective, affective aspect, and (2) associate images with their relevant perception experiences. We introduce **PercepT** (**Percep**tion topic **T**ransformer), a two-stage architecture that tackles P-Topics modeling. In the formation stage, PercepT discovers *P-Topics* as visual-textual clusters using an unsupervised training objective, and dynamically selects the number of clusters to match the perceptual richness of the dataset. In the mapping stage, it learns *P-Topic mapping functions* via attention pooling to associate images with their respective clusters. On ArtELingo, PercepT achieves a silhouette score of **0.97**, compared to **0.37** for the closest baseline, reflecting better perceptual clusters. PercepT also achieves an AUC score of **0.94**, compared to **0.77**, showing better mapping to perceptual clusters. Human evaluation confirms that PercepT captures semantically meaningful perception experiences and significantly outperforms existing methods. Our implementation will be made public.
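To make the reported silhouette scores easier to interpret, the following is a minimal sketch of the standard silhouette coefficient, which ranges from -1 to 1 and is higher when points lie close to their own cluster and far from the others. It is not PercepT's evaluation code: the function names, the Euclidean distance, and the toy 2-D points are assumptions chosen for illustration only.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (range -1 to 1)."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            # Convention: a point in a singleton cluster contributes 0.
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy data (illustrative only): two tight, well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels follow the groups
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
print(f"well-separated labeling: {good:.2f}")
print(f"mixed labeling: {bad:.2f}")
```

On this toy data the labeling that follows the two groups scores close to 1, while the labeling that cuts across them scores below 0. A value such as 0.97 therefore indicates compact, well-separated clusters under whatever distance and embedding space the evaluation uses.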