Convolutional neural networks and vision transformers have achieved outstanding performance in machine perception, particularly for image classification. Although these image classifiers excel at predicting image-level class labels, they may not notice missing or displaced parts within an object. As a result, they may fail to detect corrupted images in which semantic information in the object composition is missing or disarrayed. Human perception, in contrast, easily distinguishes such corruptions. To bridge this gap, we introduce the concept of "image grammar", consisting of "image semantics" and "image syntax", which denote, respectively, the semantics of the parts or patches of an image and the order in which these parts are arranged to form a meaningful object. To learn the image grammar for a class of visual objects or scenes, we propose a weakly supervised two-stage approach. In the first stage, we use a deep clustering framework that relies on iterative clustering and feature refinement to produce a part-semantic segmentation. In the second stage, we use a bidirectional LSTM (bi-LSTM) module that processes a sequence of semantic segmentation patches to capture the image syntax. Our framework is trained to reason over patch semantics and detect faulty syntax. We benchmark several grammar learning models on the task of detecting patch corruptions. Finally, we verify the capabilities of our framework on the Celeb and SUNRGBD datasets and demonstrate that it achieves a grammar validation accuracy of 70% to 90% across a wide variety of semantic and syntactic corruption scenarios.
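To make the two kinds of corruption concrete, the following is a minimal sketch of how a part-segmentation map might be turned into a patch sequence and then corrupted. It is an illustration under stated assumptions, not the procedure used in this work: the abstract does not specify how corruptions are generated, so the raster-order patch sequence, the function names (`split_into_patches`, `shuffle_corruption`, `missing_part_corruption`), and the convention that label `0` means background are all hypothetical.

```python
import random


def split_into_patches(seg_map, patch):
    """Cut a 2-D part-label map into a raster-order list of patch x patch blocks."""
    rows, cols = len(seg_map), len(seg_map[0])
    assert rows % patch == 0 and cols % patch == 0, "map must tile evenly"
    patches = []
    for r in range(0, rows, patch):
        for c in range(0, cols, patch):
            patches.append([row[c:c + patch] for row in seg_map[r:r + patch]])
    return patches


def shuffle_corruption(patches, rng):
    """Assumed 'syntax' corruption: swap two patches whose contents differ."""
    out = list(patches)
    pairs = [(i, j)
             for i in range(len(out))
             for j in range(i + 1, len(out))
             if out[i] != out[j]]
    if not pairs:
        return out  # every patch is identical, so no swap changes the order
    i, j = rng.choice(pairs)
    out[i], out[j] = out[j], out[i]
    return out


def missing_part_corruption(patches, rng, background=0):
    """Assumed 'semantic' corruption: overwrite one non-background patch."""
    out = list(patches)
    candidates = [i for i, p in enumerate(out)
                  if any(v != background for row in p for v in row)]
    if not candidates:
        return out  # nothing but background, so there is no part to remove
    i = rng.choice(candidates)
    size = len(out[i])
    out[i] = [[background] * size for _ in range(size)]
    return out


# Toy 4x4 label map: three hypothetical parts (1, 2, 3) plus background (0).
seg_map = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 0, 0],
    [3, 3, 0, 0],
]

rng = random.Random(0)
patches = split_into_patches(seg_map, patch=2)   # valid sequence, label 1
swapped = shuffle_corruption(patches, rng)       # same parts, wrong order, label 0
erased = missing_part_corruption(patches, rng)   # one part missing, label 0
```

Under these assumptions, a sequence model such as the bi-LSTM described above would receive `patches` as a positive example and `swapped` or `erased` as negative examples, and be trained to tell them apart.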