The multimedia community has shown significant interest in perceiving and representing the physical world with multimodal pretrained neural network models; among these, vision-language pretraining (VLP) is currently the most captivating topic. However, few efforts have explored 1) whether essential linguistic knowledge (e.g., semantics and syntax) can be extracted during VLP, and 2) how such linguistic knowledge impacts or enhances multimodal alignment. In response, we aim to elucidate the impact of comprehensive linguistic knowledge, including semantic expression and syntactic structure, on multimodal alignment. Specifically, we design and release SNARE, the first large-scale multimodal alignment probing benchmark, to detect vital linguistic components, e.g., lexical, semantic, and syntactic knowledge. It contains four tasks: Semantic structure, Negation logic, Attribute ownership, and Relationship composition. Based on the proposed probing benchmark, our holistic analyses of five advanced VLP models show that these models: i) are insensitive to complex syntactic structures and rely on content words for sentence comprehension; ii) demonstrate limited comprehension of combinations between sentences and negations; iii) have difficulty determining the presence of actions or spatial relationships within visual information and struggle to verify the correctness of triple combinations. We make our benchmark and code available at \url{https://github.com/WangFei-2019/SNARE/}.