This paper examines the grounding problem in multimodal semantic representation from a computational cognitive-linguistic perspective. We annotate images from the Flickr30k dataset with five perceptual properties: Affordance, Perceptual Salience, Object Number, Gaze Cueing, and Ecological Niche Association (ENA), and examine how these properties are associated with textual elements in the image captions. We find that captions of images showing Gibsonian affordance contain 'holding-verbs' and 'container-nouns' more frequently than captions of images showing telic affordance. Perceptual Salience, Object Number, and ENA are also associated with the choice of linguistic expressions. Our study shows that a comprehensive understanding of objects or events requires cognitive attention, sensitivity to semantic nuances in language, and integration across multiple modalities. We highlight the importance of situated meaning and affordance grounding for natural language understanding, which may help advance human-like interpretation across a range of scenarios.