Similarity Guided Multimodal Fusion Transformer for Semantic Location Prediction in Social Media

Semantic location prediction aims to extract relevant semantic location information from multimodal social media posts, offering a more contextual understanding of daily activities than GPS coordinates. However, the task is challenging because "text-image" pairs contain noise and irrelevant information. Existing methods suffer from insufficient feature representations and do not integrate similarity at different granularities, which makes it difficult to filter out noise and irrelevant information. To address these challenges, we propose a Similarity-Guided Multimodal Fusion Transformer (SG-MFT) for predicting the semantic locations of social media users. First, we use a pre-trained large-scale vision-language model to extract high-quality feature representations from social media posts. Then, we introduce a Similarity-Guided Interaction Module (SIM) that alleviates modality heterogeneity and noise interference by incorporating both coarse-grained and fine-grained similarity guidance into modality interactions. Specifically, at the coarse level we propose a novel similarity-aware feature interpolation attention mechanism that leverages modality-wise similarity to mitigate heterogeneity and reduce noise within each modality. At the fine level, we employ a similarity-aware feed-forward block that uses element-wise similarity to further mitigate the impact of modality heterogeneity. Building on these pre-processed features, which carry minimal noise and modal interference, we propose a Similarity-aware Feature Fusion Module (SFM) that fuses the two modalities with a cross-attention mechanism. Comprehensive experiments demonstrate that the proposed method achieves superior performance in handling modality imbalance while maintaining efficient and effective fusion.
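The abstract does not give formulas, so the following is only a minimal sketch of what coarse-grained, similarity-guided interpolation around a cross-attention step might look like. Everything in it is an assumption made for illustration: the function names, the use of mean pooling and cosine similarity as the modality-wise similarity, the mapping of that similarity to an interpolation weight in [0, 1], and the absence of learned projections. It should not be read as the SG-MFT implementation.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0


def mean_pool(tokens):
    """Average a list of token vectors into one modality-level vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention.

    Assumption: no learned query/key/value projections, so the keys
    double as the values. A real model would learn these projections.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in keys_values])
        out.append([sum(w * v[j] for w, v in zip(weights, keys_values))
                    for j in range(d)])
    return out


def similarity_guided_interpolation(text_tokens, image_tokens):
    """Blend text features with image-attended features.

    Hypothetical rule: the modality-wise cosine similarity is rescaled
    from [-1, 1] to [0, 1] and used as the interpolation weight, so a
    dissimilar (likely noisy) image contributes less to the text stream.
    """
    sim = cosine(mean_pool(text_tokens), mean_pool(image_tokens))
    weight = (sim + 1.0) / 2.0
    attended = cross_attention(text_tokens, image_tokens)
    fused = [[(1.0 - weight) * t + weight * a for t, a in zip(t_row, a_row)]
             for t_row, a_row in zip(text_tokens, attended)]
    return fused, weight
```

Calling `similarity_guided_interpolation(text_tokens, image_tokens)` on two lists of equal-dimension token vectors returns the blended text features and the weight that was applied. Under these assumptions, a post whose image is unrelated to its text yields a weight near 0 and leaves the text features largely unchanged, which is one simple way to picture how similarity guidance could suppress cross-modal noise.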
