The growth of social media, which is inherently multimodal, has given rise to diverse phenomena and challenges, calling for a single effective approach to automating the associated tasks. Powerful Large Vision Language Models make it possible to handle a variety of tasks simultaneously, but even with carefully designed prompting methods, general-domain models often fall short in aligning with the distinctive speaking style and context of social media tasks. In this paper, we introduce a Large Vision Language Model for Social Media Processing (SoMeLVLM), which is a cognitive framework equipped with five key capabilities: knowledge & comprehension, application, analysis, evaluation, and creation. SoMeLVLM is designed to understand and generate realistic social media behavior. We have developed a 654k multimodal social media instruction-tuning dataset to support our cognitive framework and fine-tune our model. Our experiments demonstrate that SoMeLVLM achieves state-of-the-art performance on multiple social media tasks. Further analysis shows significant advantages over baselines in terms of cognitive abilities.