Social media has become a ubiquitous tool for connecting with others, staying updated with news, expressing opinions, and finding entertainment. However, understanding the intention behind social media posts remains challenging for three reasons: intentions are usually implicit, posts require cross-modality understanding of both text and images, and they contain noisy information such as hashtags, misspelled words, and complicated abbreviations. To address these challenges, we present MIKO, a Multimodal Intention Knowledge DistillatiOn framework that collaboratively leverages a Large Language Model (LLM) and a Multimodal Large Language Model (MLLM) to uncover users' intentions. Specifically, we use an MLLM to interpret the image and an LLM to extract key information from the text, and then instruct the LLM again to generate intentions. By applying MIKO to publicly available social media datasets, we construct an intention knowledge base featuring 1,372K intentions rooted in 137,287 posts. We conduct a two-stage annotation to verify the quality of the generated knowledge and benchmark the performance of widely used LLMs for intention generation. We further apply MIKO to a sarcasm detection dataset and distill a student model to demonstrate the downstream benefits of applying intention knowledge.
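The three-step flow described in the abstract (an MLLM interprets the image, an LLM extracts key information from the text, and the LLM is instructed again to generate intentions) can be sketched as follows. This is only an illustrative outline, not the authors' implementation: the `Post` class, the `generate_intentions` function, the two injected callables, and all prompt wording are hypothetical, and the sketch assumes the final LLM response lists one intention per line.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Post:
    """A social media post with text and one image (hypothetical container)."""
    text: str
    image_path: str


def generate_intentions(
    post: Post,
    describe_image: Callable[[str], str],  # assumed MLLM wrapper: image path -> description
    ask_llm: Callable[[str], str],         # assumed LLM wrapper: prompt -> response text
) -> List[str]:
    # Step 1: the MLLM interprets the image.
    image_description = describe_image(post.image_path)

    # Step 2: the LLM extracts key information from the noisy post text.
    # The prompt wording here is illustrative, not taken from the paper.
    key_info = ask_llm(
        "Extract the key information from this post, ignoring noisy "
        "hashtags, misspellings, and abbreviations:\n" + post.text
    )

    # Step 3: the LLM is instructed again, now with both signals, to
    # generate intentions (assumed format: one intention per line).
    response = ask_llm(
        "Given the image description and the key information below, "
        "list the user's likely intentions, one per line.\n"
        "Image description: " + image_description + "\n"
        "Key information: " + key_info
    )

    # Keep non-empty lines as individual intentions.
    return [line.strip() for line in response.splitlines() if line.strip()]
```

Passing the two models in as plain callables keeps the sketch independent of any particular model API; real MLLM and LLM clients would be wrapped to match these signatures.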