TagGPT: Large Language Models are Zero-shot Multimodal Taggers

Tags play a pivotal role in distributing multimedia content effectively across applications such as search engines and recommendation systems. Recently, large language models (LLMs) have demonstrated impressive capabilities across a wide range of tasks. In this work, we propose TagGPT, a fully automated system capable of tag extraction and multimodal tagging in a completely zero-shot fashion. Our core insight is that, through careful prompt engineering, LLMs can extract and reason about appropriate tags given textual clues from multimodal data, e.g., OCR output, ASR transcripts, and titles. Specifically, to automatically build a high-quality tag set that reflects user intent and interests for a specific application, TagGPT prompts an LLM to predict a large pool of candidate tags from raw data and then filters the candidates by frequency and semantics. Given a new entity that needs tagging for distribution, TagGPT offers two alternative options for zero-shot tagging: a generative method with late semantic matching against the tag set, and a selective method with early matching in prompts. Notably, TagGPT provides a system-level solution based on a modular framework equipped with a pre-trained LLM (GPT-3.5 here) and a sentence embedding model (SimCSE here), either of which can be seamlessly replaced with a more advanced model. TagGPT is applicable to various modalities of data in modern social media and generalizes well to a wide range of applications. We evaluate TagGPT on publicly available datasets, i.e., Kuaishou and Food.com, and demonstrate its effectiveness compared to existing hashtags and off-the-shelf taggers. Project page: https://github.com/TencentARC/TagGPT.
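
The two stages described above, tag set construction by frequency and semantic filtering followed by generative tagging with late semantic matching, can be illustrated with a minimal sketch. This is not the TagGPT implementation: the function names, thresholds, and the toy character n-gram `embed` function are all assumptions made for illustration. In the described system, candidate tags would come from prompting an LLM and similarity would be computed with a sentence embedding model such as SimCSE; here the LLM outputs are simply passed in as lists of strings.

```python
from collections import Counter
import math


def embed(text, n=3):
    """Toy character n-gram count vector.

    Assumption: a stand-in for a real sentence embedding model (e.g. SimCSE),
    used only so that this sketch runs without external dependencies.
    """
    s = f" {text.lower().strip()} "
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_tag_set(candidate_lists, min_freq=2, dedup_threshold=0.7):
    """Filter LLM-proposed candidate tags by frequency, then by semantics.

    candidate_lists: one list of candidate tags per raw data item
    (hypothetical LLM outputs). Tags seen fewer than min_freq times are
    dropped; a tag is also dropped if it is too similar to a more frequent
    tag that has already been kept.
    """
    counts = Counter(
        tag.lower().strip() for tags in candidate_lists for tag in tags
    )
    tag_set = []
    for tag, freq in counts.most_common():
        if freq < min_freq:
            break  # most_common() is sorted, so the rest are rarer still
        if all(cosine(embed(tag), embed(kept)) < dedup_threshold
               for kept in tag_set):
            tag_set.append(tag)
    return tag_set


def match_tags(generated_tags, tag_set, top_k=1):
    """Late semantic matching: map free-form tags onto the fixed tag set."""
    matched = []
    for generated in generated_tags:
        ranked = sorted(
            tag_set,
            key=lambda t: cosine(embed(generated), embed(t)),
            reverse=True,
        )
        for tag in ranked[:top_k]:
            if tag not in matched:
                matched.append(tag)
    return matched


if __name__ == "__main__":
    # Hypothetical LLM outputs for four raw items.
    candidates = [["baking", "dessert"], ["Baking", "desserts"],
                  ["dessert", "vegan"], ["baking"]]
    tag_set = build_tag_set(candidates)
    print(tag_set)
    # Hypothetical free-form tags generated for a new item.
    print(match_tags(["desserts", "baked goods"], tag_set))
```

The selective option would instead place the tag set (or a pre-filtered subset of it) directly in the prompt and ask the LLM to choose from it, so no matching step is needed after generation.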
