Stickers have become a ubiquitous part of modern-day communication, conveying complex emotions through visual imagery. To facilitate the development of more powerful algorithms for analyzing stickers, we propose a large-scale Chinese sticker dataset, namely Sticker820K, which consists of 820k image-text pairs. Each sticker has rich, high-quality textual annotations, including descriptions, optical characters, emotional labels, and style classifications. Although vision-language tasks in the domain of natural images have been well studied, directly applying models trained on natural images, such as CLIP, to sticker data is suboptimal because of the discrepancy between natural and emotive image data. Therefore, we propose StickerCLIP as a benchmark model on the Sticker820K dataset. On the text-to-image retrieval task, StickerCLIP substantially outperforms CLIP, achieving an absolute gain of 66.0% in mean recall on the Sticker820K test set. Additionally, we extend a recently popularized LLM by means of prompt tuning, integrating sticker retrieval into it so that users can retrieve stickers through instructions. We validate the feasibility of this method, demonstrating the potential of prompt tuning to expand LLM abilities without affecting the quality of upstream tasks.
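For readers unfamiliar with the retrieval metric, the following is a minimal sketch of how mean recall for text-to-image retrieval is commonly computed. It is not taken from the paper: the abstract does not define the metric, so the choice of averaging Recall@1, Recall@5, and Recall@10, the assumption that text `i` has exactly one ground-truth image at index `i`, and the function names `recall_at_k` and `mean_recall` are all illustrative assumptions. Under this reading, an absolute gain is a difference in percentage points between two such scores.

```python
def recall_at_k(similarity, k):
    """Fraction of text queries whose ground-truth image ranks in the top k.

    similarity[i][j] is the score between text i and image j.
    Assumption (not from the paper): the ground-truth image for text i is image i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank image indices by descending similarity for this text query.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def mean_recall(similarity, ks=(1, 5, 10)):
    """Average of Recall@k over the given cutoffs (the cutoffs are an assumption)."""
    return sum(recall_at_k(similarity, k) for k in ks) / len(ks)


# Toy 3x3 similarity matrix: text 1 ranks its own image second, the others first.
toy_similarity = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.2, 0.3],
]

r_at_1 = recall_at_k(toy_similarity, 1)
r_at_2 = recall_at_k(toy_similarity, 2)
toy_mean = mean_recall(toy_similarity, ks=(1, 2))
```

In practice the similarity matrix would come from the dot products of normalized text and image embeddings produced by a model such as CLIP; the toy matrix here only illustrates the bookkeeping.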