With the rapid development of the internet, online social media attracts people from different backgrounds through its diverse content. The use of emoji has become a noticeable trend, since emoji carry rich information that crosses cultural and linguistic borders. However, current research on emoji is limited to single-emoji prediction, and few data resources are available for further study of this interesting linguistic phenomenon. To this end, we synthesize a large text-emoji parallel corpus, Text2Emoji, using a large language model. Based on this parallel corpus, we distill a sequence-to-sequence model, EmojiLM, which specializes in bidirectional text-emoji translation. Extensive experiments on public benchmarks, together with human evaluation, demonstrate that our proposed model outperforms strong baselines and that the parallel corpus benefits emoji-related downstream tasks.