Emojis have gained immense popularity on social platforms, where they commonly supplement or replace text. However, existing data mining approaches generally either ignore emojis entirely or treat them as ordinary Unicode characters, which limits a model's ability to capture the rich semantics of emojis and their interaction with the surrounding text. It is therefore necessary to unlock the power of emojis in social media data mining. To this end, we first construct a heterogeneous graph with three types of nodes, i.e., post, word, and emoji nodes, to better represent the different elements of a post. The edges are likewise carefully defined to model how these three elements interact with each other. To facilitate information sharing among post, word, and emoji nodes, we propose a graph pre-training framework for text and emoji co-modeling, which comprises two graph pre-training tasks: node-level graph contrastive learning and edge-level link reconstruction learning. Extensive experiments on the Xiaohongshu and Twitter datasets with two types of downstream tasks demonstrate that our approach achieves significant improvements over previous strong baseline methods.
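To make the graph construction concrete, the following is a minimal sketch (our own illustration, not the paper's implementation) of building a heterogeneous graph with post, word, and emoji nodes from raw posts. The edge types, the simplified emoji regex, and the co-occurrence rule for word-emoji edges are all assumptions for illustration.

```python
import re

# Simplified emoji matcher (an assumption; real systems use a full
# Unicode emoji table rather than a few codepoint ranges).
EMOJI_RE = re.compile(r'[\U0001F300-\U0001FAFF\u2600-\u27BF]')

def build_hetero_graph(posts):
    """Build node sets and typed edge sets from a list of post strings.

    Edge types (hypothetical, for illustration):
      (post, word)  - the word occurs in the post
      (post, emoji) - the emoji occurs in the post
      (word, emoji) - the word and emoji co-occur in the same post
    """
    words, emojis = set(), set()
    edges = {('post', 'word'): set(),
             ('post', 'emoji'): set(),
             ('word', 'emoji'): set()}
    for pid, text in enumerate(posts):
        es = EMOJI_RE.findall(text)
        ws = [t for t in EMOJI_RE.sub(' ', text).split() if t]
        words.update(ws)
        emojis.update(es)
        for w in ws:
            edges[('post', 'word')].add((pid, w))
            for e in es:
                edges[('word', 'emoji')].add((w, e))
        for e in es:
            edges[('post', 'emoji')].add((pid, e))
    return words, emojis, edges
```

In practice the node sets would be mapped to integer indices and fed into a heterogeneous GNN library, with the pre-training tasks (contrastive learning and link reconstruction) defined on top of these typed edges.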