Recently published graph neural networks (GNNs) show promising performance on social event detection tasks. However, most studies target monolingual data in languages with abundant training samples, leaving the more common multilingual settings and lesser-spoken languages relatively unexplored. We therefore present a GNN that incorporates cross-lingual word embeddings to detect events in multilingual data streams. The first contribution is making the GNN work with multilingual data. To this end, we outline a graph construction strategy that aligns messages in different languages at both the node level and the semantic level. At the node level, relationships between messages are established by merging entities that are the same but are referred to in different languages. At the semantic level, non-English message representations are converted into the English semantic space via the cross-lingual word embeddings. The resulting message graph is then uniformly encoded by a GNN model. For the special case where events must be detected in a lesser-spoken language, a novel cross-lingual knowledge distillation framework, called CLKD, exploits prior knowledge learned from similar threads in English to compensate for the paucity of annotated data. Experiments on both synthetic and real-world datasets show the framework to be highly effective at detection both in multilingual data and in languages where training samples are scarce.
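The two alignment levels described above can be illustrated with a minimal sketch. It is not the authors' implementation: the entity lexicon `ENTITY_LEXICON`, the message dictionary layout, and the linear maps in `maps` are all hypothetical stand-ins, and a single linear projection is only one common way to move word vectors into an English embedding space.

```python
import numpy as np

# Hypothetical lexicon mapping entity surface forms to one canonical (English) name.
# In practice this would come from an entity linker or a bilingual dictionary.
ENTITY_LEXICON = {"london": "london", "londres": "london", "paris": "paris"}


def canonical_entities(entities):
    """Node-level alignment: map each entity mention to its canonical form."""
    return {ENTITY_LEXICON.get(e.lower(), e.lower()) for e in entities}


def build_message_graph(messages):
    """Link two messages when they share at least one canonical entity."""
    canon = [canonical_entities(m["entities"]) for m in messages]
    n = len(messages)
    adj = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            if canon[i] & canon[j]:
                adj[i, j] = adj[j, i] = 1
    return adj


def message_features(messages, embeddings, maps):
    """Semantic-level alignment: average word vectors per message, projecting
    non-English vectors into the English space with an assumed linear map."""
    feats = []
    for m in messages:
        vecs = np.stack([embeddings[m["lang"]][t] for t in m["tokens"]])
        if m["lang"] != "en":
            vecs = vecs @ maps[m["lang"]]  # assumed pre-learned projection
        feats.append(vecs.mean(axis=0))
    return np.stack(feats)
```

The adjacency matrix and the feature matrix produced this way are the kind of inputs a GNN encoder would consume; the encoder itself and the CLKD distillation step are left out of the sketch because the abstract does not specify them.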