Cross-lingual representation learning transfers knowledge from resource-rich languages to resource-scarce ones to improve semantic understanding across languages. However, previous work relies on shallow unsupervised data generated by token surface matching, ignoring the global context-aware semantics of the surrounding text tokens. In this paper, we propose an Unsupervised Pseudo Semantic Data Augmentation (UniPSDA) mechanism for cross-lingual natural language understanding that enriches the training data without human intervention. Specifically, to retrieve tokens with similar meanings for semantic data augmentation across different languages, we propose a sequential clustering process in three stages: within a single language, across multiple languages of a language family, and across languages from multiple language families. Meanwhile, to infuse multi-lingual knowledge with context-aware semantics while alleviating the computational burden, we directly replace the key constituents of sentences with the above-learned multi-lingual family knowledge, viewed as pseudo-semantic data. The infusion process is further optimized via three de-biasing techniques without introducing any additional neural parameters. Extensive experiments demonstrate that our model consistently improves performance on general zero-shot cross-lingual natural language understanding tasks, including sequence classification, information extraction, and question answering.