Accurately aligning contextual representations in cross-lingual sentence embeddings is key for effective parallel data mining. A common strategy for achieving this alignment involves disentangling semantics and language in sentence embeddings derived from multilingual pre-trained models. However, we discover that current disentangled representation learning methods suffer from semantic leakage - a term we introduce to describe when a substantial amount of language-specific information unintentionally leaks into semantic representations. This hinders the effective disentanglement of semantic and language representations, making it difficult to retrieve embeddings that distinctly represent the meaning of a sentence. To address this challenge, we propose a novel training objective, ORthogonAlity Constraint LEarning (ORACLE), tailored to enforce orthogonality between semantic and language embeddings. ORACLE builds upon two components: intra-class clustering and inter-class separation. Through experiments on cross-lingual retrieval and semantic textual similarity tasks, we demonstrate that training with the ORACLE objective effectively reduces semantic leakage and enhances semantic alignment within the embedding space.
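To make the two components concrete, the following is a minimal illustrative sketch of what an orthogonality penalty, an intra-class clustering term, and an inter-class separation term could look like over batches of semantic and language embeddings. This is an assumption-based reconstruction, not the authors' exact ORACLE formulation; the function names, the cosine-based forms, and any weighting between terms are hypothetical.

```python
import numpy as np


def _normalize(x):
    """Row-normalize a batch of embeddings to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def orthogonality_penalty(sem, lang):
    """Mean squared cosine similarity between each sentence's semantic
    and language embedding; 0 when the two embeddings are orthogonal.
    (Hypothetical stand-in for ORACLE's orthogonality constraint.)"""
    cos = np.sum(_normalize(sem) * _normalize(lang), axis=1)
    return float(np.mean(cos ** 2))


def intra_class_clustering(sem_src, sem_tgt):
    """Pull semantic embeddings of aligned translation pairs together:
    1 - cosine similarity, averaged over the pairs."""
    cos = np.sum(_normalize(sem_src) * _normalize(sem_tgt), axis=1)
    return float(np.mean(1.0 - cos))


def inter_class_separation(lang_a, lang_b):
    """Push language embeddings of two different languages apart by
    penalizing positive cross-language cosine similarity."""
    cos = _normalize(lang_a) @ _normalize(lang_b).T
    return float(np.mean(np.maximum(cos, 0.0)))
```

In practice these terms would be combined (with tuned weights) into a single training loss on top of a multilingual encoder; the toy functions above only illustrate the geometric intuition that semantic and language information should occupy orthogonal directions while semantic representations of translations cluster together.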