Domain shift is a major challenge in NLP, so many approaches learn domain-invariant features to mitigate domain shift at inference time. Such methods, however, fail to leverage the domain-specific nuances relevant to the task at hand. To avoid this drawback, domain counterfactual generation aims to transform a text from the source domain to a given target domain. However, because data is limited, existing frequency-based methods for this task often miss some valid domain-token associations and introduce spurious ones. Hence, we employ a three-step domain obfuscation approach: frequency-based masking and attention norm-based masking to hide domain-specific cues, followed by unmasking to regain the domain-generic context. Our experiments show that counterfactual samples generated from our masked text improve domain transfer in 10 out of 12 domain sentiment classification settings, with an average accuracy improvement of 2% over the state of the art for unsupervised domain adaptation (UDA). Further, our model outperforms the state of the art by an average of 1.4% accuracy in the adversarial domain adaptation (ADA) setting. Our model also demonstrates its domain adaptation efficacy on a large multi-domain intent classification dataset, where it attains state-of-the-art results. We release the code publicly at \url{https://github.com/declare-lab/remask}.
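To make the frequency-based masking step more concrete, the following is a minimal sketch of one way such masking could work. It is not taken from the released code. The function names, the smoothed frequency-ratio score, the `<mask>` token, and the threshold are all assumptions made for illustration. The attention norm-based masking and the unmasking steps are not shown.

```python
from collections import Counter

MASK = "<mask>"  # hypothetical mask token, chosen for illustration


def domain_token_scores(source_docs, other_docs, smoothing=1.0):
    """Score each source-domain token by a smoothed ratio of its relative
    frequency in the source domain to its relative frequency elsewhere.

    This scoring rule is an illustrative assumption, not the paper's formula.
    """
    src = Counter(tok for doc in source_docs for tok in doc)
    oth = Counter(tok for doc in other_docs for tok in doc)
    vocab_size = len(set(src) | set(oth))
    # Add-k smoothing keeps the ratio finite for tokens unseen in other domains.
    src_total = sum(src.values()) + smoothing * vocab_size
    oth_total = sum(oth.values()) + smoothing * vocab_size
    return {
        tok: ((src[tok] + smoothing) / src_total)
        / ((oth[tok] + smoothing) / oth_total)
        for tok in src
    }


def mask_domain_tokens(doc, scores, threshold=2.5):
    """Replace tokens whose score exceeds the threshold with the mask token."""
    return [MASK if scores.get(tok, 0.0) > threshold else tok for tok in doc]


# Toy data: tokenized reviews from a source domain and from other domains.
electronics = [["the", "battery", "life", "is", "great"],
               ["battery", "died", "fast"]]
movies = [["the", "plot", "is", "great"],
          ["the", "acting", "is", "bad"]]

scores = domain_token_scores(electronics, movies)
masked = mask_domain_tokens(electronics[0], scores)
```

On this toy data only "battery" is masked. Tokens seen once, such as "life", stay below the threshold even though they are equally domain-specific, which illustrates how purely frequency-based scores can miss valid domain-token associations when data is limited.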