Cross-lingual text classification leverages text classifiers trained in a high-resource language to perform text classification in other languages with no or minimal fine-tuning (zero- or few-shot cross-lingual transfer). Nowadays, cross-lingual text classifiers are typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest. However, the performance of these models varies significantly across languages and classification tasks, suggesting that the superposition of the language modelling and classification tasks is not always effective. For this reason, in this paper we propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages. The proposed approach couples 1) a neural machine translator that translates from the target language into a high-resource language with 2) a text classifier trained in the high-resource language. Unlike the classic pipeline, the neural machine translator generates "soft" translations, which permit end-to-end backpropagation during fine-tuning of the pipeline. Extensive experiments have been carried out on three cross-lingual text classification datasets (XNLI, MLDoc and MultiEURLEX), with the results showing that the proposed approach significantly improves performance over a competitive baseline.
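To make the idea of a "soft" translation concrete, the following toy sketch contrasts it with a conventional "hard" translation. It is not the authors' implementation. It assumes that a soft translation can be modelled as the probability-weighted average of the classifier's token embeddings under the translator's output distribution, and every name in it (`soft_embedding`, `hard_embedding`, `classifier_score`, the toy tables) is hypothetical. A finite-difference probe stands in for backpropagation, to show that the classifier's score responds to the translator's logits only in the soft case.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_embedding(logits, embedding_table):
    """Expected embedding under the translator's token distribution (assumed form)."""
    probs = softmax(logits)
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]


def hard_embedding(logits, embedding_table):
    """Embedding of the single most likely token (argmax), as in classic translate-and-test."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return list(embedding_table[best])


def classifier_score(embedding, weights):
    """Hypothetical linear classifier head: a dot product."""
    return sum(e * w for e, w in zip(embedding, weights))


def finite_difference(fn, logits, index, eps=1e-4):
    """Approximate d fn / d logits[index] with a forward difference."""
    bumped = list(logits)
    bumped[index] += eps
    return (fn(bumped) - fn(logits)) / eps


# Toy setup: a 3-token vocabulary with 2-dimensional classifier embeddings.
EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
WEIGHTS = [0.5, -1.0]
LOGITS = [2.0, 0.5, -1.0]  # translator logits for one output position

soft_grad = finite_difference(
    lambda z: classifier_score(soft_embedding(z, EMBEDDINGS), WEIGHTS), LOGITS, 1
)
hard_grad = finite_difference(
    lambda z: classifier_score(hard_embedding(z, EMBEDDINGS), WEIGHTS), LOGITS, 1
)

print("sensitivity through soft translation:", soft_grad)
print("sensitivity through hard translation:", hard_grad)
```

In this toy setting the hard path has zero sensitivity, because a small change in the logits does not change the argmax token, whereas the soft path yields a nonzero value. That nonzero signal is what would let a classification loss update the translator during joint fine-tuning.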