While impressive performance has been achieved on the task of Answer Sentence Selection (AS2) for English, the same does not hold for languages that lack large labeled datasets. In this work, we propose Cross-Lingual Knowledge Distillation (CLKD) from a strong English AS2 teacher as a method to train AS2 models for low-resource languages without the need for labeled data in the target language. To evaluate our method, we introduce 1) Xtr-WikiQA, a translation-based WikiQA dataset for 9 additional languages, and 2) TyDi-AS2, a multilingual AS2 dataset with over 70K questions spanning 8 typologically diverse languages. We conduct extensive experiments on Xtr-WikiQA and TyDi-AS2 with multiple teachers, diverse monolingual and multilingual pretrained language models (PLMs) as students, and both monolingual and multilingual training. The results demonstrate that CLKD either outperforms or rivals both supervised fine-tuning with the same amount of labeled data and a combination of machine translation and the teacher model. Our method can potentially enable stronger AS2 models for low-resource languages, while TyDi-AS2 can serve as the largest multilingual AS2 dataset for further studies in the research community.
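The abstract does not state the distillation objective, so the following is only a minimal sketch of how a label-free cross-lingual distillation loss could look, not the authors' implementation. It assumes that the English teacher scores an English (question, candidate) pair, that the student scores the same pair in the target language, and that both emit two logits (not-answer, answer). The KL-divergence objective, the temperature parameter, and the names `softmax` and `kd_loss` are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities, optionally softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for one (question, candidate) pair.

    Assumption: teacher_logits come from the English teacher on the English
    pair, and student_logits come from the student on the same pair in the
    target language. No gold label is used anywhere.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits for one pair: the teacher is fairly confident the
# candidate answers the question, while the untrained student is undecided.
teacher_logits = [-1.0, 2.0]
student_logits = [0.0, 0.0]
loss = kd_loss(teacher_logits, student_logits)
```

In this sketch the loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, so minimizing it over unlabeled pairs pushes the student toward the teacher's judgments.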