The availability of large, high-quality datasets has been one of the main drivers of recent progress in question answering (QA). Such annotated datasets, however, are difficult and costly to collect and rarely exist in languages other than English, rendering QA technology inaccessible to underrepresented languages. An alternative to building large monolingual training datasets is to leverage pre-trained language models (PLMs) in a few-shot learning setting. Our approach, QAmeleon, uses a PLM to automatically generate multilingual data on which QA models are trained, thus avoiding costly annotation. Prompt tuning the PLM for data synthesis with only five examples per language delivers accuracy superior to translation-based baselines, bridges nearly 60% of the gap between an English-only baseline and a fully supervised upper bound trained on almost 50,000 hand-labeled examples, and always leads to substantial improvements over fine-tuning a QA model directly on labeled examples in low-resource settings. Experiments on the TyDiQA-GoldP and MLQA benchmarks show that few-shot prompt tuning for data synthesis scales across languages and is a viable alternative to large-scale annotation.
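The claim of bridging "nearly 60% of the gap" can be read as a simple ratio: the improvement of the synthetic-data model over the English-only baseline, divided by the distance between that baseline and the fully supervised upper bound. The sketch below shows that arithmetic under this reading. The function name `gap_bridged` and all scores are hypothetical and chosen only for illustration; they are not results reported for QAmeleon.

```python
def gap_bridged(baseline: float, method: float, upper_bound: float) -> float:
    """Return the fraction of the baseline-to-upper-bound gap recovered by `method`."""
    gap = upper_bound - baseline
    if gap <= 0:
        raise ValueError("upper_bound must exceed baseline")
    return (method - baseline) / gap


# Hypothetical scores for illustration only (NOT figures from the paper).
english_only = 60.0       # QA model trained on English data only
fully_supervised = 80.0   # QA model trained on the full hand-labeled set
synthetic_data = 72.0     # QA model trained on PLM-generated data

fraction = gap_bridged(english_only, synthetic_data, fully_supervised)
print(f"{fraction:.0%} of the gap bridged")
```

A value of 1.0 would mean the synthetic-data model matches the supervised upper bound, and 0.0 would mean no gain over the English-only baseline.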