This paper explores the potential of Large Language Models (LLMs) for data augmentation in crosslingual commonsense reasoning datasets, where the available training data is extremely limited. To this end, we use several LLMs, including Dolly-v2, StableVicuna, ChatGPT, and GPT-4, to augment three datasets: XCOPA, XWinograd, and XStoryCloze. We then assess the effectiveness of fine-tuning smaller crosslingual models, mBERT and XLMR, on the synthesised data. We compare training on data generated in English, training on data generated in the target languages, and training on English-generated data translated into the target languages. Our experiments reveal an overall advantage of incorporating LLM-generated data. Training on synthetic data generated by GPT-4, whether English or multilingual, consistently improves performance over the baseline. Other models also yield an overall increase in performance; however, their effectiveness decreases in some settings. We also ask native speakers to evaluate the naturalness and logical soundness of the generated examples in different languages. Human evaluation reveals that LLMs such as ChatGPT and GPT-4 excel at generating natural text in most languages, with a few exceptions such as Tamil. Moreover, ChatGPT trails the original dataset in generating plausible alternatives, while GPT-4 demonstrates competitive logical consistency in the synthesised data.