Large Language Models (LLMs), with their steadily improving reading comprehension and reasoning capabilities, are being applied to a range of complex language tasks, including the automatic generation of language data for various purposes. However, research on applying LLMs to automatic data generation for low-resource languages such as Vietnamese remains underdeveloped and lacks comprehensive evaluation. In this paper, we explore the use of LLMs to automatically generate data for Vietnamese fact-checking, a task that faces significant data limitations. Specifically, we focus on fact-checking data in which claims are synthesized from multiple evidence sentences, allowing us to assess the information-synthesis capabilities of LLMs. We develop an automatic data-construction pipeline using simple prompting techniques on LLMs and explore several methods to improve the quality of the generated data. To evaluate this quality, we conduct both manual assessments and performance evaluations using language models. Experimental results and manual evaluations show that, although fine-tuning substantially improves the quality of the generated data, LLMs still cannot match the data quality produced by humans.