Automatic methods for generating and gathering linguistic data have proven effective for fine-tuning Language Models (LMs) in languages less resourced than English. However, while data quantity has received considerable emphasis, data quality has received less attention. In this work, we investigate the impact of human intervention on machine-generated data when fine-tuning dialogue models. In particular, we study (1) whether post-edited dialogues are perceived as higher in quality than the automatically generated originals; (2) whether fine-tuning on post-edited dialogues leads to noticeable differences in the generated outputs; and (3) whether the effect of post-edited dialogues depends on the parameter size of the LM. To this end, we created HED-IT, a large-scale dataset in which machine-generated dialogues are paired with their human post-edited versions. Using both the edited and unedited portions of HED-IT, we fine-tuned an LM at three different sizes. Results from both human and automatic evaluation show that the difference in training data quality is clearly perceived and that it also affects the models trained on that data. Additionally, our findings indicate that larger models are less sensitive to data quality, whereas data quality has a crucial impact on smaller models. These results improve our understanding of how human intervention on training data affects the development of high-quality LMs.