The data-centric AI approach aims to improve model performance without modifying the model itself, and prior work has shown that it does so. Although data-centric AI based on synthetic data has recently attracted attention for its potential to improve performance, data-centric AI has long been validated exclusively on real-world data and publicly available benchmark datasets. As a result, data-centric AI still depends heavily on real-world data, and models trained on synthetic data have not yet been thoroughly verified. Given these gaps, we ask: does data quality control (noise injection and balanced data), a data-centric AI methodology reported to have a positive impact, have the same positive impact on models trained solely on synthetic data? To address this question, we conducted comparative analyses between models trained on synthetic data and models trained on real-world data, using the grammatical error correction (GEC) task. Our experimental results reveal that data quality control has a positive impact on models trained with real-world data, consistent with existing studies, whereas it has a negative impact on models trained solely on synthetic data.