The LLM Data Auditor: A Metric-oriented Survey on Quality and Trustworthiness in Evaluating Synthetic Data

Kaituo Zhang, Mingzhi Hu, Hoang Anh Duy Le, Fariha Kabir Torsha, Zhimeng Jiang, Minh Khai Bui, Chia-Yuan Chang, Yu-Neng Chuang, Zhen Xiong, Ying Lin, Guanchu Wang, Na Zou

Large Language Models (LLMs) have emerged as powerful tools for generating data across various modalities. By transforming data from a scarce resource into a controllable asset, LLMs mitigate the bottlenecks imposed by the acquisition costs of real-world data for model training, evaluation, and system iteration. However, ensuring the high quality of LLM-generated synthetic data remains a critical challenge. Existing research focuses primarily on generation methodologies and pays limited direct attention to the quality of the resulting data. Furthermore, most studies are restricted to a single modality and lack a unified perspective across data types. To bridge this gap, we propose the \textbf{LLM Data Auditor framework}. In this framework, we first describe how LLMs are used to generate data across six distinct modalities. More importantly, we systematically categorize intrinsic metrics for evaluating synthetic data along two dimensions: quality and trustworthiness. This approach shifts the focus from extrinsic evaluation, which relies on downstream task performance, to the inherent properties of the data itself. Using this evaluation system, we analyze the experimental evaluations of representative generation methods for each modality and identify substantial deficiencies in current evaluation practices. Based on these findings, we offer concrete recommendations for the community to improve the evaluation of data generation. Finally, the framework outlines methodologies for the practical application of synthetic data across different modalities.
