With the growing demand for synthetic data to address contemporary issues in machine learning, such as data scarcity, fairness, and privacy, robust tools for assessing the utility and potential privacy risks of such data are crucial. SynthEval, a novel open-source evaluation framework, distinguishes itself from existing tools by treating categorical and numerical attributes with equal care, without assuming any particular preprocessing steps. This makes it applicable to virtually any synthetic dataset of tabular records. The tool leverages statistical and machine learning techniques to comprehensively evaluate synthetic data fidelity and privacy-preserving integrity. SynthEval integrates a wide selection of metrics that can be used independently or in highly customisable benchmark configurations, and it can easily be extended with additional metrics. In this paper, we describe SynthEval and illustrate its versatility with examples. The framework facilitates better benchmarking and more consistent comparisons of model capabilities.
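To illustrate the idea of treating categorical and numerical attributes with equal care and no assumed preprocessing, the following is a minimal, hypothetical sketch. It is not SynthEval's actual implementation; the function name and metric choices (a Kolmogorov–Smirnov statistic for numerical columns, a total variation distance for categorical columns) are assumptions made for illustration only.

```python
# Hypothetical sketch of type-aware fidelity scoring (NOT SynthEval's code):
# numerical columns are compared via the two-sample KS statistic, categorical
# columns via total variation distance over category frequencies. No encoding
# or scaling of the raw data is required beforehand.
import pandas as pd
from scipy.stats import ks_2samp

def column_fidelity(real: pd.DataFrame, synth: pd.DataFrame) -> dict:
    """Return a per-column divergence score in [0, 1]; 0 means identical."""
    scores = {}
    for col in real.columns:
        if pd.api.types.is_numeric_dtype(real[col]):
            # KS statistic: max distance between empirical CDFs
            scores[col] = ks_2samp(real[col], synth[col]).statistic
        else:
            # Total variation distance between category frequency tables
            p = real[col].value_counts(normalize=True)
            q = synth[col].value_counts(normalize=True)
            scores[col] = 0.5 * p.subtract(q, fill_value=0).abs().sum()
    return scores
```

A framework built this way can report one comparable score per attribute regardless of its type, which is the property the abstract highlights.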