Actuarial ratemaking depends on high-quality data, yet access to such data is often limited by acquisition costs and privacy concerns, among other constraints. In this paper, we explore synthetic data generation as a potential solution to these issues. In addition to generative methods previously studied in the actuarial literature, we explore and benchmark another class of approaches based on Multivariate Imputation by Chained Equations (MICE). In a comparative study using an open-source dataset, we evaluate MICE-based models against other generative models such as Variational Autoencoders and Conditional Tabular Generative Adversarial Networks. We assess how well the synthetic data preserves the marginal distributions of the original variables as well as the multivariate relationships among covariates. We also investigate the consistency between Generalized Linear Models (GLMs) trained on synthetic data and GLMs trained on the original data. Furthermore, we assess the ease of use of each generative approach and study how generically augmenting the original data with synthetic data affects the performance of GLMs for predicting claim counts. Our results highlight the potential of MICE-based methods for creating high-fidelity tabular data with lower implementation complexity than deep generative models.
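As a rough illustration of what checking marginal preservation can look like for a discrete variable such as claim counts, the sketch below computes the total variation distance between the empirical distributions of an original and a synthetic sample. The choice of metric, the function name `total_variation_distance`, and the toy data are assumptions made for illustration only; the abstract does not specify which fidelity measures the study uses.

```python
from collections import Counter


def total_variation_distance(original, synthetic):
    """Total variation distance between the empirical distributions of two samples.

    Returns a value in [0, 1]: 0 means identical empirical marginals,
    1 means the two samples share no category at all.
    """
    if not original or not synthetic:
        raise ValueError("both samples must be non-empty")
    p = Counter(original)   # category -> count in the original sample
    q = Counter(synthetic)  # category -> count in the synthetic sample
    n_p, n_q = len(original), len(synthetic)
    categories = set(p) | set(q)
    # Counter returns 0 for missing categories, so unseen values are handled.
    return 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in categories)


# Hypothetical toy samples of claim counts (not taken from the paper's dataset).
original_claims = [0, 0, 0, 1, 0, 2, 0, 1, 0, 0]
synthetic_claims = [0, 0, 1, 1, 0, 0, 0, 1, 0, 3]

tvd = total_variation_distance(original_claims, synthetic_claims)
print(f"TVD between original and synthetic claim counts: {tvd:.2f}")
```

A per-variable distance of this kind only speaks to marginal fidelity; it says nothing about whether relationships among covariates are preserved, which would need a separate multivariate check.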