An engine to simulate insurance fraud network data

Traditionally, the detection of fraudulent insurance claims relies on business rules and expert judgement, which makes it a time-consuming and expensive process (Óskarsdóttir et al., 2022). Consequently, researchers have been examining ways to develop efficient and accurate analytic strategies to flag suspicious claims. Feeding learning methods with features engineered from the social network of parties involved in a claim is a particularly promising strategy (see, for example, Van Vlasselaer et al., 2016; Tumminello et al., 2023). When developing a fraud detection model, however, we are confronted with several challenges. First, fraud is rare, which creates a high class imbalance and complicates the development of well-performing classification models. Second, only a small number of claims are investigated and receive a label, which leaves a large corpus of unlabeled data. Third, publicly available data are scarce, which hinders both the development of new methods and the validation of existing techniques. We therefore design a simulation machine that is engineered to create synthetic data with a network structure and covariates similar to the real-life insurance fraud data set analyzed in Óskarsdóttir et al. (2022). Further, the user has control over several data-generating mechanisms: the total number of policyholders and parties, the desired level of class imbalance, and the features (and their effect sizes) in the fraud-generating model. As such, the simulation engine enables researchers and practitioners to examine several methodological challenges and to test their insurance fraud detection models, or the strategy used to develop them, in a range of different settings. Moreover, large synthetic data sets can be generated to evaluate the predictive performance of (advanced) machine learning techniques.
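To make the user-controlled mechanisms concrete, the following is a minimal sketch of how such a generator could work. It is not the authors' engine. The function name `simulate_claims`, the two features, and the choice to calibrate the intercept of a logistic model by bisection are all assumptions made for illustration. The sketch links each claim to two parties, derives one network feature from party degree, and tunes the intercept so that the expected fraud rate matches the requested level of imbalance.

```python
import math
import random


def simulate_claims(n_claims, n_parties, target_fraud_rate, effects, seed=0):
    """Hypothetical sketch: simulate claims linked to parties, with fraud labels.

    effects = (effect of the covariate, effect of the network feature).
    """
    rng = random.Random(seed)

    # Bipartite network: every claim is linked to two distinct parties
    # (for example a policyholder and a garage).
    links = [rng.sample(range(n_parties), 2) for _ in range(n_claims)]

    # Degree of a party = number of claims it is involved in.
    degree = [0] * n_parties
    for parties in links:
        for p in parties:
            degree[p] += 1

    # Two features per claim: an ordinary covariate and a network feature.
    rows = []
    for parties in links:
        covariate = rng.gauss(0.0, 1.0)  # assumed standardized covariate
        busiest = math.log(max(degree[p] for p in parties))  # network feature
        rows.append((covariate, busiest))

    # Linear predictor without the intercept.
    scores = [sum(b * x for b, x in zip(effects, row)) for row in rows]

    def mean_prob(intercept):
        # Average fraud probability under a logistic model.
        return sum(1.0 / (1.0 + math.exp(-(intercept + s))) for s in scores) / len(scores)

    # Bisection on the intercept: mean_prob is increasing in the intercept,
    # so we can search for the value that yields the target fraud rate.
    lo, hi = -30.0, 30.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if mean_prob(mid) < target_fraud_rate:
            lo = mid
        else:
            hi = mid
    intercept = (lo + hi) / 2.0

    # Draw the fraud label of each claim from its own probability.
    labels = [
        int(rng.random() < 1.0 / (1.0 + math.exp(-(intercept + s))))
        for s in scores
    ]
    return links, rows, labels, intercept


# Example: 2000 claims, 300 parties, about 5% fraud.
links, rows, labels, intercept = simulate_claims(2000, 300, 0.05, (0.5, 1.0))
print(sum(labels) / len(labels))
```

Changing `target_fraud_rate` adjusts the class imbalance without touching the effect sizes, and changing `effects` alters how strongly each feature drives fraud while the overall fraud rate stays at the target. A fuller generator would also simulate which claims are investigated, so that only a subset of labels is observed.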

