Traditionally, the detection of fraudulent insurance claims relies on business rules and expert judgement, which makes it a time-consuming and expensive process (Óskarsdóttir et al., 2022). Consequently, researchers have been examining ways to develop efficient and accurate analytic strategies to flag suspicious claims. Feeding learning methods with features engineered from the social network of parties involved in a claim is a particularly promising strategy (see, for example, Van Vlasselaer et al. (2016); Tumminello et al. (2023)). When developing a fraud detection model, however, we are confronted with several challenges. The rarity of fraud, for example, creates a high class imbalance, which complicates the development of well-performing classification models. In addition, only a small number of claims are investigated and receive a label, which results in a large corpus of unlabeled data. Yet another challenge is the lack of publicly available data. This hinders not only the development of new methods, but also the validation of existing techniques. We therefore design a simulation machine that is engineered to create synthetic data with a network structure and available covariates similar to the real-life insurance fraud data set analyzed in Óskarsdóttir et al. (2022). Further, the user has control over several data-generating mechanisms: the total number of policyholders and parties, the desired level of class imbalance, and the features (and their effect sizes) in the fraud-generating model. As such, the simulation engine enables researchers and practitioners to examine several methodological challenges and to test their insurance fraud detection models, or the strategy used to develop them, in a range of different settings. Moreover, large synthetic data sets can be generated to evaluate the predictive performance of (advanced) machine learning techniques.
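To make the idea of a user-controlled fraud-generating model concrete, the following minimal sketch shows one way a desired level of class imbalance and a set of effect sizes could jointly determine synthetic fraud labels. It is not the simulation machine described here: the logistic form, the function name `simulate_fraud_labels`, the two covariates, and the effect sizes are all hypothetical choices made for illustration, and network-based features are left out entirely.

```python
import math
import random


def simulate_fraud_labels(features, effects, target_rate, seed=0):
    """Draw binary fraud labels from an assumed logistic model whose
    intercept is calibrated so the mean fraud probability equals target_rate."""
    rng = random.Random(seed)
    # Linear predictor without intercept: sum of effect size * feature value.
    scores = [sum(b * x for b, x in zip(effects, row)) for row in features]

    def mean_prob(intercept):
        total = sum(1.0 / (1.0 + math.exp(-(intercept + s))) for s in scores)
        return total / len(scores)

    # Bisection on the intercept; mean_prob is increasing in the intercept.
    lo, hi = -30.0, 30.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if mean_prob(mid) < target_rate:
            lo = mid
        else:
            hi = mid
    intercept = (lo + hi) / 2.0

    probs = [1.0 / (1.0 + math.exp(-(intercept + s))) for s in scores]
    labels = [1 if rng.random() < p else 0 for p in probs]
    return intercept, probs, labels


# Hypothetical example: 5000 claims with two standardized covariates.
rng = random.Random(42)
features = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(5000)]
intercept, probs, labels = simulate_fraud_labels(
    features, effects=[1.5, -0.5], target_rate=0.02
)
```

In this sketch the effect sizes fix how strongly each covariate drives fraud, while the intercept alone absorbs the requested imbalance, so the two settings can be varied independently.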