Insurers usually turn to generalized linear models (GLMs) for modeling claim frequency and severity data. Due to their success in other fields, machine learning techniques are gaining popularity within the actuarial toolbox. Our paper contributes to the literature on frequency-severity insurance pricing with machine learning via deep learning structures. We present a benchmark study on four insurance data sets with frequency and severity targets in the presence of multiple types of input features. We compare in detail the performance of a GLM on binned input data, a gradient-boosted tree model (GBM), a feed-forward neural network (FFNN), and the combined actuarial neural network (CANN). Each CANN combines a baseline prediction, established with either a GLM or a GBM, with a neural network correction. We explain the data preprocessing steps with specific focus on the multiple types of input features typically present in tabular insurance data sets, such as postal codes and numeric and categorical covariates. Autoencoders are used to embed the categorical variables into the neural network, and we explore their potential advantages in a frequency-severity setting. Model performance is evaluated not only on out-of-sample deviance but also using statistical and calibration performance criteria and managerial tools to get more nuanced insights. Finally, we construct global surrogate models for the neural nets' frequency and severity models. These surrogates enable the translation of the essential insights captured by the FFNNs or CANNs to GLMs. The result is a technical tariff table that can easily be deployed in practice.
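As a minimal sketch of the CANN idea described above, the snippet below multiplies a baseline frequency prediction by the exponential of a small neural network correction. The multiplicative form (a log link with the baseline acting as an offset) is an assumption made here for illustration, since the abstract does not spell out the combination. The function names, the single tanh hidden layer, and all numeric values are hypothetical. The Poisson deviance is included as one common choice of out-of-sample deviance for claim counts.

```python
import math


def poisson_deviance(y, mu):
    """Mean Poisson deviance; the y*log(y/mu) term is taken as 0 when y == 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += 2.0 * (log_term - (yi - mi))
    return total / len(y)


def nn_correction(x, w1, b1, w2, b2):
    """One tanh hidden layer followed by a linear output (hypothetical architecture).

    w1 holds one weight vector per hidden unit.
    """
    hidden = [
        math.tanh(sum(xj * wj for xj, wj in zip(x, unit_weights)) + bias)
        for unit_weights, bias in zip(w1, b1)
    ]
    return sum(h * w for h, w in zip(hidden, w2)) + b2


def cann_predict(x, baseline, w1, b1, w2, b2):
    """Assumed CANN form: baseline prediction times exp(network correction)."""
    return baseline * math.exp(nn_correction(x, w1, b1, w2, b2))


# Illustrative inputs only: two features, two hidden units.
x = [0.3, -1.2]
w1 = [[0.5, -0.1], [0.2, 0.4]]
b1 = [0.0, 0.0]

# With a zero output layer the correction vanishes and the baseline is returned.
mu_start = cann_predict(x, 0.08, w1, b1, [0.0, 0.0], 0.0)

# A non-zero output layer scales the baseline up or down.
mu_adjusted = cann_predict(x, 0.08, w1, b1, [0.3, -0.2], 0.1)
```

Initializing the output layer at zero means training starts from the GLM or GBM baseline, so in this sketch the network only has to learn what the baseline misses.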