Insurers usually turn to generalized linear models (GLMs) for modelling claim frequency and severity data. Due to their success in other fields, machine learning techniques are gaining popularity within the actuarial toolbox. Our paper contributes to the literature on frequency-severity insurance pricing with machine learning via deep learning structures. We present a benchmark study on four insurance data sets with frequency and severity targets in the presence of multiple types of input features. We compare in detail the performance of four models: a GLM on binned input data, a gradient-boosted tree model (GBM), a feed-forward neural network (FFNN), and the combined actuarial neural network (CANN). Each CANN combines a baseline prediction, established with either a GLM or a GBM, with a neural network correction. We explain the data preprocessing steps, with specific focus on the multiple types of input features typically present in tabular insurance data sets, such as postal codes and numeric and categorical covariates. Autoencoders are used to embed the categorical variables into the neural network, and we explore their potential advantages in a frequency-severity setting. Finally, we construct global surrogate models for the neural networks' frequency and severity models. These surrogates translate the essential insights captured by the FFNNs or CANNs into GLMs. The result is a technical tariff table that can easily be deployed in practice.
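
To make the CANN idea concrete, the following minimal sketch shows one common way to combine a baseline prediction with a neural network correction under a log link, where the correction acts multiplicatively through exp(NN(x)). The abstract does not specify the architecture, so the log link, the single tanh hidden layer, the function names (`nn_correction`, `cann_predict`), and all numeric values are assumptions made purely for illustration.

```python
import math

def nn_correction(x, w_hidden, b_hidden, w_out, b_out):
    """Hypothetical one-hidden-layer network with tanh activations.

    Returns a scalar correction on the log scale.
    """
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

def cann_predict(baseline, x, params, exposure=1.0):
    """Baseline prediction (e.g. from a GLM or GBM) scaled by exp(NN(x)).

    Assumes a log link, so the correction is multiplicative.
    """
    return exposure * baseline * math.exp(nn_correction(x, *params))

# Illustrative (made-up) inputs: two preprocessed features and a baseline frequency.
x = [0.5, -1.0]
w_hidden = [[0.1, -0.2], [0.3, 0.4]]
b_hidden = [0.0, 0.1]
baseline_freq = 0.08

# With a zero output layer the correction is exp(0) = 1,
# so the combined model starts exactly at the baseline.
zero_params = (w_hidden, b_hidden, [0.0, 0.0], 0.0)
start = cann_predict(baseline_freq, x, zero_params)

# Non-zero output weights (made up here, learned in practice) adjust the baseline.
adjusted_params = (w_hidden, b_hidden, [0.5, -0.3], 0.05)
adjusted = cann_predict(baseline_freq, x, adjusted_params)

print(start, adjusted)
```

Initializing the output layer at zero is a design choice in this sketch: training then begins from the baseline model and only learns the structure the baseline missed.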