On clustering levels of a hierarchical categorical risk factor

Handling nominal covariates with a large number of categories is challenging for both statistical and machine learning techniques. The problem is exacerbated when the nominal variable has a hierarchical structure; the industry code in a workers' compensation insurance product is a prime example. We commonly rely on methods such as the random effects approach (Campo and Antonio, 2023) to incorporate these covariates in a predictive model. In certain situations, however, even the random effects approach may encounter estimation problems. We propose the data-driven Partitioning Hierarchical Risk-factors Adaptive Top-down (PHiRAT) algorithm, which reduces the hierarchically structured risk factor to its essence by grouping similar categories at each level of the hierarchy. We work top-down and engineer several features to characterize the profile of the categories at a specific level in the hierarchy. In our workers' compensation case study, we characterize the risk profile of an industry via its observed damage rates and claim frequencies. In addition, we use embeddings (Mikolov et al., 2013; Cer et al., 2018) to encode the textual description of the economic activity of the insured company. These features then serve as input to a clustering algorithm that groups similar categories. We show that our method substantially reduces the number of categories and results in a grouping that generalizes to out-of-sample data. Moreover, when we estimate the technical premium of the insurance product under study as a function of the clustered hierarchical risk factor, we obtain a better differentiation between high-risk and low-risk companies.
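To make the top-down idea concrete, the following is a minimal sketch of one possible reading of the procedure, not the PHiRAT implementation itself. The record fields (`section`, `division`, `exposure`, `claims`, `loss`), the function names, the toy data, and the tolerance `tol` are all hypothetical. The damage rate is approximated here as loss per unit of exposure, the text embeddings are omitted, and a simple distance-threshold (single-linkage) grouping stands in for whichever clustering algorithm is actually used.

```python
import math
from collections import defaultdict

def risk_features(records, level):
    """Per-category (claim frequency, damage rate) at one hierarchy level."""
    totals = defaultdict(lambda: [0.0, 0.0, 0.0])  # exposure, claims, loss
    for r in records:
        t = totals[r[level]]
        t[0] += r["exposure"]
        t[1] += r["claims"]
        t[2] += r["loss"]
    return {c: (n / e, l / e) for c, (e, n, l) in totals.items()}

def group_similar(features, tol):
    """Single-linkage grouping: categories within distance tol share a cluster."""
    cats = sorted(features)
    parent = {c: c for c in cats}

    def find(c):
        while parent[c] != c:
            c = parent[c]
        return c

    for i, a in enumerate(cats):
        for b in cats[i + 1:]:
            if math.dist(features[a], features[b]) <= tol:
                parent[find(b)] = find(a)

    clusters = defaultdict(list)
    for c in cats:
        clusters[find(c)].append(c)
    return sorted(clusters.values())

def top_down(records, levels, tol):
    """Cluster level by level; children are compared only within a parent cluster."""
    result = {}
    groups = [records]
    for level in levels:
        result[level] = []
        next_groups = []
        for grp in groups:
            clusters = group_similar(risk_features(grp, level), tol)
            result[level].extend(clusters)
            for cl in clusters:
                next_groups.append([r for r in grp if r[level] in cl])
        groups = next_groups
    return result

# Hypothetical toy portfolio: two hierarchy levels, one record per division.
records = [
    {"section": "A", "division": "A1", "exposure": 100, "claims": 10, "loss": 50},
    {"section": "A", "division": "A2", "exposure": 100, "claims": 11, "loss": 52},
    {"section": "B", "division": "B1", "exposure": 100, "claims": 10, "loss": 51},
    {"section": "C", "division": "C1", "exposure": 100, "claims": 40, "loss": 300},
    {"section": "C", "division": "C2", "exposure": 100, "claims": 80, "loss": 900},
]

result = top_down(records, ["section", "division"], tol=0.1)
print(result)
```

On this toy data, sections A and B have nearly identical risk profiles and are merged, after which their divisions are compared only with each other. Section C stays separate and its two divisions remain distinct. In practice the features would need to be put on a comparable scale before distances are computed; the sketch skips that step.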
