Insurance pricing on price comparison websites via reinforcement learning

The emergence of price comparison websites (PCWs) has presented insurers with unique challenges in formulating effective pricing strategies. Operating on PCWs requires insurers to strike a delicate balance between competitive premiums and profitability, amid obstacles such as low historical conversion rates, limited visibility of competitors' actions, and a dynamic market environment. In addition, the capital-intensive nature of the business means that pricing below customers' risk levels can result in solvency issues for the insurer. To address these challenges, this paper introduces a reinforcement learning (RL) framework that learns the optimal pricing policy by integrating model-based and model-free methods. The model-based component is used to train agents in an offline setting, avoiding cold-start issues, while model-free algorithms are then employed in a contextual bandit (CB) manner to dynamically update the pricing policy to maximise the expected revenue. This facilitates quick adaptation to evolving market dynamics and enhances algorithm efficiency and decision interpretability. The paper also highlights the importance of evaluating pricing policies consistently on an offline dataset and demonstrates the superiority of the proposed methodology over existing off-the-shelf RL/CB approaches. We validate our methodology using synthetic data, generated to reflect the private commercial data available within real-world insurers, and compare against six other benchmark approaches. Our hybrid agent outperforms these benchmarks in terms of sample efficiency and cumulative reward, with the exception of an agent that has access to perfect market information, which would not be available in a real-world set-up.
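The hybrid idea described above (an offline, model-based warm start followed by model-free contextual bandit updates) can be illustrated with a minimal sketch. This is not the paper's implementation: the tabular epsilon-greedy agent, the logistic conversion curve, the price multipliers, the risk bands, and the margin-based reward are all hypothetical choices made only to keep the example short and runnable.

```python
import math
import random

# Hypothetical action set: premium = multiplier * expected claim cost.
PRICE_MULTIPLIERS = [0.9, 1.0, 1.1, 1.2, 1.3]
# Hypothetical contexts: expected claim cost of each customer segment.
RISK_BANDS = [100.0, 200.0, 400.0]


def conversion_prob(multiplier, sensitivity):
    """Assumed logistic demand curve: higher multipliers convert less often."""
    return 1.0 / (1.0 + math.exp(sensitivity * (multiplier - 1.0) + 2.0))


class TabularBanditPricer:
    """Epsilon-greedy contextual bandit with one value estimate per (context, action)."""

    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.value = {
            (c, a): 0.0
            for c in range(len(RISK_BANDS))
            for a in range(len(PRICE_MULTIPLIERS))
        }
        self.count = dict.fromkeys(self.value, 0)

    def warm_start(self, model_sensitivity, pseudo_count=50):
        """Model-based step: seed each estimate with the expected margin
        under an assumed market model, weighted as `pseudo_count` observations."""
        for (c, a) in self.value:
            m = PRICE_MULTIPLIERS[a]
            expected_margin = conversion_prob(m, model_sensitivity) * (m - 1.0) * RISK_BANDS[c]
            self.value[(c, a)] = expected_margin
            self.count[(c, a)] = pseudo_count

    def act(self, context, rng):
        """Explore with probability epsilon, otherwise pick the best-valued price."""
        if rng.random() < self.epsilon:
            return rng.randrange(len(PRICE_MULTIPLIERS))
        return max(range(len(PRICE_MULTIPLIERS)), key=lambda a: self.value[(context, a)])

    def update(self, context, action, reward):
        """Model-free step: incremental mean update from an observed reward."""
        key = (context, action)
        self.count[key] += 1
        self.value[key] += (reward - self.value[key]) / self.count[key]


def run_online(agent, true_sensitivity, steps, seed):
    """Simulate quotes against a market whose price sensitivity may differ
    from the one assumed during the warm start; returns the cumulative reward."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        c = rng.randrange(len(RISK_BANDS))
        a = agent.act(c, rng)
        m = PRICE_MULTIPLIERS[a]
        sold = rng.random() < conversion_prob(m, true_sensitivity)
        # Assumed reward: margin over expected claim cost, earned only on a sale.
        reward = (m - 1.0) * RISK_BANDS[c] if sold else 0.0
        agent.update(c, a, reward)
        total += reward
    return total
```

In this sketch, calling `warm_start` before `run_online` gives the agent a sensible initial policy instead of starting from all-zero estimates, which is the cold-start problem the abstract refers to; the subsequent `update` calls then shift the estimates towards the simulated market's actual behaviour. Note that multipliers below 1.0 carry a negative expected margin here, a crude stand-in for the solvency concern of pricing below risk.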
