A Causal Machine Learning Framework for Treatment Personalization in Clinical Trials: Application to Ulcerative Colitis

Randomized controlled trials estimate average treatment effects, but treatment response heterogeneity motivates personalized approaches. A critical question is whether statistically detectable heterogeneity translates into improved treatment decisions -- these are distinct questions that can yield contradictory answers. We present a modular causal machine learning framework that evaluates each question separately: permutation importance identifies which features predict heterogeneity, best linear predictor (BLP) testing assesses statistical significance, and doubly robust policy evaluation measures whether acting on the heterogeneity improves patient outcomes. We apply this framework to patient-level data from the UNIFI maintenance trial of ustekinumab in ulcerative colitis, comparing placebo, standard-dose ustekinumab every 12 weeks, and dose-intensified ustekinumab every 8 weeks, using cross-fitted X-learner models with baseline demographics, medication history, week-8 clinical scores, laboratory biomarkers, and video-derived endoscopic features. BLP testing identified strong associations between endoscopic features and treatment effect heterogeneity for ustekinumab versus placebo, yet doubly robust policy evaluation showed no improvement in expected remission from incorporating endoscopic features, and out-of-fold multi-arm evaluation showed worse performance. Diagnostic comparison of prognostic contribution against policy value revealed that endoscopic scores behaved as disease severity markers -- improving outcome prediction in untreated patients but adding noise to treatment selection -- while clinical variables (fecal calprotectin, age, CRP) captured the decision-relevant variation. These results demonstrate that causal machine learning applications to clinical trials should include policy-level evaluation alongside heterogeneity testing.
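The doubly robust policy evaluation step can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `dr_policy_value`, the argument layout, and the toy data are hypothetical. It assumes a randomized trial with known assignment probabilities, and it assumes the outcome-model predictions `mu_hat` were produced out-of-fold (cross-fitted). The estimator shown is the standard augmented inverse-probability-weighted (AIPW) score, averaged over the arms a candidate policy would choose.

```python
import numpy as np


def dr_policy_value(y, a, mu_hat, propensity, policy):
    """AIPW (doubly robust) estimate of the mean outcome under `policy`.

    y          : (n,) observed outcomes, e.g. 1 = remission, 0 = no remission
    a          : (n,) observed arm index for each patient (0..k-1)
    mu_hat     : (n, k) predicted outcome under each arm (assumed cross-fitted)
    propensity : (n, k) probability of assignment to each arm (known in an RCT)
    policy     : (n,) arm index the policy would assign to each patient
    Returns (estimated policy value, standard error of the mean score).
    """
    y = np.asarray(y, dtype=float)
    a = np.asarray(a)
    policy = np.asarray(policy)
    mu_hat = np.asarray(mu_hat, dtype=float)
    propensity = np.asarray(propensity, dtype=float)

    n = y.shape[0]
    rows = np.arange(n)

    # Model-based prediction for the arm the policy recommends.
    mu_pi = mu_hat[rows, policy]

    # Residual correction, applied only where the observed arm matches
    # the recommended arm, weighted by the inverse assignment probability.
    followed = (a == policy).astype(float)
    correction = followed / propensity[rows, policy] * (y - mu_pi)

    scores = mu_pi + correction
    return scores.mean(), scores.std(ddof=1) / np.sqrt(n)


# Toy illustration on synthetic data (not trial data): three arms,
# equal randomization, and a deliberately uninformative outcome model.
rng = np.random.default_rng(0)
n, k = 300, 3
a_obs = rng.integers(0, k, size=n)
y_obs = rng.binomial(1, 0.3 + 0.1 * a_obs).astype(float)
mu_flat = np.full((n, k), y_obs.mean())
e_known = np.full((n, k), 1.0 / k)

# Compare two fixed policies: everyone on arm 0 versus everyone on arm 2.
value_arm0, se_arm0 = dr_policy_value(y_obs, a_obs, mu_flat, e_known, np.zeros(n, dtype=int))
value_arm2, se_arm2 = dr_policy_value(y_obs, a_obs, mu_flat, e_known, np.full(n, 2))
print(f"arm 0: {value_arm0:.3f} (SE {se_arm0:.3f}); arm 2: {value_arm2:.3f} (SE {se_arm2:.3f})")
```

Comparing the estimated value of a policy built with a feature set against one built without it is one way to ask whether detected heterogeneity changes decisions, which is a separate question from whether a heterogeneity test rejects.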
