Efficient frequentist fractional polynomials for skewed dose-response and survival data: a variance-reducing alternative to OLS-FP

From arXiv. Revised and retitled version prepared for journal submission; applied biostatistical framing strengthened, primary-biliary-cirrhosis confirmation added, and supplementary theory separated. 25 pages, 2 figures, 5 tables

Fractional polynomials (FP) are a standard tool for modelling nonlinear dose-response and covariate effects, implemented in the widely used mfp package. The conventional FP fit estimates its coefficients by ordinary least squares (OLS-FP), which is statistically inefficient when the regression errors are skewed or heavy-tailed, a common situation for survival times, concentrations and biomarkers. We present a drop-in replacement that keeps the identical FP model and design but estimates the coefficients with a moment-based score tuned to the residual skewness $\gamma_3$ and kurtosis $\gamma_4$, giving a closed-form efficiency factor $g_2 = 1 - \gamma_3^2/(2+\gamma_4)$ relative to OLS-FP. At realistic sample sizes, the method reduces slope-coefficient variance by 10-20% for mildly skewed errors and by up to roughly 60% for heavy-tailed log-normal errors, while keeping confidence-interval coverage close to nominal. Under symmetric errors it reverts exactly to OLS-FP, so it is never harmful when no gain is available. On the German Breast Cancer Study Group cohort it narrows the tumour-size confidence interval by 26% (bootstrap variance ratio 0.53 against the predicted 0.56), and a primary-biliary-cirrhosis cohort reproduces the gain. The estimator is closed-form, runs in milliseconds, and is released as a reproducible R package (pmm_fp in EstemPMM) with a one-command replication bundle; its core variance identity is machine-checked in Lean 4.
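
As a rough illustration of the efficiency factor quoted above, the following Python sketch evaluates g2 = 1 - gamma3^2/(2+gamma4) from given or sample moments and converts a variance ratio into the implied narrowing of a confidence interval. It is not the pmm_fp interface: the function names are hypothetical, gamma4 is assumed to denote excess kurtosis (so that normal errors give gamma4 = 0), and the interval width is assumed to scale with the square root of the variance.

```python
import math


def efficiency_factor(gamma3, gamma4):
    """Variance ratio g2 = 1 - gamma3^2 / (2 + gamma4) relative to OLS.

    Assumption: gamma4 is the *excess* kurtosis of the residuals.
    """
    denom = 2.0 + gamma4
    if denom <= 0.0:
        raise ValueError("2 + gamma4 must be positive")
    return 1.0 - gamma3 ** 2 / denom


def sample_skew_excess_kurtosis(residuals):
    """Plain moment estimates of skewness and excess kurtosis."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def ci_width_reduction(variance_ratio):
    """Fractional narrowing of a CI whose width scales with sqrt(variance)."""
    return 1.0 - math.sqrt(variance_ratio)


# Exponential errors have skewness 2 and excess kurtosis 6.
g2_exponential = efficiency_factor(2.0, 6.0)

# Symmetric errors (gamma3 = 0) give no gain: the factor is exactly 1.
g2_symmetric = efficiency_factor(0.0, 0.0)

# Variance ratios quoted for the breast cancer cohort.
narrowing_bootstrap = ci_width_reduction(0.53)
narrowing_predicted = ci_width_reduction(0.56)
```

Under these assumptions, exponential errors give a factor of 0.5, and variance ratios of 0.53 and 0.56 correspond to intervals roughly 27% and 25% narrower, which brackets the 26% narrowing reported for tumour size.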
