Code Lifespan Survival Analysis (CLSA): Predicting the Survival of Source Code Lines Using AST-Aware Mining

Context: Predicting which source lines will be deleted, and when, matters for maintenance, technical debt, and review prioritization. Existing mining-software-repositories (MSR) approaches work at file or method granularity, masking individual-statement risk. Objective: We introduce Code Lifespan Survival Analysis (CLSA), the first framework to model code survival at individual-line granularity. CLSA treats each line as a right-censored subject and estimates deletion risk from structural, contextual, and temporal covariates; its strongest predictors (AST structure plus line entropy) can be computed statically from a single file, without version history or bug data. Method: We mine 32.5 million line birth events from 120 open-source TypeScript repositories. A 5-stage bipartite matching pipeline separates true deletions from refactoring noise (migrations and rewrites), preventing 8.3 million false deaths. We fit a Cox Proportional Hazards model with 15 covariates and check robustness via Weibull and Log-Logistic accelerated failure time (AFT) models, a gamma frailty model, and time-stratified landmark models. Results: More than half of all lines are never deleted (Kaplan-Meier median not reached); among deleted lines the median lifespan is 95.7 days. Covariate effects are strongly time-varying, forming three time regimes. Line Shannon entropy is moderately protective for new code (hazard ratio HR=0.84, 0-90 days) and strongly protective for mature code (HR=0.36, 365+ days), which explains its violation of the proportional-hazards assumption. The effect for lines in conditional branches reverses: slightly protective at birth (HR=0.97), it becomes a risk factor after 90 days (HR=1.21). Repository identity is the largest factor: a gamma frailty model (variance theta=1.449) raises concordance from 0.586 to 0.666, a larger contribution than that of any structural covariate. Conclusion: Line-level survival modeling is tractable, yielding interpretable, mostly static risk signals and a calibration recipe for time-conditional risk scoring in IDEs and code review.
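As a minimal illustration of two ideas from the abstract, the sketch below computes a line's Shannon entropy and a Kaplan-Meier survival curve over right-censored line lifespans. It is not the CLSA implementation. The abstract does not say how line entropy is defined, so character-level entropy in bits is an assumption here, and the function names (`line_entropy`, `kaplan_meier`, `median_survival`) and the toy data are hypothetical.

```python
import math
from collections import Counter


def line_entropy(line: str) -> float:
    """Character-level Shannon entropy in bits (assumed definition)."""
    text = line.strip()
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def kaplan_meier(durations, deleted):
    """Return [(t, S(t))] at each distinct deletion time.

    durations: days from line birth to deletion, or to the end of
               observation for lines that are still alive.
    deleted:   True if the line was deleted, False if right-censored.
    """
    event_times = sorted({t for t, d in zip(durations, deleted) if d})
    survival, curve = 1.0, []
    for t in event_times:
        # Lines still under observation just before time t.
        at_risk = sum(1 for u in durations if u >= t)
        deaths = sum(1 for u, d in zip(durations, deleted) if d and u == t)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


def median_survival(curve):
    """First time with S(t) <= 0.5, or None if the median is not reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Toy data (fabricated for illustration only): 8 lines, 3 of them deleted.
toy_durations = [10, 30, 30, 200, 400, 400, 400, 400]
toy_deleted = [True, True, False, True, False, False, False, False]

toy_curve = kaplan_meier(toy_durations, toy_deleted)
print(toy_curve)
print(median_survival(toy_curve))
print(line_entropy("const x = computeTotal(items);"))
```

In the toy data fewer than half of the lines are ever deleted, so the survival curve never drops to 0.5 and `median_survival` returns `None`. This is the same situation the abstract reports as "Kaplan-Meier median not reached".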
