Context: Predicting which source lines will be deleted, and when, matters for maintenance, technical debt, and review prioritization. Existing mining-software-repositories (MSR) approaches work at file or method granularity, which masks the risk of individual statements. Objective: We introduce Code Lifespan Survival Analysis (CLSA), the first framework to model code survival at individual-line granularity. CLSA treats each line as a right-censored subject and estimates deletion risk from structural, contextual, and temporal covariates; its strongest line-level predictors (AST structure plus line entropy) are computable statically from a single file, without version history or bug data. Method: We mine 32.5 million line birth events from 120 open-source TypeScript repositories. A 5-stage bipartite matching pipeline separates true deletions from refactoring noise (migrations and rewrites), which keeps 8.3 million lines from being miscounted as deleted (false deaths). We fit a Cox Proportional Hazards model with 15 covariates and check robustness via Weibull/Log-Logistic AFT, gamma frailty, and time-stratified landmark models. Results: More than half of all lines are never deleted (Kaplan-Meier median not reached); among deleted lines the median lifespan is 95.7 days. Covariate effects are strongly time-varying, forming three regimes. Line Shannon entropy is moderately protective for new code (HR=0.84, 0-90 days) and strongly protective for mature code (HR=0.36, 365+ days), which explains its proportional-hazards violation. Lines in conditional branches reverse direction: they are marginally protective at birth (HR=0.97) but become a risk factor after 90 days (HR=1.21). Repository identity is the largest factor: a gamma frailty model (variance theta=1.449) raises concordance from 0.586 to 0.666, a gain that outweighs every structural covariate. Conclusion: Line-level survival modeling is tractable, yielding interpretable, mostly static risk signals and a calibration recipe for time-conditional risk scoring in IDEs and code review.
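To make two ingredients of the abstract concrete, the following minimal sketch computes a per-line Shannon entropy and a Kaplan-Meier curve over right-censored line lifespans. It is an illustration under stated assumptions, not the CLSA implementation: the function names (`line_entropy`, `kaplan_meier`, `km_median`) and the toy data are hypothetical, and entropy is taken over characters because the abstract does not say whether characters or tokens are used.

```python
import math
from collections import Counter


def line_entropy(line: str) -> float:
    """Character-level Shannon entropy of one source line, in bits.

    Assumption: characters (not tokens) are the symbols; the abstract
    does not specify which unit CLSA uses.
    """
    text = line.strip()
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def kaplan_meier(durations, deleted):
    """Kaplan-Meier survival curve for right-censored line lifespans.

    durations: days each line was observed.
    deleted:   True if the line was deleted at that time; False if it was
               still alive when observation ended (right-censored).
    Returns [(time, S(time)), ...] at each distinct deletion time.
    """
    events = Counter(t for t, d in zip(durations, deleted) if d)
    exits = Counter(durations)  # lines leaving the risk set at each time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]  # deleted and censored lines both leave
    return curve


def km_median(curve):
    """First time at which S(t) <= 0.5, or None if the median is not reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


if __name__ == "__main__":
    # Hypothetical toy data: 3 of 8 lines deleted, 5 still alive (censored).
    days = [10, 40, 96, 200, 200, 365, 365, 365]
    was_deleted = [True, True, True, False, False, False, False, False]
    curve = kaplan_meier(days, was_deleted)
    print(curve)
    print("KM median:", km_median(curve))
    print("entropy:", line_entropy("const total = items.length;"))
```

On the toy data the survival curve never drops to 0.5, so `km_median` returns `None`. This is the same situation the abstract describes as "Kaplan-Meier median not reached": when most lines are censored, a median lifespan can only be reported for the deleted subset.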