Overview
Code Lifespan Survival Analysis (CLSA) is a framework for predicting whether, and when, individual source code lines will be deleted. It treats each line as a subject whose lifetime may be right-censored and estimates deletion risk from structural, contextual, and temporal covariates. The framework emphasizes predictors that can be computed statically from a single file, such as Abstract Syntax Tree (AST) structure and line entropy, without requiring version history or bug data.
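As an illustration of a predictor that needs nothing beyond the file itself, the sketch below computes the Shannon entropy of a single line. The source does not say whether entropy is measured over characters or tokens, so the character-level definition (in bits) is an assumption, and `line_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter


def line_entropy(line: str) -> float:
    """Character-level Shannon entropy of one source line, in bits.

    Assumption: entropy is taken over the characters of the stripped line.
    A token-level variant would count tokens instead of characters.
    """
    text = line.strip()
    if not text:
        return 0.0  # an empty line carries no information
    counts = Counter(text)
    total = len(text)
    entropy = 0.0
    for n in counts.values():
        p = n / total
        entropy -= p * math.log2(p)
    return entropy


if __name__ == "__main__":
    # A repetitive line scores lower than a varied one.
    print(line_entropy("}}}}"))
    print(line_entropy("const total = items.reduce((a, b) => a + b, 0);"))
```

Because the function reads only the text of the line, it can run in an editor or a pre-commit hook without access to repository history.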
Research Context
Existing approaches in Mining Software Repositories (MSR) typically operate at file or method granularity, which can obscure the risk attached to individual statements. Predicting which source lines will be deleted, and when, is relevant to software maintenance, technical debt management, and code review prioritization. CLSA addresses this gap by working at the granularity of individual lines.
Approach
The study mined 32.5 million line birth events from 120 open-source TypeScript repositories. To separate true deletions from refactoring noise such as migrations and rewrites, it used a 5-stage bipartite matching pipeline, which kept 8.3 million false death events from being counted as deletions. A Cox Proportional Hazards model was then fitted with 15 covariates. Its robustness was checked against Weibull and Log-Logistic Accelerated Failure Time (AFT) models, gamma frailty models, and time-stratified landmark models.
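The five stages of the matching pipeline are not described here, so the following is only a sketch of the general idea under stated assumptions: within one commit, deleted lines form one side of a bipartite graph and added lines the other, and a deleted line that can be paired with an added line of identical normalized text is treated as moved rather than dead. The function name `match_moved_lines` and the whitespace-only normalization are hypothetical choices, not the study's method.

```python
from collections import defaultdict


def normalize(text: str) -> str:
    """Collapse whitespace so that re-indented lines still compare equal."""
    return " ".join(text.split())


def match_moved_lines(deleted: list[str], added: list[str]) -> list[tuple[int, int]]:
    """Greedily pair deleted lines with added lines that have the same text.

    Returns (deleted_index, added_index) pairs. Each added line is used at
    most once, so the result is a valid one-to-one matching.
    """
    available: dict[str, list[int]] = defaultdict(list)
    for j, text in enumerate(added):
        available[normalize(text)].append(j)

    pairs: list[tuple[int, int]] = []
    for i, text in enumerate(deleted):
        key = normalize(text)
        # Blank lines carry no identity, so they are left unmatched here.
        if key and available[key]:
            pairs.append((i, available[key].pop(0)))
    return pairs


if __name__ == "__main__":
    deleted = ["const x = 1;", "return x;", "foo();"]
    added = ["    const x = 1;", "bar();", "return x;"]
    matched = match_moved_lines(deleted, added)
    moved = {i for i, _ in matched}
    # Only lines with no counterpart are recorded as true death events.
    true_deaths = [line for i, line in enumerate(deleted) if i not in moved]
    print(matched, true_deaths)
```

A realistic pipeline would likely add fuzzy matching and cross-file matching in later stages; exact matching is simply the easiest stage to show.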
The covariates of the Cox model include static structural and contextual information about each line. The strongest predictors were those computable statically from a single file, specifically AST structure and line entropy.
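For readers less familiar with the model, the standard form of a Cox Proportional Hazards model with 15 covariates is sketched below. The symbols are generic textbook notation, not notation taken from the study.

```latex
h(t \mid x) = h_0(t)\,\exp\!\Big(\sum_{k=1}^{15} \beta_k x_k\Big),
\qquad
\mathrm{HR}_k = e^{\beta_k}
```

Here $h_0(t)$ is the baseline deletion hazard and $\mathrm{HR}_k$ is the hazard ratio for a one-unit increase in covariate $x_k$. A hazard ratio below 1 indicates a protective effect and a value above 1 indicates elevated deletion risk. The proportional hazards assumption is that each $\beta_k$ stays constant over time.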
Findings
- More than half of all analyzed source code lines were never deleted during the observation period, so the Kaplan-Meier median lifespan was not reached.
- Among lines that were deleted, the median lifespan was 95.7 days.
- Covariate effects on line survival vary strongly over time and fall into three distinct regimes.
- The protective effect of line Shannon entropy grew with code age: it was moderate for new code (Hazard Ratio (HR) of 0.84 at 0-90 days) and strong for mature code (HR of 0.36 at 365+ days). This time-varying effect explains why entropy violates the proportional hazards assumption.
- Lines inside conditional branches showed a reversal: protective at birth (HR=0.97), they became a risk factor after 90 days (HR=1.21).
- Repository identity was the most influential factor in line survival. Adding a gamma frailty term (variance θ = 1.449) raised concordance from 0.586 to 0.666, indicating that its influence outweighed that of every structural covariate.
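The first finding depends on how the Kaplan-Meier estimator handles censoring. A minimal, dependency-free sketch is given below to show why the median is "not reached" when the survival curve never falls to 0.5. The function names and the toy data are illustrative assumptions and are unrelated to the study's dataset.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival)] at each time where at least one death occurs.

    durations: observed lifetime of each line (for example, in days)
    events:    1 if the line was deleted, 0 if it was still alive (censored)
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects that share the same observed time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored subjects both leave the risk set
    return curve


def km_median(curve):
    """First time at which survival is at or below 0.5, or None if never reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


if __name__ == "__main__":
    # Toy data: 3 of 8 lines are deleted, 5 are still alive at day 40.
    durations = [10, 20, 30, 40, 40, 40, 40, 40]
    events = [1, 1, 1, 0, 0, 0, 0, 0]
    print(km_median(kaplan_meier(durations, events)))  # None: median not reached
```

The 95.7-day figure is a different quantity: it is a median computed only over lines that were deleted, so it does not contradict an unreached Kaplan-Meier median.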
Why This Matters
Line-level survival modeling is tractable and yields interpretable, predominantly static risk signals. The findings offer a calibration recipe for time-conditional risk scoring that could be integrated into Integrated Development Environments (IDEs) and code review processes.
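One way such time-conditional scoring might look is sketched below. It selects a set of hazard ratios according to a line's age and combines them multiplicatively, as in a Cox model. Several things here are assumptions: the regime boundaries at 90 and 365 days, the covariate names, the treatment of covariate values as unit-scaled inputs, and the neutral default of 1.0 for any hazard ratio not reported above. Only the four hazard ratios quoted in the findings are filled in.

```python
import math

# Assumed regime boundaries in days: [0, 90), [90, 365), [365, inf).
REGIMES = [(0, 90), (90, 365), (365, math.inf)]

# Hazard ratios per regime. Only values quoted in the findings are listed;
# missing entries default to 1.0 (neutral), which is a placeholder choice.
HAZARD_RATIOS = {
    (0, 90): {"entropy": 0.84, "in_conditional": 0.97},
    (90, 365): {"in_conditional": 1.21},
    (365, math.inf): {"entropy": 0.36},
}


def regime_for(age_days: float) -> tuple:
    """Return the (start, end) regime that contains the given line age."""
    for start, end in REGIMES:
        if start <= age_days < end:
            return (start, end)
    raise ValueError("age_days must be non-negative")


def relative_risk(age_days: float, covariates: dict) -> float:
    """Deletion hazard relative to a baseline line with all covariates at 0.

    Computes exp(sum(log(HR_k) * x_k)) using the hazard ratios of the
    regime that matches the line's current age.
    """
    ratios = HAZARD_RATIOS[regime_for(age_days)]
    log_risk = 0.0
    for name, value in covariates.items():
        log_risk += math.log(ratios.get(name, 1.0)) * value
    return math.exp(log_risk)


if __name__ == "__main__":
    line = {"entropy": 1.0, "in_conditional": 1.0}
    for age in (30, 120, 400):
        print(age, round(relative_risk(age, line), 4))
```

A production scorer would need the full fitted coefficient table for every regime and, given the frailty result, some per-repository adjustment as well.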