A Deep Learning Approach for Overall Survival Prediction in Lung Cancer with Missing Values

Artificial intelligence (AI) plays an increasingly important role in lung cancer research, particularly in the analysis of overall survival (OS). Because missing data are pervasive in the medical domain, our primary objective is to develop an AI model that handles missing values dynamically. We also aim to use all accessible data, analyzing both uncensored patients, who have experienced the event of interest, and censored patients, who have not, by embedding in the model a specialized technique that is rarely used in other AI tasks. Meeting these objectives allows the model to provide precise OS predictions for non-small cell lung cancer (NSCLC) patients. We present a novel approach to survival analysis with missing values in NSCLC that exploits the strengths of the transformer architecture to account only for available features, without requiring any imputation strategy. More specifically, the model tailors the transformer architecture to tabular data by adapting its feature embedding and masked self-attention so that missing values are masked and the available ones are fully exploited. Using loss functions designed specifically for OS, it accounts for both censored and uncensored patients, as well as for changes in risk over time. We compared our method with state-of-the-art survival analysis models coupled with different imputation strategies. Evaluating the results over a period of 6 years at different time granularities, we obtained a Ct-index, a time-dependent variant of the C-index, of 71.97, 77.58 and 80.72 for time units of 1 month, 1 year and 2 years, respectively, outperforming all state-of-the-art methods regardless of the imputation method used.
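The two mechanisms named in the abstract, masking missing features in self-attention and a loss that uses both censored and uncensored patients, can be illustrated with a minimal sketch. It is not the model described here: the function names `masked_attention` and `discrete_time_nll` are hypothetical, the attention is a single head written in plain Python, and the loss is the standard discrete-time hazard likelihood used as a stand-in for the OS-specific losses, whose exact form the abstract does not give.

```python
import math


def masked_attention(queries, keys, values, observed):
    """Scaled dot-product attention that never attends to missing features.

    queries, keys, values: one vector (list of floats) per feature token.
    observed: list of bools, False where the feature value is missing.
    """
    if not any(observed):
        raise ValueError("at least one feature must be observed")
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Missing keys get a score of -inf, so softmax assigns them weight 0.
        scores = [
            sum(a * b for a, b in zip(q, k)) / scale if seen else -math.inf
            for k, seen in zip(keys, observed)
        ]
        top = max(scores)  # finite, because at least one key is observed
        exps = [math.exp(s - top) for s in scores]  # exp(-inf) is 0.0
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def discrete_time_nll(hazards, last_interval, event):
    """Negative log-likelihood of one patient under a discrete-time hazard model.

    hazards[t]: predicted probability of the event in interval t, given
        survival up to the start of interval t.
    last_interval: interval of the event (uncensored) or of last follow-up
        (censored).
    event: True if the event was observed, False if the patient is censored.
    """
    eps = 1e-12  # guards against log(0)
    # Every patient contributes survival through the earlier intervals.
    loss = -sum(math.log(1.0 - h + eps) for h in hazards[:last_interval])
    if event:
        # Uncensored: the event happened in the last interval.
        loss -= math.log(hazards[last_interval] + eps)
    else:
        # Censored: the patient is only known to have survived that interval.
        loss -= math.log(1.0 - hazards[last_interval] + eps)
    return loss


# Three feature tokens; the third is missing, so its value is never used.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0], [2.0], [100.0]]
attended = masked_attention(tokens, tokens, values, [True, True, False])

# The same predicted hazards scored for an uncensored and a censored patient.
hazards = [0.1, 0.2, 0.3]
loss_event = discrete_time_nll(hazards, 1, True)
loss_censored = discrete_time_nll(hazards, 1, False)
```

Because the masked feature receives zero attention weight, whatever placeholder is stored in its slot cannot influence the output, which is why no imputation is needed in this sketch. The loss shows how a censored patient still contributes information: the model is rewarded for predicting survival up to the last follow-up, without any claim about what happens afterwards.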
