A Deep Learning Approach for Overall Survival Prediction in Lung Cancer with Missing Values

In lung cancer research, and in the analysis of overall survival (OS) in particular, artificial intelligence (AI) plays a crucial role, with two specific aims. First, because missing data are prevalent in the medical domain, our primary objective is to develop an AI model that handles missing values dynamically. Second, we aim to use all accessible data by analyzing both uncensored patients, who have experienced the event of interest, and censored patients, who have not; to this end we embed in the model a specialized technique that is not commonly used in other AI tasks. By meeting these objectives, the model aims to provide precise OS predictions for non-small cell lung cancer (NSCLC) patients. We present a novel approach to survival analysis with missing values in NSCLC that exploits the strengths of the transformer architecture to account only for the available features, without requiring any imputation strategy. More specifically, the model tailors the transformer architecture to tabular data by adapting its feature embedding and masked self-attention so that missing values are masked and the available ones are fully exploited. Using losses designed ad hoc for OS, it accounts for both censored and uncensored patients, as well as for changes in risk over time. We compared our method with state-of-the-art survival analysis models coupled with different imputation strategies. We evaluated the results over a period of 6 years using different time granularities, obtaining a Ct-index (a time-dependent variant of the C-index) of 71.97, 77.58 and 80.72 for time units of 1 month, 1 year and 2 years, respectively, and outperforming all state-of-the-art methods regardless of the imputation method used.
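The idea of masking missing features in self-attention, rather than imputing them, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `masked_attention`, the single-head formulation, and the use of one embedding vector per tabular feature are assumptions made only for illustration. Missing features receive a score of negative infinity, so the softmax assigns them zero weight and the output depends only on the observed features.

```python
import math

def masked_attention(queries, keys, values, observed):
    """Single-head scaled dot-product attention that ignores missing features.

    queries, keys, values: one vector (list of floats) per feature token.
    observed: list of bools, False where the feature value is missing.
    (Hypothetical helper for illustration, not taken from the paper.)
    """
    if not any(observed):
        raise ValueError("at least one feature must be observed")
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Missing keys get -inf so that softmax gives them exactly zero weight.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / scale if seen else -math.inf
            for k, seen in zip(keys, observed)
        ]
        top = max(scores)  # finite, because at least one feature is observed
        exps = [math.exp(s - top) for s in scores]  # exp(-inf) == 0.0
        total = sum(exps)
        weights = [e / total for e in exps]
        # Sum over observed tokens only, so placeholder values never leak in.
        outputs.append([
            sum(w * v[d] for w, v, seen in zip(weights, values, observed) if seen)
            for d in range(dim)
        ])
    return outputs
```

With this masking, whatever placeholder is stored for a missing feature (zero, NaN, or anything else) has no effect on the attention output, which is the property that makes an imputation step unnecessary in this sketch.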
