OncoTraj: a public benchmark for longitudinal resistance prediction in EGFR-mutant non-small-cell lung cancer on osimertinib

From arXiv; 24 pages, 7 figures, 4 tables.
Code, data, and trained model weights: https://github.com/span-ai-labs/oncotraj
Python package: pip install oncotraj
Dataset: https://huggingface.co/datasets/span-ai-labs/oncotraj-v1

Resistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer (NSCLC) is the canonical example of predictable clonal evolution under therapeutic pressure, yet no public benchmark exists for training or evaluating computational models on the corresponding longitudinal patient trajectories. We introduce OncoTraj, a public benchmark of 813 EGFR-mutant NSCLC patients receiving first-line osimertinib, harmonized from three real-world clinical-genomic sources: MSK-CHORD (672 patients), AACR Project GENIE BPC NSCLC (34 patients), and the FLAURA molecular-resistance supplement (107 patients). OncoTraj defines three locked tasks: (A) binary classification of progression by a fixed 12-month landmark, (B) regression of time-to-first-progression in days, and (C) six-class classification of the dominant resistance mechanism. We release the harmonized dataset, patient-level train/validation/test splits with an audited no-leakage guarantee, an open-source evaluation harness, and six reference baselines spanning a majority-class predictor, logistic regression, random forest, XGBoost, an LSTM, and a multi-task transformer. With v1's single-timepoint snapshot features, no task clears chance on clean within-source evaluation: the uniformity of this ceiling across every model class localizes the limit to the input modality (single-snapshot tissue NGS rather than serial ctDNA), not the algorithm. The benchmark does recover a reproducible literature-consistent association: TP53 co-mutation raises the 12-month progression rate from 29% to 59% cohort-wide. OncoTraj establishes a reproducible, leakage-audited baseline and converts the modality limit into concrete design requirements for a serial-ctDNA-enriched v2.
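
The Task A landmark label and the patient-level no-leakage check described in the abstract can be illustrated with a minimal sketch. It is not the OncoTraj harness or its API: the function names, the record fields (`patient_id`, `split`), the 365-day reading of "12 months", the split fractions, and the rule that patients censored before the landmark get no label are all assumptions made for illustration. Only the three cohort sizes come from the abstract.

```python
import hashlib
from typing import Optional

# Cohort sizes as stated in the abstract; they should sum to 813.
SOURCE_COUNTS = {"MSK-CHORD": 672, "GENIE BPC NSCLC": 34, "FLAURA supplement": 107}
assert sum(SOURCE_COUNTS.values()) == 813

LANDMARK_DAYS = 365  # assumption: the 12-month landmark is taken as 365 days


def landmark_label(days_to_event: int, progressed: bool,
                   landmark_days: int = LANDMARK_DAYS) -> Optional[int]:
    """Hypothetical Task A label from a Task B style time-to-event record.

    Returns 1 if progression was observed by the landmark, 0 if the patient
    was known to be progression-free at the landmark, and None if the patient
    was censored before the landmark (label unknown under this assumption).
    """
    if progressed:
        return 1 if days_to_event <= landmark_days else 0
    return 0 if days_to_event >= landmark_days else None


def assign_split(patient_id: str, val_frac: float = 0.15,
                 test_frac: float = 0.15) -> str:
    """Map a patient ID to a split by hashing it, so that every record
    belonging to one patient lands in the same split."""
    digest = hashlib.sha256(patient_id.encode("utf-8")).hexdigest()
    u = int(digest[:8], 16) / 0x100000000  # deterministic value in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "validation"
    return "train"


def audit_no_leakage(records: list[dict]) -> None:
    """Raise ValueError if any patient ID appears in more than one split."""
    seen: dict[str, str] = {}
    for record in records:
        first_split = seen.setdefault(record["patient_id"], record["split"])
        if first_split != record["split"]:
            raise ValueError(
                f"patient {record['patient_id']} appears in both "
                f"{first_split} and {record['split']}"
            )


if __name__ == "__main__":
    # Hypothetical patient IDs; two records for P-001 must share a split.
    ids = ["P-001", "P-001", "P-002", "P-003"]
    records = [{"patient_id": pid, "split": assign_split(pid)} for pid in ids]
    audit_no_leakage(records)  # passes: splits are a function of patient ID
    print(landmark_label(days_to_event=200, progressed=True))
```

Hashing the patient ID is one simple way to make leakage structurally impossible, because the split depends only on the ID; a stratified or source-aware split, which the benchmark's within-source evaluation may require, would need a different assignment step but could reuse the same audit.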