Background: High-dimensional genomic data exhibit strong group correlation structures that challenge conventional feature selection methods, which often assume feature independence or rely on pre-defined pathways and are sensitive to outliers and model misspecification. Methods: We propose the Dorfman screening framework, a multi-stage procedure that forms data-driven variable groups via hierarchical clustering, performs group-level and within-group hypothesis testing, and refines the selection using the elastic net (EN) or adaptive elastic net. Robust variants incorporate covariance estimation based on the orthogonalized Gnanadesikan-Kettenring (OGK) estimator, rank-based correlation, and Huber-weighted regression to handle contaminated and non-normal data. Results: In simulations, Dorfman-Sparse-Adaptive-EN performed best under normal conditions, while Robust-OGK-Dorfman-Adaptive-EN showed clear advantages under data contamination, outperforming classical Dorfman and competing methods. Applied to non-small cell lung cancer (NSCLC) gene expression data for trametinib response, robust Dorfman methods achieved the lowest prediction errors and enriched recovery of clinically relevant genes. Conclusions: The Dorfman framework provides an efficient and robust approach to genomic feature selection. Robust-OGK-Dorfman-Adaptive-EN offers strong performance under both ideal and contaminated conditions and scales to ultra-high-dimensional settings, making it well suited for modern genomic biomarker discovery.
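The multi-stage procedure described in the Methods can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`group_f_pvalue`, `dorfman_screen`), the parameters (`n_groups`, `alpha_group`, `alpha_within`), and the specific choices of Pearson correlation, average linkage, a classical F-test at the group level, and marginal correlation tests within groups are all assumptions made for illustration. The robust variants (OGK covariance, rank-based correlation, Huber weighting) and the final elastic net refinement are not shown.

```python
import numpy as np
from scipy import stats
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def group_f_pvalue(Xg, y):
    """P-value of the overall F-test for regressing y on all columns of Xg."""
    n = len(y)
    A = np.column_stack([np.ones(n), Xg])  # design matrix with intercept
    coef, _, rank, _ = np.linalg.lstsq(A, y, rcond=None)
    df_model, df_resid = int(rank) - 1, n - int(rank)
    if df_model <= 0:
        return 1.0  # group carries no usable variation
    if df_resid <= 0:
        return 0.0  # group too large to test jointly: pass it on (assumption)
    rss = np.sum((y - A @ coef) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    f_stat = ((tss - rss) / df_model) / (rss / df_resid)
    return float(stats.f.sf(f_stat, df_model, df_resid))


def dorfman_screen(X, y, n_groups=10, alpha_group=0.05, alpha_within=0.05):
    """Group-then-individual screening; returns (cluster labels, selected indices)."""
    # Stage 1: data-driven groups from the dissimilarity 1 - |correlation|.
    corr = np.corrcoef(X, rowvar=False)
    dist = np.clip(1.0 - np.abs(corr), 0.0, None)
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(tree, t=n_groups, criterion="maxclust")

    selected = []
    for g in np.unique(labels):
        idx = np.flatnonzero(labels == g)
        # Stage 2: test the whole group at once; skip groups with no signal.
        if group_f_pvalue(X[:, idx], y) >= alpha_group:
            continue
        # Stage 3: test individual members only inside flagged groups.
        for j in idx:
            _, pval = stats.pearsonr(X[:, j], y)
            if pval < alpha_within:
                selected.append(int(j))
    return labels, sorted(selected)


if __name__ == "__main__":
    # Synthetic data: 4 correlated blocks of 10 features, one true signal.
    rng = np.random.default_rng(0)
    latent = rng.normal(size=(200, 4))
    X = np.repeat(latent, 10, axis=1) + 0.5 * rng.normal(size=(200, 40))
    y = 3.0 * X[:, 0] + rng.normal(size=200)
    labels, selected = dorfman_screen(X, y, n_groups=4)
    print("groups found:", len(np.unique(labels)))
    print("selected features:", selected)
```

Testing groups first means individual tests run only inside groups that show signal, which is where the efficiency of a Dorfman-style scheme comes from. In this sketch, features that are strongly correlated with a true signal can also pass the marginal within-group test; the abstract describes a subsequent elastic net or adaptive elastic net stage that refines the selection.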