Detecting copy number alterations (CNAs) from next-generation sequencing data remains challenging, particularly for short segments under noisy conditions. Existing segmentation methods often suffer from high false positive rates or fail to detect short aberrations reliably, especially in low-coverage data. In this study, we propose a modified tail-greedy unbalanced Haar (TGUHm) method that introduces a dual-thresholding strategy to improve segmentation accuracy. The proposed approach suppresses spurious spikes while preserving sensitivity to both short and long CNA segments. Extensive simulation studies under Gaussian and heavy-tailed noise show that TGUHm consistently achieves higher true positive rates and lower false positive rates than state-of-the-art methods, including CBS, HaarSeg, and FDRSeg. In particular, it improves detection accuracy for short segments while maintaining competitive overall performance. Application to real cancer genomic data further confirms the method's practical utility, revealing biologically meaningful CNAs associated with known cancer-related genes. These results suggest that TGUHm provides a robust and effective framework for CNA detection in challenging sequencing settings.