Two-component mixture models are particularly useful for identifying differentially expressed genes, but their performance can deteriorate markedly when the alternative distribution violates the assumed parametric form or is asymmetric. We propose a semiparametric mixture model in which the null component is standard normal and the alternative follows a skew-normal scale mixture with an unspecified scale mixing distribution. This formulation accommodates skewness and heavy tails, providing a flexible and computationally tractable tool for differential gene-expression analysis without restrictive distributional assumptions. We establish identifiability and consistency of the model and develop an efficient estimation algorithm that incorporates nonparametric maximum likelihood estimation of the scale distribution. Numerical studies show notable improvements over existing parametric and nonparametric approaches for modeling the alternative distribution, and applications to colon cancer and leukemia datasets demonstrate lower false discovery and false negative rates.
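To make the model structure concrete, the following is a minimal sketch of the mixture density and the resulting posterior null probability for a single test statistic. It is an illustration under stated assumptions, not the estimation algorithm of the paper: the abstract does not give the parametrization, so the location `loc`, the skewness `shape`, the null proportion `null_prob`, and all function names are hypothetical. The unspecified scale mixing distribution is represented here by a discrete distribution (`scales`, `weights`), which is the form a nonparametric maximum likelihood estimate of a mixing distribution typically takes.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def skew_normal_pdf(x, loc, scale, shape):
    """Skew-normal density with location, scale, and skewness (shape)."""
    z = (x - loc) / scale
    return 2.0 / scale * norm_pdf(z) * norm_cdf(shape * z)


def alternative_pdf(x, loc, shape, scales, weights):
    """Skew-normal scale mixture: average over a discrete scale distribution.

    Assumption: the mixing distribution puts mass weights[j] on scales[j].
    """
    return sum(w * skew_normal_pdf(x, loc, s, shape)
               for s, w in zip(scales, weights))


def mixture_pdf(x, null_prob, loc, shape, scales, weights):
    """Two-component density: standard normal null plus the alternative."""
    alt = alternative_pdf(x, loc, shape, scales, weights)
    return null_prob * norm_pdf(x) + (1.0 - null_prob) * alt


def posterior_null_prob(x, null_prob, loc, shape, scales, weights):
    """Posterior probability that a statistic x came from the null component."""
    total = mixture_pdf(x, null_prob, loc, shape, scales, weights)
    return null_prob * norm_pdf(x) / total
```

Mixing over several scales is what allows the alternative to have heavier tails than a single skew-normal, while a nonzero `shape` introduces the skewness. Genes whose statistics yield a small posterior null probability would be the candidates for differential expression.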