Two-component mixture models are particularly useful for identifying differentially expressed genes, but their performance can deteriorate markedly when the alternative distribution departs from parametric assumptions or symmetry. We propose a semiparametric mixture model in which the null component is standard normal and the alternative follows a skew-normal scale mixture with an unspecified scale mixing distribution. This formulation accommodates skewness and heavy tails, providing a flexible and computationally tractable tool for differential gene-expression analysis without restrictive distributional assumptions. We establish identifiability and consistency of the model and develop an efficient estimation algorithm that incorporates nonparametric maximum likelihood estimation of the scale distribution. Numerical studies show notable improvements over existing parametric and nonparametric approaches for modeling the alternative distribution, and applications to colon cancer and leukemia datasets demonstrate reduced false discovery and false negative rates.
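To make the model structure concrete, the following is a minimal sketch of the two-component density described in the abstract, not the authors' implementation. It assumes the common location-scale skew-normal parametrization with density (2/σ)·φ(z)·Φ(λz), z = (x − ξ)/σ, and it represents the unspecified scale mixing distribution by a discrete distribution on finitely many support points, as a nonparametric maximum likelihood estimate typically is. All function and parameter names (`p0`, `loc`, `shape`, `scales`, `weights`) are hypothetical, and the `posterior_null_probability` helper is an illustrative addition showing how such a fitted mixture is commonly used to flag differentially expressed genes; the abstract does not specify that step.

```python
import math


def normal_pdf(z):
    """Standard normal density, used here as the null component."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def skew_normal_pdf(x, loc, scale, shape):
    """Skew-normal density with location `loc`, scale `scale`, skewness `shape`.

    Assumed parametrization: (2 / scale) * phi(z) * Phi(shape * z),
    where z = (x - loc) / scale.
    """
    z = (x - loc) / scale
    return 2.0 / scale * normal_pdf(z) * normal_cdf(shape * z)


def alternative_pdf(x, loc, shape, scales, weights):
    """Skew-normal scale mixture with a discrete mixing distribution.

    `scales` are the support points and `weights` their probabilities
    (assumed non-negative and summing to one).
    """
    return sum(w * skew_normal_pdf(x, loc, s, shape)
               for s, w in zip(scales, weights))


def mixture_pdf(x, p0, loc, shape, scales, weights):
    """Two-component density: p0 * N(0, 1) + (1 - p0) * alternative."""
    alt = alternative_pdf(x, loc, shape, scales, weights)
    return p0 * normal_pdf(x) + (1.0 - p0) * alt


def posterior_null_probability(x, p0, loc, shape, scales, weights):
    """Posterior probability that a test statistic x comes from the null."""
    return p0 * normal_pdf(x) / mixture_pdf(x, p0, loc, shape, scales, weights)
```

In this sketch, setting `shape=0` with a single unit scale and zero location makes the alternative coincide with the null, so `posterior_null_probability` returns `p0` for every `x`. Multiple support points in `scales` produce heavier tails, and a nonzero `shape` produces skewness.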