In many categorical response regression applications, the response categories admit a multiresolution structure. That is, subsets of the response categories may naturally be combined into coarser response categories. In such applications, practitioners are often interested in estimating the resolution at which a predictor affects the response category probabilities. In this article, we propose a method for fitting the multinomial logistic regression model in high dimensions that addresses this problem in a unified and data-driven way. In particular, our method allows practitioners to identify which predictors distinguish between coarse categories but not fine categories, which predictors distinguish between fine categories, and which predictors are irrelevant. For model fitting, we propose a scalable algorithm that can be applied when the coarse categories are defined by either overlapping or nonoverlapping sets of fine categories. Statistical properties of our method reveal that it can take advantage of this multiresolution structure in a way existing estimators cannot. We use our method to model cell type probabilities as a function of a cell's gene expression profile (i.e., cell type annotation). Our fitted model provides novel biological insights that may be useful for future automated and manual cell type annotation methodology.
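To make the three predictor types concrete, the following is a minimal sketch of a multinomial logistic model with a two-level category structure. It is not the proposed estimator or fitting algorithm. The coefficient matrix `beta`, the grouping `coarse_groups`, and all helper names are hypothetical and chosen only for illustration. The sketch relies on one standard fact about the multinomial logistic model: if a predictor has the same coefficient for every fine category within a coarse category, then that predictor can change the coarse category probabilities but not the relative probabilities of the fine categories inside it.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array of logits.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical structure: 4 fine categories in 2 nonoverlapping coarse categories.
coarse_groups = {"A": [0, 1], "B": [2, 3]}

# Rows are predictors, columns are fine categories (illustrative values only).
beta = np.array([
    [1.5, 1.5, -1.0, -1.0],  # predictor 0: equal within each group (coarse only)
    [0.8, -0.8, 0.3, -0.3],  # predictor 1: varies within groups (fine level)
    [0.0, 0.0, 0.0, 0.0],    # predictor 2: no effect (irrelevant)
])

def fine_probs(x):
    # Fine category probabilities under the multinomial logistic model.
    return softmax(x @ beta)

def coarse_probs(p):
    # A coarse category probability is the sum over its fine categories.
    return {g: p[idx].sum() for g, idx in coarse_groups.items()}

def within_group_probs(p):
    # Fine category probabilities conditional on the coarse category.
    return {g: p[idx] / p[idx].sum() for g, idx in coarse_groups.items()}

baseline = fine_probs(np.array([0.0, 0.0, 0.0]))
shift_0 = fine_probs(np.array([1.0, 0.0, 0.0]))  # change predictor 0 only
shift_1 = fine_probs(np.array([0.0, 1.0, 0.0]))  # change predictor 1 only
shift_2 = fine_probs(np.array([0.0, 0.0, 1.0]))  # change predictor 2 only
```

In this toy setup, comparing `coarse_probs` and `within_group_probs` for `baseline` against each shifted input shows the distinction: changing predictor 0 moves probability between coarse categories A and B while leaving the split within each unchanged, changing predictor 1 alters the split within each coarse category, and changing predictor 2 leaves every probability as it was.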