Medical studies frequently require extracting the relationship between each covariate and the outcome, together with statistical confidence measures. Simple parametric models (e.g., the coefficients of a linear regression) are commonly used for this purpose, but they are usually fitted on the whole dataset. However, covariates often do not have a uniform effect over the whole population, so a single simple model can miss the heterogeneous signal. For example, a linear model may explain a subset of the data but fail on the rest because of nonlinearity and heterogeneity. In this paper, we propose DDGroup (data-driven group discovery), a method that identifies subgroups of the data in which the features and the label share a uniform linear relationship. DDGroup outputs an interpretable region in which the linear model is expected to hold. It is simple to implement and computationally tractable. We show theoretically that, given a large enough sample, DDGroup recovers a region where a single linear model with low variance is well-specified (if one exists), and experiments on real-world medical datasets confirm that it can discover regions where a local linear model has improved performance. Our experiments also show that DDGroup can uncover subgroups with qualitatively different relationships that are missed when parametric approaches are simply applied to the whole dataset.
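The motivating phenomenon, a linear model that is well-specified on a subset of the data but not on the whole, can be illustrated with a small synthetic sketch. This is not the DDGroup algorithm: the data-generating process, the region `x < 0.5`, and the helper names `fit_ols` and `mse` are assumptions made for illustration only, and the region is taken as known rather than discovered from the data.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def mse(coef, X, y):
    """Mean squared error of a fitted linear model on (X, y)."""
    A = np.column_stack([np.ones(len(X)), X])
    return float(np.mean((A @ coef - y) ** 2))

rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(0.0, 1.0, size=(n, 1))
x = X[:, 0]
noise = rng.normal(0.0, 0.1, size=n)

# Hypothetical ground truth: linear for x < 0.5, nonlinear elsewhere.
y = np.where(x < 0.5, 1.0 + 2.0 * x, 3.0 * np.sin(12.0 * x)) + noise

# Axis-aligned region, assumed known here (a real method must find it).
in_region = x < 0.5

global_coef = fit_ols(X, y)                        # fitted on all data
local_coef = fit_ols(X[in_region], y[in_region])   # fitted inside the region

# Compare both models on the same points: those inside the region.
global_mse_in_region = mse(global_coef, X[in_region], y[in_region])
local_mse_in_region = mse(local_coef, X[in_region], y[in_region])

print("global fit, error inside region:", global_mse_in_region)
print("local fit,  error inside region:", local_mse_in_region)
print("local coefficients [intercept, slope]:", local_coef)
```

In this toy setting, the model fitted only inside the region recovers coefficients close to the assumed intercept and slope, while the model fitted on the whole dataset is distorted by the nonlinear part of the data.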