Overlapping asymmetric datasets are common in data science and raise the question of how they can be combined in a predictive analysis. In healthcare, a small amount of information, such as an electronic health record, is often available for a large number of patients, whereas only a small number of patients have undergone extensive further testing. Common solutions such as missing-data imputation can be unwise when the smaller cohort differs substantially in scale from the larger sample. The aim of this research is therefore to develop a new method that models the smaller cohort against a particular response while also taking the larger cohort into account. Motivated by non-parametric models, and specifically by flexible smoothing techniques via generalized additive models, we propose a twice penalized P-Spline approximation method: the first penalty prevents over- or under-fitting of the smaller cohort, and the second incorporates the larger cohort. This second penalty is constructed from discrepancies in the marginal value of covariates that exist in both the smaller and larger cohorts. Through data simulations, parameter tuning and model adaptations for continuous and binary responses, we find that our twice penalized approach offers an enhanced fit over a linear B-Spline and a once penalized P-Spline approximation. Applied to a real-life dataset relating to a person's risk of developing Non-Alcoholic Steatohepatitis, the method improves model fit performance by over 65%. Areas for future work include adapting our method so that it does not require dimensionality reduction and considering parametric modelling methods. To our knowledge, this is the first work to propose additional marginal penalties in a flexible regression, and we report a vastly improved model fit that accommodates asymmetric datasets without the need for missing-data imputation.
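To make the idea of a second, marginal penalty concrete, the following is a minimal illustrative sketch in Python (NumPy only). It is not the formulation used in this work, whose details are not given in the abstract. It assumes a single shared covariate, a response that is also observed in the larger cohort, and a second penalty of the hypothetical form lam2 * ||B_marg @ theta - m_marg||^2 that pulls the small-cohort fit towards a marginal curve estimated from the larger cohort. The names `bspline_basis`, `fit_pspline`, `lam1`, `lam2`, `B_marg` and `m_marg`, and the simulated data, are all assumptions introduced for illustration.

```python
import numpy as np

def bspline_basis(x, xmin, xmax, n_seg=10, degree=3):
    """B-spline design matrix on equally spaced knots (Cox-de Boor recursion)."""
    h = (xmax - xmin) / n_seg
    # Knots extend `degree` segments beyond each end of [xmin, xmax].
    knots = xmin + h * np.arange(-degree, n_seg + degree + 1)
    x = np.asarray(x, dtype=float)[:, None]
    # Degree-0 basis: indicator of each half-open knot interval.
    B = ((knots[:-1] <= x) & (x < knots[1:])).astype(float)
    for d in range(1, degree + 1):
        left = (x - knots[:-d - 1]) / (d * h) * B[:, :-1]
        right = (knots[d + 1:] - x) / (d * h) * B[:, 1:]
        B = left + right
    return B  # shape (len(x), n_seg + degree)

def fit_pspline(B, y, lam1, B_marg=None, m_marg=None, lam2=0.0, diff_order=2):
    """Penalized least squares for B-spline coefficients.

    Penalty 1 (standard P-spline): lam1 * ||D @ theta||^2 on coefficient differences.
    Penalty 2 (ASSUMED form, for illustration only):
        lam2 * ||B_marg @ theta - m_marg||^2,
    i.e. discrepancy from a marginal curve estimated on the larger cohort.
    """
    D = np.diff(np.eye(B.shape[1]), n=diff_order, axis=0)
    A = B.T @ B + lam1 * D.T @ D
    b = B.T @ y
    if lam2 > 0:
        A = A + lam2 * B_marg.T @ B_marg
        b = b + lam2 * B_marg.T @ m_marg
    return np.linalg.solve(A, b)

# Simulated data (hypothetical): a large cohort and a small cohort sharing one covariate.
rng = np.random.default_rng(0)
truth = lambda x: np.sin(2 * np.pi * x)
x_large = rng.uniform(0, 1, 2000)
y_large = truth(x_large) + rng.normal(0, 0.5, x_large.size)
x_small = rng.uniform(0, 1, 30)
y_small = truth(x_small) + rng.normal(0, 0.5, x_small.size)

# Marginal curve from the larger cohort, evaluated on a grid.
grid = np.linspace(0, 1, 200)
B_grid = bspline_basis(grid, 0, 1)
theta_large = fit_pspline(bspline_basis(x_large, 0, 1), y_large, lam1=1.0)
m_grid = B_grid @ theta_large

# Small-cohort fits: once penalized versus twice penalized.
B_small = bspline_basis(x_small, 0, 1)
theta_once = fit_pspline(B_small, y_small, lam1=1.0)
theta_twice = fit_pspline(B_small, y_small, lam1=1.0,
                          B_marg=B_grid, m_marg=m_grid, lam2=1.0)

for name, theta in [("once penalized", theta_once), ("twice penalized", theta_twice)]:
    rmse = np.sqrt(np.mean((B_grid @ theta - truth(grid)) ** 2))
    print(f"{name}: RMSE against the simulated truth = {rmse:.3f}")
```

Because both penalties are quadratic in the coefficients, this sketch has a closed-form solution through a single linear solve. In practice `lam1` and `lam2` would need tuning, for example by cross-validation, and a binary response would require an iteratively reweighted variant rather than the least-squares fit shown here.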