Overlapping asymmetric datasets are common in data science and raise the question of how they can be combined in a predictive analysis. In healthcare datasets, a small amount of information, such as an electronic health record, is often available for a large number of patients, whereas only a small number of patients may have undergone extensive further testing. Common solutions such as missing-data imputation can be unwise if the smaller cohort differs substantially in scale from the larger sample. The aim of this research is therefore to develop a new method that models the smaller cohort against a particular response while also taking the larger cohort into account. Motivated by non-parametric models, and specifically by flexible smoothing techniques via generalized additive models, we propose a twice penalized P-Spline approximation method, in which the first penalty prevents over- or under-fitting of the smaller cohort and the second accounts for the larger cohort. The second penalty is constructed from discrepancies in the marginal value of covariates that exist in both the smaller and larger cohorts. Through data simulations, parameter tuning, and model adaptations for continuous and binary responses, we find that our twice penalized approach offers an enhanced fit over a linear B-Spline and a once penalized P-Spline approximation. Applying the method to a real-life dataset relating to a person's risk of developing Non-Alcoholic Steatohepatitis, we see an improvement in model fit of over 65%. Areas for future work include adapting our method so that it does not require dimensionality reduction and considering parametric modelling methods. To our knowledge, this is the first work to propose additional marginal penalties in a flexible regression, and we report a vastly improved model fit that accommodates asymmetric datasets without the need for missing-data imputation.
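To make the idea of a twice penalized fit concrete, the following is a minimal one-covariate sketch, not the authors' formulation. The abstract does not give the exact form of the marginal penalty, so the sketch assumes it is a squared discrepancy between the small-cohort spline, evaluated at the larger cohort's covariate values, and a marginal curve estimated from the larger cohort. It also assumes, for brevity, that the response is observed in both cohorts and that there is a single shared covariate. All names (`bspline_basis`, `fit_twice_penalized`, `lam_smooth`, `lam_marginal`, `m_large`) and the synthetic data are hypothetical.

```python
import numpy as np


def bspline_basis(x, lo, hi, n_seg=10, degree=3):
    """Equally spaced B-spline basis on [lo, hi] via the Cox-de Boor recursion."""
    x = np.asarray(x, dtype=float)
    h = (hi - lo) / n_seg
    knots = lo + h * np.arange(-degree, n_seg + degree + 1)
    # Degree-0 basis: indicator of each half-open knot interval.
    B = ((x[:, None] >= knots[:-1]) & (x[:, None] < knots[1:])).astype(float)
    for d in range(1, degree + 1):
        left = (x[:, None] - knots[:-d - 1]) / (knots[d:-1] - knots[:-d - 1])
        right = (knots[d + 1:] - x[:, None]) / (knots[d + 1:] - knots[1:-d])
        B = left * B[:, :-1] + right * B[:, 1:]
    return B  # shape (len(x), n_seg + degree)


def fit_twice_penalized(B_small, y_small, B_large, m_large,
                        lam_smooth, lam_marginal, diff_order=2):
    """Minimise, over spline coefficients a,

        ||y_small - B_small a||^2            (fit to the smaller cohort)
      + lam_smooth   * ||D a||^2             (P-spline difference penalty)
      + lam_marginal * ||m_large - B_large a||^2   (ASSUMED marginal penalty)

    The third term is an illustrative assumption about the second penalty.
    """
    n_basis = B_small.shape[1]
    D = np.diff(np.eye(n_basis), n=diff_order, axis=0)
    lhs = (B_small.T @ B_small
           + lam_smooth * D.T @ D
           + lam_marginal * B_large.T @ B_large)
    rhs = B_small.T @ y_small + lam_marginal * B_large.T @ m_large
    return np.linalg.solve(lhs, rhs)


# Synthetic example: a large cohort and a much smaller one (hypothetical data).
rng = np.random.default_rng(0)


def truth(x):
    return np.sin(2 * np.pi * x)


x_large = rng.uniform(0, 1, 2000)
y_large = truth(x_large) + rng.normal(0, 0.3, x_large.size)
x_small = rng.uniform(0, 1, 30)
y_small = truth(x_small) + rng.normal(0, 0.3, x_small.size)

B_large = bspline_basis(x_large, 0.0, 1.0)
B_small = bspline_basis(x_small, 0.0, 1.0)

# Marginal curve from the larger cohort: an ordinary once penalized P-spline.
a_large = fit_twice_penalized(B_large, y_large, B_large,
                              np.zeros(x_large.size), 1.0, 0.0)
m_large = B_large @ a_large

# Once penalized (lam_marginal = 0) versus twice penalized small-cohort fits.
a_once = fit_twice_penalized(B_small, y_small, B_large, m_large, 1.0, 0.0)
a_twice = fit_twice_penalized(B_small, y_small, B_large, m_large, 1.0, 0.05)

gap_once = np.mean((B_large @ a_once - m_large) ** 2)
gap_twice = np.mean((B_large @ a_twice - m_large) ** 2)
print(f"marginal discrepancy, once penalized:  {gap_once:.4f}")
print(f"marginal discrepancy, twice penalized: {gap_twice:.4f}")
```

Because each term is quadratic in the coefficients, the fit reduces to a single linear solve. Setting `lam_marginal` to zero recovers an ordinary once penalized P-spline, and increasing it pulls the small-cohort curve towards the larger cohort's marginal estimate. In practice both penalty weights would need to be tuned, for example by cross-validation.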