Handling Overlapping Asymmetric Datasets -- A Twice Penalized P-Spline Approach

Overlapping asymmetric datasets are common in data science and raise the question of how they can be combined in a single predictive analysis. In healthcare, a small amount of information, such as an electronic health record, is often available for a large number of patients, while only a small number of patients have undergone extensive further testing. Common solutions such as missing data imputation can be unwise when the smaller cohort differs substantially in size from the larger sample. The aim of this research is therefore to develop a new method that models the smaller cohort against a particular response while also taking the larger cohort into account. Motivated by non-parametric models, and specifically by flexible smoothing techniques via generalized additive models, we develop a twice penalized P-Spline approximation method: the first penalty prevents over- or under-fitting of the smaller cohort, and the second brings in the larger cohort. This second penalty is built from discrepancies in the marginal value of covariates that exist in both the smaller and the larger cohorts. Through data simulations, parameter tuning and model adaptations for continuous and binary responses, we find that our twice penalized approach offers a better fit than a linear B-Spline and a once penalized P-Spline approximation. Applied to a real-life dataset relating to a person's risk of developing Non-Alcoholic Steatohepatitis, the method improves model fit by over 65%. Areas for future work include adapting the method so that it does not require dimensionality reduction, and considering parametric modelling methods. To our knowledge, however, this is the first work to propose additional marginal penalties in a flexible regression, and we report a vastly improved model fit that accommodates asymmetric datasets without the need for missing data imputation.
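The twice penalized idea can be illustrated with a minimal sketch in Python (NumPy only). This is not the authors' implementation. It assumes a single shared covariate, a continuous response, and a second penalty written as the squared gap between the fitted curve and a marginal curve taken from the larger cohort. The function names, the simulated data, the stand-in marginal curve and the values of lam1 and lam2 are all hypothetical.

```python
import numpy as np

def bspline_basis(x, knots, degree=3):
    """B-spline basis matrix via the Cox-de Boor recursion (equally spaced knots)."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    # Degree 0: indicator functions of the half-open knot intervals.
    B = np.zeros((x.size, len(t) - 1))
    for j in range(len(t) - 1):
        B[:, j] = (t[j] <= x) & (x < t[j + 1])
    # Raise the degree one step at a time.
    for d in range(1, degree + 1):
        B_new = np.zeros((x.size, len(t) - 1 - d))
        for j in range(len(t) - 1 - d):
            left = (x - t[j]) / (t[j + d] - t[j])
            right = (t[j + d + 1] - x) / (t[j + d + 1] - t[j + 1])
            B_new[:, j] = left * B[:, j] + right * B[:, j + 1]
        B = B_new
    return B

def fit_twice_penalized(x_small, y_small, x_grid, m_grid, lam1, lam2,
                        n_seg=10, degree=3, lo=0.0, hi=1.0):
    """Minimise ||y - B th||^2 + lam1 ||D th||^2 + lam2 ||Bg th - m_grid||^2.

    ASSUMPTION: the second penalty is a squared discrepancy from a marginal
    curve m_grid estimated on the larger cohort; the paper may define it differently.
    """
    h = (hi - lo) / n_seg
    # Knots extend beyond [lo, hi] so every basis function is fully defined there.
    knots = lo + h * np.arange(-degree, n_seg + degree + 1)
    B = bspline_basis(x_small, knots, degree)    # smaller cohort design matrix
    Bg = bspline_basis(x_grid, knots, degree)    # basis evaluated on the marginal grid
    D = np.diff(np.eye(B.shape[1]), n=2, axis=0) # second-order difference penalty
    lhs = B.T @ B + lam1 * D.T @ D + lam2 * Bg.T @ Bg
    rhs = B.T @ y_small + lam2 * Bg.T @ m_grid
    return np.linalg.solve(lhs, rhs), knots

# Simulated data (hypothetical): a small noisy cohort and a marginal curve
# standing in for an estimate obtained from the larger cohort.
rng = np.random.default_rng(0)
x_small = rng.uniform(0.0, 1.0, 30)
y_small = np.sin(2 * np.pi * x_small) + 0.3 * rng.normal(size=30)
x_grid = np.linspace(0.0, 1.0, 50)
m_grid = np.sin(2 * np.pi * x_grid)

theta_once, knots = fit_twice_penalized(x_small, y_small, x_grid, m_grid, lam1=1.0, lam2=0.0)
theta_twice, _ = fit_twice_penalized(x_small, y_small, x_grid, m_grid, lam1=1.0, lam2=10.0)

def marginal_gap(theta):
    """Root-mean-square gap between the fitted curve and the marginal curve."""
    fitted = bspline_basis(x_grid, knots) @ theta
    return float(np.sqrt(np.mean((fitted - m_grid) ** 2)))

print("once penalized gap: ", marginal_gap(theta_once))
print("twice penalized gap:", marginal_gap(theta_twice))
```

Setting lam2 to zero recovers an ordinary once penalized P-Spline, so the two fits can be compared directly. In practice both weights would be tuned, for example by cross-validation, and a binary response would need an iteratively reweighted version of the same linear solve.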

