Handling Overlapping Asymmetric Datasets -- A Twice Penalized P-Spline Approach

Overlapping asymmetric datasets are common in data science and raise the question of how they can be combined in a single predictive analysis. In healthcare, a small amount of information, such as an electronic health record, is often available for a large number of patients, while only a small number of patients have undergone extensive further testing. Common solutions such as missing-data imputation can be unwise when the smaller cohort differs significantly in scale from the larger sample. The aim of this research is therefore to develop a new method that models the smaller cohort against a particular response while also taking the larger cohort into account. Motivated by non-parametric models, and specifically by flexible smoothing techniques via generalized additive models, we develop a twice penalized P-Spline approximation method. The first penalty prevents over- or under-fitting of the smaller cohort; the second accounts for the larger cohort and is built from discrepancies in the marginal value of covariates that exist in both cohorts. Through data simulations, parameter tuning and model adaptations for continuous and binary responses, we find that our twice penalized approach offers an enhanced fit over a linear B-Spline and a once penalized P-Spline approximation. Applied to a real-life dataset on a person's risk of developing Non-Alcoholic Steatohepatitis, the method improves model fit performance by over 65%. Areas for future work include adapting the method so that it does not require dimensionality reduction and considering parametric modelling methods. To our knowledge, this is the first work to propose additional marginal penalties in a flexible regression, and we report a vastly improved model fit that accommodates asymmetric datasets without the need for missing-data imputation.
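The abstract does not spell out the exact form of the two penalties, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes a continuous response observed in both cohorts, one shared covariate `x1`, one extra covariate `x2` measured only in the smaller cohort, and a second penalty that shrinks the shared covariate's spline coefficients towards a marginal once penalized fit on the larger cohort. All function names, the tuning values `lam` and `mu`, and the simulated data are hypothetical.

```python
import numpy as np

def bspline_basis(x, n_basis=8, degree=3, lo=0.0, hi=1.0):
    """Equally spaced B-spline basis on [lo, hi] via the Cox-de Boor recursion."""
    n_seg = n_basis - degree
    h = (hi - lo) / n_seg
    knots = lo + h * np.arange(-degree, n_seg + degree + 1)
    x = np.clip(np.asarray(x, dtype=float), lo, hi - 1e-12)
    # Degree-0 basis: indicator of each half-open knot interval.
    B = ((x[:, None] >= knots[None, :-1]) & (x[:, None] < knots[None, 1:])).astype(float)
    for d in range(1, degree + 1):
        left = (x[:, None] - knots[None, :-d - 1]) / (knots[d:-1] - knots[:-d - 1])
        right = (knots[None, d + 1:] - x[:, None]) / (knots[d + 1:] - knots[1:-d])
        B = left * B[:, :-1] + right * B[:, 1:]
    return B

def diff_penalty(n_basis, order=2):
    """P-spline roughness penalty D'D built from finite differences of coefficients."""
    D = np.diff(np.eye(n_basis), n=order, axis=0)
    return D.T @ D

def fit_pspline(x, y, lam, n_basis=8):
    """Once penalized P-spline: least squares plus a difference penalty."""
    B = bspline_basis(x, n_basis)
    return np.linalg.solve(B.T @ B + lam * diff_penalty(n_basis), B.T @ y)

def fit_twice_penalized(x1, x2, y, a1_large, lam, mu, n_basis=8):
    """Additive fit y ~ f1(x1) + f2(x2) on the smaller cohort.

    Assumed objective (illustrative only):
        ||y - B1 a1 - B2 a2||^2 + lam (||D a1||^2 + ||D a2||^2) + mu ||a1 - a1_large||^2
    Needs lam > 0 and mu > 0 so that the linear system is non-singular.
    """
    X = np.hstack([bspline_basis(x1, n_basis), bspline_basis(x2, n_basis)])
    P = diff_penalty(n_basis)
    Z = np.zeros_like(P)
    smooth = np.block([[P, Z], [Z, P]])               # first penalty: roughness
    pull = np.block([[np.eye(n_basis), Z], [Z, Z]])   # second penalty: shared covariate only
    rhs = X.T @ y + mu * np.concatenate([a1_large, np.zeros(n_basis)])
    coef = np.linalg.solve(X.T @ X + lam * smooth + mu * pull, rhs)
    return coef[:n_basis], coef[n_basis:]

# Simulated (hypothetical) data: x2 affects the response everywhere but is
# only recorded for the smaller cohort.
rng = np.random.default_rng(0)
x1_L, x2_L = rng.uniform(size=500), rng.uniform(size=500)
y_L = np.sin(2 * np.pi * x1_L) + x2_L**2 + rng.normal(scale=0.3, size=500)
x1_S, x2_S = rng.uniform(size=40), rng.uniform(size=40)
y_S = np.sin(2 * np.pi * x1_S) + x2_S**2 + rng.normal(scale=0.3, size=40)

a1_L = fit_pspline(x1_L, y_L, lam=1.0)   # marginal fit on the larger cohort
a1, a2 = fit_twice_penalized(x1_S, x2_S, y_S, a1_L, lam=1.0, mu=5.0)
```

In this sketch `lam` controls smoothness and `mu` controls how strongly the smaller cohort's estimate for the shared covariate is tied to the larger cohort. As `mu` grows, `a1` approaches `a1_L`. A binary response would need a penalized likelihood fit instead of the closed-form solve used here.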
