Local variable selection aims to discover localized effects by assessing the impact of covariates on outcomes within specific regions defined by other covariates. We outline some challenges of local variable selection in the presence of non-linear relationships and model misspecification. Specifically, we highlight a potential drawback of common semi-parametric methods: even slight model misspecification can result in a high rate of false positives. To address these shortcomings, we propose a methodology based on orthogonal cut splines that achieves consistent local variable selection in high-dimensional scenarios. Our approach is simple, handles both continuous and discrete covariates, and is supported by theory covering high-dimensional covariates and model misspecification. We discuss settings with either independent or dependent data. Our proposal also allows including adjustment covariates that do not undergo selection, which adds flexibility when modeling complex scenarios. We illustrate its application in simulation studies with both independent and functional data, and with two real datasets. The first dataset evaluates salary gaps associated with discrimination factors at different ages; the second examines the effects of covariates on brain activation over time. The approach is implemented in the R package mombf.