In partially linear additive models, the response variable is modelled with a linear component on a subset of covariates and an additive component in which the remaining covariates enter the model as a sum of unknown univariate functions. This structure is more flexible than fully linear or fully nonparametric regression models, avoids the 'curse of dimensionality', is easily interpretable, and allows the user to include discrete or categorical variables in the linear part. On the other hand, in practice the user incorporates all available variables into the model regardless of how they impact the response variable. For this reason, variable selection plays an important role, since including covariates that have no impact on the response reduces the predictive capability of the model. As in other settings, outliers in the data may harm estimators based on strong assumptions, such as normality of the response variable, leading to conclusions that are not representative of the data set. In this work, we propose a family of robust estimators that simultaneously estimate the model and select variables from both the linear and the additive parts. This family applies an adaptive procedure to a general class of penalties in the regularization term of the objective function that defines the estimators. We study the behaviour of the proposal against its least-squares counterpart in a simulation study and show the advantages of its use on a real data set.
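
The model structure described above can be sketched as follows; the notation here is generic and not necessarily the paper's own:

```latex
% Partially linear additive model: p covariates X enter linearly,
% q covariates T_1, ..., T_q enter through unknown univariate functions.
Y = \beta_0 + \boldsymbol{\beta}^{\top} \mathbf{X}
    + \sum_{j=1}^{q} g_j(T_j) + \varepsilon,
% where X in R^p may include dummy-coded discrete or categorical
% variables, each g_j is an unknown smooth function, and epsilon
% is the error term.
```

Variable selection in this setting amounts to shrinking components of \(\boldsymbol{\beta}\) and entire functions \(g_j\) to zero, which is why the penalties act on both the linear and the additive parts.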