A Modelling Framework for Regression with Collinearity

From arXiv, v2: Notation and presentation are changed for better understanding; a section of simulation and empirical analyses is added (Sec. 5); the proofs of Lemmas and Propositions are moved to the Appendix.

This study addresses a fundamental, yet overlooked, gap between standard theory and empirical modelling practice in the OLS regression model $\boldsymbol{y}=\boldsymbol{X\beta}+\boldsymbol{u}$ with collinearity. In practice, an estimated model is expected to have stable and efficient "individual OLS estimates", but $\boldsymbol{y}$ itself has no capacity to identify or control the collinearity in $\boldsymbol{X}$. Hence no theory, including any model selection process (MSP), can fill this gap unless $\boldsymbol{X}$ is controlled in view of sampling theory. In this paper, we first introduce a new concept of "empirically effective modelling" (EEM) and then propose our EEM methodology (EEM-M) as an integrated process of two MSPs, given data $(\boldsymbol{y^o,X})$. The first MSP, called the XMSP, uses $\boldsymbol{X}$ only and pre-selects a class $\scr{D}$ of models whose individual OLS estimates are inefficiency-controlled and collinearity-controlled, where the two corresponding controlling variables are taken from the predictive standard error of each estimate. Next, an inefficiency-collinearity risk index is defined for each model, and a partial ordering is introduced on the set of models so that they can be compared without using $\boldsymbol{y^o}$; the betterness and admissibility of models are then discussed. The second MSP is a commonly used MSP that uses $(\boldsymbol{y^o,X})$ and evaluates overall model performance by criteria such as AIC and BIC to select an optimal model from $\scr{D}$. Finally, to implement the XMSP, two algorithms are proposed.
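The two-stage idea can be illustrated with a minimal sketch. The abstract does not state the paper's exact controlling variables, risk index, or algorithms, so the sketch below substitutes the textbook factorization of the standard error of an individual OLS estimate (with an intercept), $\mathrm{se}(\hat\beta_j)=\sigma\sqrt{\mathrm{VIF}_j}/\sqrt{\mathrm{SST}_j}$, in which both $\mathrm{VIF}_j$ and $\mathrm{SST}_j$ depend on $\boldsymbol{X}$ only. The function names and the thresholds `max_vif` and `max_scale` are hypothetical and chosen purely for illustration; the exhaustive subset search is likewise an assumption, not one of the paper's two algorithms.

```python
from itertools import combinations
import numpy as np

def x_only_diagnostics(X):
    """Per-column (VIF_j, 1/sqrt(SST_j)), computed from X alone (illustrative)."""
    Xc = X - X.mean(axis=0)                    # centre columns (intercept model)
    sst = (Xc ** 2).sum(axis=0)                # centred sum of squares per regressor
    inv_diag = np.diag(np.linalg.inv(Xc.T @ Xc))
    vif = inv_diag * sst                       # VIF_j = 1 / (1 - R_j^2)
    return vif, 1.0 / np.sqrt(sst)

def preselect(X, max_vif=5.0, max_scale=0.5):
    """Stage 1 (X only): keep subsets whose diagnostics stay under the
    hypothetical thresholds max_vif and max_scale."""
    p = X.shape[1]
    kept = []
    for k in range(1, p + 1):
        for cols in combinations(range(p), k):
            vif, scale = x_only_diagnostics(X[:, list(cols)])
            if vif.max() <= max_vif and scale.max() <= max_scale:
                kept.append(cols)
    return kept

def aic(y, X, cols):
    """Stage 2 criterion: Gaussian AIC (up to a constant) of the OLS fit."""
    n = len(y)
    Z = np.column_stack([np.ones(n), X[:, list(cols)]])
    resid = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return n * np.log(resid @ resid / n) + 2 * (len(cols) + 1)

# Synthetic data: columns 0 and 1 are nearly collinear, column 2 is independent.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1.0 + 2.0 * x1 - x3 + rng.normal(size=n)

kept = preselect(X)                              # uses X only
best = min(kept, key=lambda c: aic(y, X, c))     # uses (y, X)
print("pre-selected subsets:", kept)
print("selected by AIC:", best)
```

In this sketch, any subset containing both nearly collinear columns is excluded before $\boldsymbol{y}$ is ever consulted, and the information criterion then chooses only among the remaining subsets.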
