Missing data often cause bias and loss of efficiency. These problems become substantial when the response mechanism is nonignorable, that is, when the response model depends on unobserved variables. Handling nonignorable nonresponse requires estimating the joint distribution of the unobserved variables and the response indicators. However, model misspecification and identification issues prevent robust estimation even when this joint distribution is estimated carefully. In this study, we modelled the distribution of the observed parts and derived sufficient conditions for model identifiability, assuming a logistic regression model for the response mechanism and generalised linear models for the main outcome model of interest. More importantly, the derived sufficient conditions are testable with the observed data and do not require any instrumental variables, which are often assumed in order to guarantee model identifiability but cannot be determined beforehand in practice. To analyse missing data, we propose a new imputation method that incorporates verifiable identifiability using only observed data. Furthermore, we present the performance of the proposed estimators in numerical studies and apply the proposed method to two real data sets: exit polls for the 19th South Korean election and public data collected from the Korean Survey of Household Finances and Living Conditions.