Missing data often result in undesirable bias and loss of efficiency. These problems become substantial when the response mechanism is nonignorable, that is, when the response model depends on unobserved variables. Handling nonignorable nonresponse requires estimating the joint distribution of the unobserved variables and the response indicators. However, model misspecification and identification issues prevent robust estimation even when the target joint distribution is estimated carefully. In this study, we modelled the distribution of the observed parts and derived sufficient conditions for model identifiability, assuming a logistic regression model for the response mechanism and generalised linear models for the main outcome model of interest. More importantly, the derived sufficient conditions are testable with the observed data and do not require any instrumental variables, which are often assumed in order to guarantee model identifiability but cannot be practically determined beforehand. To analyse missing data, we propose a new imputation method that incorporates identifiability conditions verifiable using only the observed data. Furthermore, we evaluate the performance of the proposed estimators in numerical studies and apply the proposed method to two real data sets: exit-poll data from the 19th South Korean election and public data from the Korean Survey of Household Finances and Living Conditions.