Integrating probability and non-probability samples is increasingly important, yet unknown sampling mechanisms in non-probability sources complicate identification and efficient estimation. We develop semiparametric theory for dual-frame data integration and propose two complementary estimators. The first models the non-probability inclusion probability parametrically and attains the semiparametric efficiency bound. We introduce an identifiability condition based on strong monotonicity that identifies sampling-model parameters without instrumental variables, even under informative (non-ignorable) selection, using auxiliary information from the probability sample; it remains valid without record linkage between samples. The second estimator, motivated by a two-stage sampling approximation, avoids explicit modeling of the non-probability mechanism; though not fully efficient, it is efficient within a restricted augmentation class and is robust to misspecification. Simulations and an application to the Culture and Community in a Time of Crisis public simulation dataset show efficiency gains under correct specification and stable performance under misspecification and weak identification. Methods are implemented in the R package \texttt{dfSEDI}.