Household- and individual-level sociodemographic data are essential for understanding human-infrastructure interaction and for policymaking. However, the Public Use Microdata Sample (PUMS) offers only a sample at the state level, while census tract data provide only the marginal distributions of variables, without the correlations among them. Therefore, we need an accurate synthetic population dataset that maintains variable correlations consistent with those observed in microdata, preserves household-individual and individual-individual relationships, adheres to state-level statistics, and accurately represents the geographic distribution of the population. We propose a deep generative framework that leverages the variational autoencoder (VAE) to generate a synthetic population with these features. The methodological contributions are (1) a new data structure for capturing household-individual and individual-individual relationships, (2) a transfer learning process with pre-training and fine-tuning steps that generates households and individuals whose aggregated distributions align with the census tract marginal distributions, and (3) a decoupled binary cross-entropy (D-BCE) loss function that enables distribution shift and the generation of out-of-sample records. Compared with existing methods, results for an application in Delaware, USA, demonstrate that the framework ensures the realism of generated household-individual records and accurately describes population statistics at the census tract level. Furthermore, testing in North Carolina, USA, yielded promising results, supporting the transferability of our method.
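To illustrate why marginal distributions alone are insufficient, the following minimal sketch builds two joint distributions over a pair of binary household attributes that share identical marginals but differ in correlation. The attribute meanings, the probability values, and the helper names `marginals` and `correlation` are hypothetical and chosen only for illustration; they are not taken from PUMS, census tract data, or the proposed framework.

```python
def marginals(joint):
    """Return (row_marginal, column_marginal) of a 2x2 joint probability table."""
    rows = [sum(r) for r in joint]
    cols = [sum(c) for c in zip(*joint)]
    return rows, cols


def correlation(joint):
    """Pearson correlation of two binary variables coded 0/1.

    joint[x][y] is P(X = x, Y = y).
    """
    p_x = joint[1][0] + joint[1][1]  # P(X = 1)
    p_y = joint[0][1] + joint[1][1]  # P(Y = 1)
    cov = joint[1][1] - p_x * p_y    # Cov(X, Y) = P(X=1, Y=1) - P(X=1) P(Y=1)
    return cov / ((p_x * (1 - p_x) * p_y * (1 - p_y)) ** 0.5)


# Hypothetical attributes: X = household owns a vehicle, Y = household has 3+ members.
independent = [[0.25, 0.25],
               [0.25, 0.25]]
correlated = [[0.40, 0.10],
              [0.10, 0.40]]

if __name__ == "__main__":
    for name, table in [("independent", independent), ("correlated", correlated)]:
        rows, cols = marginals(table)
        print(name, "marginals:", rows, cols,
              "correlation:", round(correlation(table), 3))
```

Both tables yield marginals of 0.5 for each attribute value, yet their correlations are 0.0 and 0.6 respectively. A tract-level table that reports only the marginals therefore cannot distinguish between the two, which is why the joint structure must come from microdata.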