Population censuses are vital to public policy decision-making. They provide insight into human resources, demography, culture, and economic structure at local, regional, and national levels. However, such surveys are very expensive (especially for low- and middle-income countries with large populations, such as India) and time-consuming, and they may raise privacy concerns, depending on the kinds of data collected. In light of these issues, we introduce SynthPop++, a novel hybrid framework that combines data from multiple real-world surveys (with different, partially overlapping sets of attributes) to produce a real-scale synthetic population of humans. Critically, our population maintains family structures comprising individuals with demographic, socioeconomic, health, and geolocation attributes: this means that our ``fake'' people live in realistic locations, have realistic families, etc. Such data can be used for a variety of purposes; we explore one such use case, agent-based modelling of infectious disease in India. To gauge the quality of our synthetic population, we use both machine learning and statistical metrics. Our experimental results show that our synthetic population can realistically simulate the population of various administrative units of India, producing real-scale, detailed data at the desired level of zoom -- from cities, to districts, to states, eventually combining to form a country-scale synthetic population.