Open data is frequently released in spatially and temporally aggregated form, usually to comply with privacy policies. Varying aggregation levels (e.g., zip code, census tract, city block) complicate the integration across variables needed to provide multivariate training sets for downstream AI/ML systems. In this work, we consider models to disaggregate spatial data, learning a function from a low-resolution irregular partition (e.g., zip code) to a high-resolution irregular partition (e.g., city block). We propose a hierarchical architecture that aligns each geographic aggregation level with a layer in the network, so that all aggregation levels can be learned simultaneously by including loss terms for all intermediate levels as well as the final output. We then consider additional loss terms that compare the re-aggregated output against ground truth to further improve performance. To balance the tradeoff between training time and accuracy, we consider three training regimes, including a layer-by-layer process that achieves competitive predictions with significantly reduced training time. For situations where historical training data is limited, we study transfer learning scenarios and show that a model pre-trained on one city variable can be fine-tuned for another city variable using only a few hundred samples, highlighting the common dynamics among variables from the same built environment and underlying population. Evaluating these techniques on four datasets across two cities, three variables, and two application domains, we find that geographically coherent architectures provide a significant improvement over baseline models as well as typical heuristic methods, advancing our long-term goal of synthesizing any variable, at any location, at any resolution.
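To make the loss structure described above more concrete, the following is a minimal sketch of how per-level loss terms and a re-aggregation term might be combined. It is an illustration under stated assumptions, not the authors' implementation: the use of mean squared error, the `reagg_weight` parameter, and the names `aggregate`, `mse`, and `hierarchical_loss` are all hypothetical, and the network that would produce the predictions is omitted.

```python
# Illustrative sketch only: MSE, reagg_weight, and all function names are
# assumptions made for this example, not details taken from the paper.

def aggregate(values, parent_of, n_parents):
    """Sum fine-level values into their parent regions."""
    totals = [0.0] * n_parents
    for value, parent in zip(values, parent_of):
        totals[parent] += value
    return totals


def mse(pred, true):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)


def hierarchical_loss(preds, truths, parent_maps, reagg_weight=1.0):
    """Combine per-level losses with re-aggregation losses.

    preds, truths: lists of per-level value lists, ordered coarse to fine.
    parent_maps[k][i]: index at level k-1 of the parent of region i at
    level k (parent_maps[0] is unused and may be None).
    """
    total = 0.0
    for k in range(len(preds)):
        # Loss term for this level's own output (intermediate or final).
        total += mse(preds[k], truths[k])
        if k >= 1:
            # Re-aggregate the finer prediction and compare it against
            # the ground truth at the next coarser level.
            reagg = aggregate(preds[k], parent_maps[k], len(truths[k - 1]))
            total += reagg_weight * mse(reagg, truths[k - 1])
    return total


# Toy example: two tracts, each containing two blocks.
block_to_tract = [0, 0, 1, 1]
truths = [[10.0, 6.0], [4.0, 6.0, 1.0, 5.0]]
preds = [[9.0, 7.0], [5.0, 6.0, 1.0, 4.0]]
loss = hierarchical_loss(preds, truths, [None, block_to_tract], reagg_weight=0.5)
```

In this toy case the block-level predictions sum to `[11.0, 5.0]` per tract, so the re-aggregation term penalizes the mismatch with the true tract totals in addition to the errors at each individual level. In a real model the same terms would be computed on differentiable tensors so that gradients flow to every layer.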