Demographic data, such as income, education level, and employment rate, contain valuable information about urban regions, yet few studies have integrated demographic information into the generation of region embeddings. In this study, we show how simple and easy-to-access demographic data can improve the quality of state-of-the-art region embeddings and provide better predictive performance on three common urban tasks, namely check-in prediction, crime rate prediction, and house price prediction. We find that existing pre-training methods based on KL divergence are potentially biased towards mobility information and propose using Jensen-Shannon divergence as a more appropriate loss function for multi-view representation learning. Experimental results from both New York and Chicago show that mobility + income is the best pre-training data combination, providing up to 10.22% better predictive performance than existing models. Considering that mobility big data can be hard to access in many developing cities, we suggest geographic proximity + income as a simple but effective data combination for region embedding pre-training.
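To make the contrast between the two divergences concrete, the sketch below computes the textbook KL and Jensen-Shannon divergences for two discrete distributions. It is an illustration only: the abstract does not give the exact loss used in the study, the function names `kl_divergence` and `js_divergence` are hypothetical, and the two example distributions are invented rather than taken from the New York or Chicago data.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Invented similarity distributions of one region over three other regions,
# one per data view (illustrative values, not results from the study).
mobility_view = [0.7, 0.2, 0.1]
income_view = [0.1, 0.3, 0.6]

# KL depends on which view is treated as the reference distribution.
kl_forward = kl_divergence(mobility_view, income_view)
kl_backward = kl_divergence(income_view, mobility_view)

# JS gives the same value in both directions and is bounded by log(2).
js_forward = js_divergence(mobility_view, income_view)
js_backward = js_divergence(income_view, mobility_view)

print(f"KL(mobility || income) = {kl_forward:.4f}")
print(f"KL(income || mobility) = {kl_backward:.4f}")
print(f"JS(mobility, income)   = {js_forward:.4f}")
print(f"JS(income, mobility)   = {js_backward:.4f}")
```

KL divergence changes when its two arguments are swapped, so a KL-based loss has to privilege one view as the reference. Jensen-Shannon divergence is symmetric and, with natural logarithms, never exceeds log 2. Whether this asymmetry is the source of the mobility bias reported above is not something the abstract states.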