Balancing Privacy and Utility of Spatio-Temporal Data for Taxi-Demand Prediction

Taxi-demand prediction is an important application of machine learning that enables taxi service providers to optimize their operations and city planners to improve transportation infrastructure and services. However, the use of sensitive data in these systems raises privacy and security concerns. In this paper, we propose using federated learning for taxi-demand prediction, which allows multiple parties to train a machine learning model on their own data while keeping that data private and secure. This enables organizations to build models on data they would otherwise be unable to access. Despite its potential benefits, federated learning for taxi-demand prediction poses several technical challenges, such as class imbalance, data scarcity among some parties, and the need for the model to generalize across diverse providers and geographic regions. To address these challenges, we propose a system that uses a region-independent encoding of geographic latitude-longitude coordinates. As a result, the proposed model is not tied to a specific region and can perform well in any area. Furthermore, we employ cost-sensitive learning and various regularization techniques to mitigate issues related to data scarcity and overfitting, respectively. An evaluation on real-world data collected from 16 taxi service providers in Japan over a period of six months showed that the proposed system predicted demand levels to within 1% error of a single model trained on the integrated data. The system also effectively defended against membership inference attacks on passenger data.
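
To make the federated setting concrete, the following is a minimal sketch of the server-side aggregation step, assuming a FedAvg-style weighted average of client parameters. The abstract does not state which aggregation rule the system uses, so the rule itself, the function name `federated_average`, and the dictionary-of-lists parameter layout are all illustrative assumptions. Only model parameters reach the server; each provider's raw trip records stay local.

```python
from typing import Dict, List

# Hypothetical parameter layout: layer name -> flat list of floats.
Weights = Dict[str, List[float]]


def federated_average(client_weights: List[Weights],
                      client_sizes: List[int]) -> Weights:
    """Average client parameters, weighted by local dataset size.

    Assumption: FedAvg-style aggregation; not confirmed by the source.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one dataset size per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")

    averaged: Weights = {}
    for name, reference in client_weights[0].items():
        # Each coordinate is the size-weighted mean across all clients.
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(len(reference))
        ]
    return averaged


if __name__ == "__main__":
    # Two hypothetical providers; the second holds three times as much data.
    small = {"dense": [0.0, 4.0]}
    large = {"dense": [4.0, 0.0]}
    print(federated_average([small, large], [1, 3]))
```

Weighting by dataset size means providers with scarce data contribute proportionally less to the shared model, which is one reason data scarcity among some parties needs separate handling.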
