This paper considers the problem of privately releasing sample means of speed values from traffic datasets. Our key contribution is the development of user-level differentially private algorithms whose parameter values are carefully chosen to achieve low estimation errors on real-world datasets while preserving privacy. We test our algorithms on ITMS (Intelligent Traffic Management System) data from an Indian city, where the speeds of different buses are drawn in a potentially non-i.i.d. manner from an unknown distribution, and where different buses may contribute different numbers of speed samples. We then apply our algorithms to large synthetic datasets generated from the ITMS data. Here, we provide theoretical justification for the observed performance trends, along with recommendations for the choices of algorithm subroutines that result in low estimation errors. Finally, we characterize the best performance of pseudo-user creation-based algorithms on worst-case datasets via a minimax approach; this gives rise to a novel procedure for the creation of pseudo-users, which optimizes the worst-case total estimation error. The algorithms discussed in the paper are readily applicable to general spatio-temporal IoT datasets for releasing a differentially private mean of a desired value.
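To make the setting concrete, the following is a minimal Python sketch of one generic way to release a user-level differentially private sample mean. It is not the paper's algorithm and does not implement pseudo-user creation: it simply truncates each user's contribution to a fixed number of samples and adds Laplace noise. The function name `private_mean` and the parameters `upper` (an assumed known bound on speeds), `max_per_user`, and `epsilon` are hypothetical. The sketch also assumes that the number of samples per user is public, so that neighboring datasets differ only in the sample values of a single user.

```python
import numpy as np


def private_mean(samples_per_user, upper, max_per_user, epsilon, rng=None):
    """Illustrative user-level DP mean (assumption-laden sketch, not the paper's method).

    samples_per_user: list of per-user sequences of speed values.
    upper: assumed known upper bound on speeds; values are clipped to [0, upper].
    max_per_user: keep at most this many samples from each user.
    epsilon: privacy budget for the Laplace mechanism.
    Returns (noisy_mean, sensitivity).
    """
    rng = np.random.default_rng() if rng is None else rng

    # Truncate each user's contribution and clip values to the assumed range.
    kept = [
        np.clip(np.asarray(s, dtype=float)[:max_per_user], 0.0, upper)
        for s in samples_per_user
    ]
    counts = np.array([len(k) for k in kept])
    total = counts.sum()
    if total == 0:
        raise ValueError("no samples to average")

    clipped_mean = sum(k.sum() for k in kept) / total

    # Changing all retained samples of one user moves the mean by at most
    # upper * (largest retained count) / (total retained count).
    sensitivity = upper * counts.max() / total

    # Laplace mechanism: noise scale = sensitivity / epsilon.
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped_mean + noise, sensitivity


if __name__ == "__main__":
    # Toy data: three buses contributing different numbers of speed samples.
    data = [[10, 20, 30, 40], [50], [60, 70]]
    estimate, sens = private_mean(data, upper=65.0, max_per_user=2, epsilon=1.0)
    print(f"sensitivity = {sens}, noisy mean = {estimate:.2f}")
```

In this toy call, truncation keeps 2, 1, and 2 samples from the three buses, so the user-level sensitivity is 65 × 2 / 5 = 26. A smaller `max_per_user` lowers the sensitivity, and hence the noise, but discards more data; this is the kind of trade-off that motivates choosing such parameters carefully.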