This paper considers the private release of sample means of speed values from traffic datasets. Our key contribution is the development of user-level differentially private algorithms whose parameter values are carefully chosen to keep estimation errors low on real-world datasets while preserving privacy. We test our algorithms on ITMS (Intelligent Traffic Management System) data from an Indian city, where the speeds of different buses are drawn in a potentially non-i.i.d. manner from an unknown distribution, and where the number of speed samples contributed may differ from bus to bus. We then apply our algorithms to large synthetic datasets generated from the ITMS data. For these datasets, we provide theoretical justification for the observed performance trends and recommend choices of algorithm subroutines that result in low estimation errors. Finally, we characterize the best performance of algorithms based on pseudo-user creation on worst-case datasets via a minimax approach; this gives rise to a novel procedure for creating pseudo-users that optimizes the worst-case total estimation error.
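To make the setting concrete, the following is a minimal sketch of a user-level differentially private release of a sample mean. It is not the paper's algorithm. It assumes that speeds lie in a known range `[0, upper]`, that the per-bus sample counts are public, and that two datasets are neighbours when one bus's samples are replaced. The names `private_mean`, `upper`, `eps`, and `max_per_user` are hypothetical, introduced only for illustration. Each bus contributes at most `max_per_user` samples, and Laplace noise is scaled to the resulting user-level sensitivity.

```python
import numpy as np


def private_mean(samples_per_user, upper, eps, max_per_user, rng=None):
    """Illustrative user-level DP mean (assumptions: values in [0, upper],
    per-user sample counts are public, neighbours differ in one user's values).

    Returns the noisy mean and the user-level sensitivity that was used.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Keep at most max_per_user samples per user and clip values to [0, upper].
    kept = [
        np.clip(np.asarray(s, dtype=float)[:max_per_user], 0.0, upper)
        for s in samples_per_user
    ]
    counts = np.array([len(k) for k in kept])
    total = counts.sum()
    if total == 0:
        raise ValueError("no samples to average")

    mean = sum(k.sum() for k in kept) / total

    # Replacing one user's retained samples moves the sum by at most
    # upper * (that user's count), hence the mean by at most this amount.
    sensitivity = upper * counts.max() / total

    # Laplace mechanism: noise scale = sensitivity / eps.
    noisy = mean + rng.laplace(0.0, sensitivity / eps)
    return noisy, sensitivity


if __name__ == "__main__":
    # Toy data: three buses with different numbers of speed samples.
    buses = [[10.0, 20.0, 30.0], [40.0], [50.0, 60.0]]
    estimate, sens = private_mean(buses, upper=65.0, eps=1.0, max_per_user=2)
    print(f"noisy mean: {estimate:.2f}, sensitivity: {sens:.2f}")
```

In this sketch, lowering `max_per_user` reduces the sensitivity and therefore the noise, but it discards samples and so can move the clipped mean away from the true mean. Balancing these two sources of error is the kind of parameter choice the abstract refers to.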