Mobile Device Location Data (MDLD) is widely used in various fields. Yet its large-scale application is limited because the data from any individual vendor has biased or insufficient spatial coverage. One way to improve coverage is to integrate data from multiple vendors into a more representative dataset. Such integration requires further treatment of the multi-sourced dataset for two reasons. First, a person may carry more than one device, producing duplicate observations of the same data subject. Second, when multiple sources are combined, the same device may be captured by more than one data provider. This paper proposes a data integration methodology for multi-sourced data and investigates whether data from several sources can be integrated without introducing additional biases. Duplicate devices are identified by leveraging the uniqueness of each device's travel pattern. The proposed methodology is shown to be cost-effective while achieving the desired accuracy level. Our findings suggest that devices sharing the same imputed home location and the same top five most-visited locations during a month can be taken to represent the same user in the MDLD. More than 99.6% of the sample devices sharing these attributes are observed at the same location simultaneously. Finally, the proposed algorithm has been successfully applied to the national-level MDLD of 2020 to produce the national passenger origin-destination data for the NextGeneration National Household Travel Survey (NextGen NHTS) program.
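The matching rule stated in the abstract (same imputed home location and same top five most-visited locations within a month) can be illustrated with a minimal Python sketch. It is not the paper's implementation. The record layout, the function names `top_locations` and `group_duplicate_devices`, the use of opaque location IDs, the alphabetical tie-breaking, and the choice to skip devices with fewer than five distinct locations are all assumptions made for illustration. The simultaneous co-location check used for validation is not reproduced here.

```python
from collections import Counter, defaultdict


def top_locations(visits, k=5):
    """Return the k most-visited location IDs as a frozenset.

    Ties are broken alphabetically by location ID (an assumption of this
    sketch) so that the result does not depend on the order of the visits.
    """
    counts = Counter(visits)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return frozenset(loc for loc, _ in ranked[:k])


def group_duplicate_devices(devices, k=5):
    """Group device IDs that share a home location and top-k locations.

    `devices` maps a device ID to a hypothetical record of the form
    {"home": location_id, "visits": [location_id, ...]} covering one month.
    Devices with fewer than k distinct locations are skipped (an assumption,
    intended to avoid matching on very sparse travel patterns).
    """
    groups = defaultdict(list)
    for device_id, record in devices.items():
        top = top_locations(record["visits"], k)
        if len(top) < k:
            continue
        groups[(record["home"], top)].append(device_id)
    # Only groups with more than one device are candidate duplicates.
    return sorted(sorted(ids) for ids in groups.values() if len(ids) > 1)


# Toy data: location IDs are arbitrary labels, not real coordinates.
devices = {
    "vendorA-1": {
        "home": "h1",
        "visits": ["a"] * 6 + ["b"] * 5 + ["c"] * 4 + ["d"] * 3 + ["e"] * 2 + ["f"],
    },
    "vendorB-7": {
        "home": "h1",
        "visits": ["a"] * 3 + ["b"] * 3 + ["c"] * 2 + ["d"] * 2 + ["e"] * 2 + ["g"],
    },
    "vendorB-8": {  # same top five locations, different home
        "home": "h2",
        "visits": ["a"] * 6 + ["b"] * 5 + ["c"] * 4 + ["d"] * 3 + ["e"] * 2,
    },
    "vendorA-2": {  # same home, too few distinct locations
        "home": "h1",
        "visits": ["a"] * 4 + ["b"] * 2,
    },
}

groups = group_duplicate_devices(devices)
print(groups)
```

In this toy example only `vendorA-1` and `vendorB-7` are grouped, since they agree on both the home location and the set of five most-visited locations. In a real pipeline one device per group would presumably be retained, but the retention rule is not specified in the abstract.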