Recently, Foursquare released a global dataset with more than 100 million points of interest (POIs), each representing a real-world business on its platform. However, many entries lack complete metadata such as addresses or categories, and some correspond to non-existent or fictional locations. In contrast, OpenStreetMap (OSM) offers a rich, user-contributed POI dataset with detailed and frequently updated metadata, though it does not formally verify whether a POI represents an actual business. In this data paper, we present a methodology that integrates the strengths of both datasets: Foursquare as a comprehensive baseline of commercial POIs and OSM as a source of enriched metadata. The combined dataset totals approximately 1 TB. While this full version is not publicly released, we provide filtered releases with adjustable thresholds that reduce storage needs and make the data practical to download and use across domains. We also provide step-by-step instructions to reproduce the full 631 GB build. Record linkage is achieved by computing name similarity scores and spatial distances between Foursquare and OSM POIs. These measures identify and retain high-confidence matches that correspond to real businesses in Foursquare, have representations in OSM, and show strong name similarity. Finally, we use this filtered dataset to construct a graph-based representation of POIs enriched with attributes from both sources, enabling advanced spatial analyses and a range of downstream applications.
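The record-linkage step described above (name similarity plus spatial distance, with thresholds that keep only high-confidence matches) can be illustrated with a minimal sketch. This is not the paper's implementation: the record fields (`id`, `name`, `lat`, `lon`), the use of `difflib.SequenceMatcher` as the similarity measure, the haversine distance, and the default thresholds `max_dist_m=100.0` and `min_sim=0.8` are all assumptions chosen for illustration.

```python
import math
from difflib import SequenceMatcher

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def name_similarity(a, b):
    """Case-insensitive similarity in [0, 1] (assumed measure, not the paper's)."""
    return SequenceMatcher(None, a.casefold().strip(), b.casefold().strip()).ratio()


def link_pois(fsq_pois, osm_pois, max_dist_m=100.0, min_sim=0.8):
    """For each Foursquare POI, keep the most name-similar OSM POI that lies
    within max_dist_m and scores at least min_sim. Thresholds are hypothetical.

    Returns a list of (fsq_id, osm_id, similarity, distance_m) tuples.
    """
    matches = []
    for f in fsq_pois:
        best = None
        for o in osm_pois:
            dist = haversine_m(f["lat"], f["lon"], o["lat"], o["lon"])
            if dist > max_dist_m:
                continue  # too far apart to be the same place
            sim = name_similarity(f["name"], o["name"])
            if sim >= min_sim and (best is None or sim > best[2]):
                best = (f["id"], o["id"], sim, dist)
        if best is not None:
            matches.append(best)
    return matches
```

The nested loop compares every pair and is only practical for small samples. At the scale of 100 million POIs, candidate pairs would need to come from a spatial index or grid bucketing before any name comparison. Raising `min_sim` or lowering `max_dist_m` trades coverage for confidence, which mirrors the adjustable thresholds of the filtered releases.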