The exponential growth of geospatial data streams from IoT devices challenges conventional cloud-based analytics, which typically suffers from wasted network bandwidth and high latency because all data management, including sampling, is centralized in the cloud. To address this gap, we propose EdgeApproxGeo, a novel edge-cloud architecture that performs spatially stratified online sampling on network edge devices close to the data sources. The system introduces EdgeSOS, a decentralized, geohash-based stratified sampling algorithm designed to run independently on resource-constrained edge nodes without cross-node synchronization. EdgeSOS is coupled with spatially aware data distribution and topic routing at the Apache Kafka ingestion layer, with the aim of optimizing downstream stream-processing analytics. We evaluate the system on two real-world geo-referenced datasets, covering mobility and air quality. EdgeApproxGeo achieves a significant speedup over cloud-only baselines while keeping errors in check (e.g., MAPE below 10% at an 80% sampling rate). We further show that a coarser geohash granularity (e.g., Geohash-5) can reduce error by 30% compared with a finer one (i.e., Geohash-6), revealing a tunable accuracy-efficiency trade-off. Our standards-compliant prototype, built atop Apache Kafka and Apache Spark, further validates the utility of edge-deployed approximate query processing for real-time analytics over big geospatial data.
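To make the idea of geohash-based stratified sampling concrete, the following is a minimal illustrative sketch and not the EdgeSOS algorithm itself, whose details the abstract does not give. It assumes each record is a dictionary with hypothetical `lat` and `lon` fields, that sampling is applied to one batch of records at a time on a single edge node, and that every geohash cell contributes the same fraction of its records. The function names `geohash_encode` and `stratified_sample` are chosen here for illustration only.

```python
import random
from collections import defaultdict

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a geohash string of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits map to one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def stratified_sample(records, fraction, precision=6, rng=None):
    """Sample roughly `fraction` of the records from each geohash cell (stratum).

    Assumption: `records` is one batch held in memory on a single edge node,
    so no coordination with other nodes is needed.
    """
    rng = rng or random.Random()
    strata = defaultdict(list)
    for rec in records:
        cell = geohash_encode(rec["lat"], rec["lon"], precision)
        strata[cell].append(rec)

    sample = []
    for items in strata.values():
        # Keep at least one record per non-empty cell so no stratum vanishes.
        k = min(len(items), max(1, round(fraction * len(items))))
        sample.extend(rng.sample(items, k))
    return sample


if __name__ == "__main__":
    batch = (
        [{"lat": 57.64911, "lon": 10.40744, "value": i} for i in range(10)]
        + [{"lat": 42.605, "lon": -5.603, "value": i} for i in range(5)]
    )
    kept = stratified_sample(batch, fraction=0.8, precision=6, rng=random.Random(0))
    print(len(kept), "of", len(batch), "records kept")
```

Because a geohash of length 5 is a prefix of the length-6 geohash for the same point, lowering `precision` merges neighbouring cells into larger strata. That is the granularity knob the abstract refers to, although how it affects error in practice depends on the data and the query.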