Massive data streams from IoT and cyber-physical systems must be processed under strict bandwidth, latency, and resource constraints. Generalized Deduplication (GD) is a promising lossless compression framework, as it supports random access and direct analytics on compressed data. However, existing GD algorithms exhibit quadratic complexity $\mathcal{O}(nd^{2})$, which limits their scalability for high-dimensional datasets. This paper proposes \textbf{EntroGD}, an entropy-guided GD framework that decouples analytical fidelity from compression efficiency to achieve linear complexity $\mathcal{O}(nd)$. EntroGD adopts a two-stage design, first constructing compact condensed samples to preserve information critical for analytics, and then applying entropy-based bit selection to maximize compression. Experiments on 18 IoT datasets show that EntroGD reduces configuration time by up to $53.5\times$ compared to state-of-the-art GD compressors. Moreover, by enabling analytics with access to only $2.6\%$ of the original data volume, EntroGD accelerates clustering by up to $31.6\times$ with negligible loss in accuracy. Overall, EntroGD provides a scalable and system-efficient solution for direct analytics on compressed IoT data.
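To make the idea of entropy-based bit selection in a GD setting more concrete, the following is a minimal sketch under stated assumptions; it is not the EntroGD algorithm itself, whose details the abstract does not give. It assumes each sample is a fixed-width unsigned integer, computes the Shannon entropy of every bit position, places low-entropy bits in a deduplicated base, and keeps the remaining bits as a per-sample deviation. The function names (`bit_entropies`, `split_mask`, `gd_compress`, `gd_decompress`), the threshold value, and the toy data are all hypothetical.

```python
import math


def bit_entropies(samples, width):
    """Shannon entropy (in bits) of each bit position across all samples."""
    n = len(samples)
    entropies = []
    for pos in range(width):
        ones = sum((s >> pos) & 1 for s in samples)
        p = ones / n
        if p == 0.0 or p == 1.0:
            entropies.append(0.0)  # constant bit carries no information
        else:
            entropies.append(-p * math.log2(p) - (1 - p) * math.log2(1 - p))
    return entropies


def split_mask(entropies, threshold):
    """Mask of bit positions assigned to the base (entropy <= threshold)."""
    mask = 0
    for pos, h in enumerate(entropies):
        if h <= threshold:
            mask |= 1 << pos
    return mask


def gd_compress(samples, base_mask, width):
    """Split each sample into (base id, deviation); store each base once."""
    dev_mask = ((1 << width) - 1) & ~base_mask
    base_ids = {}   # base value -> index into `bases`
    bases = []      # deduplicated base values
    encoded = []    # one (base id, deviation bits) pair per sample
    for s in samples:
        base = s & base_mask
        if base not in base_ids:
            base_ids[base] = len(bases)
            bases.append(base)
        encoded.append((base_ids[base], s & dev_mask))
    return bases, encoded


def gd_decompress(bases, encoded):
    """Lossless reconstruction: OR each base with its deviation."""
    return [bases[i] | dev for i, dev in encoded]


# Toy 8-bit readings (assumed data): the upper six bits never change.
samples = [0b10110000, 0b10110001, 0b10110010,
           0b10110011, 0b10110001, 0b10110010]
WIDTH = 8
THRESHOLD = 0.5  # hypothetical cutoff, chosen only for illustration

mask = split_mask(bit_entropies(samples, WIDTH), THRESHOLD)
bases, encoded = gd_compress(samples, mask, WIDTH)
restored = gd_decompress(bases, encoded)
```

In this sketch a single sample can be recovered without touching the others, via `bases[encoded[k][0]] | encoded[k][1]`, which mirrors the random-access property attributed to GD above. The deviations are kept as masked integers for readability; a real compressor would pack them into fewer bits.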