Partitioning data so as to maximize or minimize the Shannon entropy, or more generally the R\'enyi entropy, is a crucial subroutine in data compression, columnar storage, and cardinality estimation algorithms. These partitioning algorithms can be accelerated by a data structure that computes the entropy of different subsets of the data whenever the algorithm must decide which block to construct. Such a data structure is also useful to data analysts who explore different subsets of the data to identify areas of interest. While it is well known how to compute the Shannon or R\'enyi entropy of a discrete distribution efficiently in the offline or streaming setting, we focus on the query setting, where the goal is to efficiently compute the entropy of the subset of the data that satisfies some linear predicates. We solve this problem in a setting typical of real data, where data items are geometric points and each requested area is a query (hyper)rectangle. More specifically, we consider a set $P$ of $n$ weighted and colored points in $\mathbb{R}^d$. For the range S-entropy (resp. R-entropy) query problem, the goal is to construct a low-space data structure that, given a query (hyper)rectangle $R$, computes in sublinear time the Shannon (resp. R\'enyi) entropy determined by the colors and weights of the points in $P\cap R$. We show conditional lower bounds proving that we cannot hope for data structures with near-linear space and near-constant query time for either the range S-entropy or the R-entropy query problem. Then, we propose exact data structures for $d=1$ and $d>1$ with $o(n^{2d})$ space and $o(n)$ query time for both problems. Finally, we propose near-linear-space data structures that return either an additive or a multiplicative approximation of the Shannon (resp. R\'enyi) entropy in $P\cap R$.
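To make the query concrete, the following is a minimal brute-force sketch of the quantity a range S-entropy or R-entropy query returns. It is not one of the proposed data structures: it scans all $n$ points per query, which is the linear-time baseline that a sublinear-time structure would improve on. It assumes that the distribution over colors is obtained by summing the weights of the points of each color inside $R$ and normalizing by the total weight in $R$, that logarithms are taken base 2, and that an empty range has entropy 0. The function names and the `(coords, color, weight)` point layout are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def color_distribution(points, rect):
    """Return {color: probability} for the points that fall inside rect.

    points: iterable of (coords, color, weight) tuples (assumed layout).
    rect:   list of (lo, hi) closed intervals, one per dimension.
    """
    totals = defaultdict(float)
    for coords, color, weight in points:
        # A point is inside the rectangle if every coordinate lies in its interval.
        if all(lo <= x <= hi for x, (lo, hi) in zip(coords, rect)):
            totals[color] += weight
    total = sum(totals.values())
    if total == 0:
        return {}  # empty range (assumption: treated as entropy 0 below)
    return {c: w / total for c, w in totals.items()}


def range_shannon_entropy(points, rect):
    """Shannon entropy (base 2) of the color distribution in rect, by linear scan."""
    dist = color_distribution(points, rect)
    return sum(p * math.log2(1.0 / p) for p in dist.values() if p > 0)


def range_renyi_entropy(points, rect, alpha):
    """Renyi entropy of order alpha (alpha > 0, alpha != 1), base 2, by linear scan."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    dist = color_distribution(points, rect)
    if not dist:
        return 0.0
    power_sum = sum(p ** alpha for p in dist.values() if p > 0)
    return math.log2(power_sum) / (1.0 - alpha)


if __name__ == "__main__":
    pts = [((0, 0), "red", 1.0), ((1, 1), "blue", 1.0), ((5, 5), "red", 2.0)]
    query = [(0, 2), (0, 2)]  # covers the first two points only
    print(range_shannon_entropy(pts, query))
    print(range_renyi_entropy(pts, query, alpha=2))
```

Each call costs $O(nd)$ time and needs no preprocessing, so the sketch is useful mainly as a correctness reference against which an exact or approximate data structure can be checked.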