We observe an infinite sequence of independent, identically distributed random variables $X_1,X_2,\ldots$ drawn from an unknown distribution $p$ over $[n]$, and our goal is to estimate the entropy $H(p)=-\mathbb{E}[\log p(X)]$ to within an additive error of $\varepsilon$. To that end, at each time point we are allowed to update a finite-state machine with $S$ states, using a possibly randomized but time-invariant rule, where each state of the machine is assigned an entropy estimate. Our goal is to characterize the minimax memory complexity $S^*$ of this problem, which is the minimal number of states for which the estimation task is feasible with probability at least $1-\delta$ asymptotically, uniformly in $p$. Specifically, we show that there exist universal constants $C_1$ and $C_2$ such that $S^* \leq C_1\cdot\frac{n (\log n)^4}{\varepsilon^2\delta}$ for $\varepsilon$ not too small, and $S^* \geq C_2 \cdot \max \{n, \frac{\log n}{\varepsilon}\}$ for $\varepsilon$ not too large. The upper bound is proved using approximate counting to estimate the logarithm of $p$, and a finite-memory bias estimation machine to approximate the expectation. The lower bound is proved via a reduction of entropy estimation to uniformity testing. We also apply these results to derive bounds on the memory complexity of mutual information estimation.
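To give a feel for the approximate counting mentioned in the upper-bound argument, here is a minimal sketch of one standard approximate counter, the Morris counter. It is an assumption that this variant resembles the one used in the proof; the abstract does not specify the construction. The names `MorrisCounter` and `average_estimate` are hypothetical and chosen only for illustration. The counter stores just an exponent `c`, which is roughly the base-2 logarithm of the number of increments, so a logarithm can be tracked with very few states.

```python
import random


class MorrisCounter:
    """Illustrative Morris approximate counter (hypothetical name).

    Only the exponent c is stored, so the memory needed grows like
    log(log(N)) bits for a count near N.
    """

    def __init__(self, rng=None):
        self.c = 0
        self.rng = rng if rng is not None else random.Random()

    def increment(self):
        # Advance the exponent with probability 2^(-c).
        if self.rng.random() < 2.0 ** (-self.c):
            self.c += 1

    def estimate(self):
        # 2^c - 1 is an unbiased estimate of the number of increments.
        return 2 ** self.c - 1


def average_estimate(true_count, trials, seed=0):
    """Average the estimate over independent counters (illustration only)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        counter = MorrisCounter(rng)
        for _ in range(true_count):
            counter.increment()
        total += counter.estimate()
    return total / trials
```

A single counter is noisy (its standard deviation is on the order of the count itself), so the averaging in `average_estimate` is there only to make the unbiasedness visible; it is not a claim about how the finite-state machine in the paper combines its estimates.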