Efficient Representation of Large-Alphabet Probability Distributions

A number of engineering and scientific problems require representing and manipulating probability distributions over large alphabets, which we may think of as long vectors of nonnegative reals summing to $1$. In some cases such a vector must be represented with only $b$ bits per entry. A natural choice is to partition the interval $[0,1]$ into $2^b$ uniform bins and quantize each entry independently to its bin. We show that a minor modification of this procedure, namely applying an entrywise non-linear function (a compander) $f(x)$ prior to quantization, yields an extremely effective quantization method. For example, for $b=8$ $(16)$ and $10^5$-sized alphabets, the loss of the representation (under KL divergence) improves from $0.5$ $(0.1)$ bits/entry to $10^{-4}$ $(10^{-9})$ bits/entry. Compared to floating point representations, our compander method improves the loss from $10^{-1}$ $(10^{-6})$ to $10^{-4}$ $(10^{-9})$ bits/entry. These numbers hold both for real-world data (word frequencies in books and DNA $k$-mer counts) and for synthetic randomly generated distributions. Theoretically, we set up a minimax optimality criterion and show that the compander $f(x) \propto \mathrm{ArcSinh}(\sqrt{(1/2) (K \log K) x})$ achieves near-optimal performance, attaining a KL-quantization loss of $\asymp 2^{-2b} \log^2 K$ for a $K$-letter alphabet as $b\to \infty$. Interestingly, a similar minimax criterion for the quadratic loss on the hypercube shows optimality of the standard uniform quantizer. This suggests that the $\mathrm{ArcSinh}$ quantizer is as fundamental for KL-distortion as the uniform quantizer is for quadratic distortion.
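
To make the procedure concrete, the following is a minimal sketch of companded quantization with the $\mathrm{ArcSinh}$ compander, not the authors' implementation. Several details are assumptions not fixed by the abstract: the compander is normalized so that $f(1)=1$, $\log$ is taken as the natural logarithm, each entry is decoded to the midpoint of its bin in the companded domain, and the decoded vector is renormalized to sum to $1$. The helper names (`arcsinh_compander`, `arcsinh_expander`, `quantize`, `kl_bits`) are hypothetical.

```python
import numpy as np


def arcsinh_compander(x, K):
    """f(x) proportional to arcsinh(sqrt((1/2) K log K x)), scaled so f(1) = 1.

    Assumption: natural logarithm; the abstract does not state the base.
    """
    c = 0.5 * K * np.log(K)
    return np.arcsinh(np.sqrt(c * x)) / np.arcsinh(np.sqrt(c))


def arcsinh_expander(y, K):
    """Inverse of arcsinh_compander on [0, 1]."""
    c = 0.5 * K * np.log(K)
    return np.sinh(y * np.arcsinh(np.sqrt(c))) ** 2 / c


def quantize(p, b, compander=None, expander=None):
    """Quantize each entry of p to one of 2**b bins, then decode.

    With compander=None this is the plain uniform quantizer on [0, 1].
    Assumption: decode to the bin midpoint in the companded domain and
    renormalize so the result is again a probability vector.
    """
    levels = 2 ** b
    y = p if compander is None else compander(p)
    # Bin index in {0, ..., 2**b - 1}; y == 1 is clipped into the last bin.
    idx = np.minimum(np.floor(y * levels), levels - 1)
    y_hat = (idx + 0.5) / levels
    x_hat = y_hat if expander is None else expander(y_hat)
    return x_hat / x_hat.sum()


def kl_bits(p, q):
    """KL divergence D(p || q) in bits, summed over the whole alphabet."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, b = 1000, 8
    p = rng.dirichlet(np.ones(K))  # synthetic random distribution

    q_uniform = quantize(p, b)
    q_arcsinh = quantize(
        p, b,
        compander=lambda x: arcsinh_compander(x, K),
        expander=lambda y: arcsinh_expander(y, K),
    )
    print("uniform  KL (bits):", kl_bits(p, q_uniform))
    print("arcsinh  KL (bits):", kl_bits(p, q_arcsinh))
```

The sketch reports the total KL divergence over the alphabet rather than the per-entry figures quoted above, and it uses a much smaller alphabet than $10^5$, so its output is only a qualitative illustration of the gap between the two quantizers.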
