We consider the classical problem of discrete distribution estimation from i.i.d. samples in a novel scenario where additional side information about the distribution is available. In large-alphabet datasets such as text corpora, such side information arises naturally through word semantics and similarities, which can be inferred, for instance, from the closeness of vector word embeddings. We consider two specific models for side information: a local model, where the unknown distribution lies in the neighborhood of a known distribution, and a partial ordering model, where the alphabet is partitioned into known higher- and lower-probability sets. In both models, we theoretically characterize the improvement in a suitable squared-error risk afforded by the side information. Simulations on natural language and synthetic data illustrate these gains.
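To make the local model concrete, the following minimal sketch (not the paper's estimator; the shrinkage weight `lam` and the reference distribution `q` are illustrative assumptions) shows one simple way side information can reduce squared-error risk: shrinking the empirical estimate toward a known distribution `q` that is close to the unknown `p`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: unknown p lies in a small neighborhood of a known q.
k, n = 100, 500                      # alphabet size, sample size
q = np.full(k, 1.0 / k)              # known reference distribution (assumption)
p = q + rng.normal(0, 0.002, k)      # unknown p, perturbed slightly from q
p = np.clip(p, 1e-6, None)
p /= p.sum()                         # renormalize to a valid distribution

# Draw i.i.d. samples and form the plain empirical estimate (no side info).
samples = rng.choice(k, size=n, p=p)
p_hat = np.bincount(samples, minlength=k) / n

# Side-information estimator: shrink the empirical estimate toward q.
lam = 0.5                            # shrinkage weight (illustrative choice)
p_shrunk = lam * p_hat + (1 - lam) * q

mse_emp = np.mean((p_hat - p) ** 2)      # risk without side information
mse_shrunk = np.mean((p_shrunk - p) ** 2)  # risk with side information
```

When `p` is genuinely close to `q`, the shrinkage trades a small bias for a large variance reduction, so `mse_shrunk` typically comes out below `mse_emp`; this is the kind of gain the local model formalizes.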