We consider the classical problem of discrete distribution estimation from i.i.d. samples in a novel scenario where additional side information about the distribution is available. In large-alphabet datasets such as text corpora, such side information arises naturally, for instance through word semantics/similarities inferred from the closeness of vector word embeddings. We consider two specific models of side information: a local model, where the unknown distribution lies in the neighborhood of a known distribution, and a partial-ordering model, where the alphabet is partitioned into known higher- and lower-probability sets. In both models, we theoretically characterize the improvement in a suitable squared-error risk afforded by the side information. Simulations on natural language and synthetic data illustrate these gains.
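As a minimal illustration of the local model, the sketch below compares the squared error of the plain empirical distribution against an estimate that exploits side information by shrinking toward a known nearby distribution. The shrinkage estimator and its weight `lam` are illustrative assumptions, not the estimators analyzed in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

k = 100   # alphabet size
n = 200   # number of i.i.d. samples

# Side information (local model): a known reference distribution q.
q = rng.dirichlet(np.ones(k))

# The true distribution p lies in a small neighborhood of q.
p = q + 0.05 * (rng.dirichlet(np.ones(k)) - q)
p = np.clip(p, 0, None)
p /= p.sum()

# Draw i.i.d. samples and form the empirical estimate.
counts = rng.multinomial(n, p)
p_emp = counts / n

# Hypothetical side-information-aided estimate: shrink the
# empirical distribution toward the known distribution q.
lam = 0.5
p_side = lam * q + (1 - lam) * p_emp

err_emp = np.sum((p_emp - p) ** 2)    # squared error, no side info
err_side = np.sum((p_side - p) ** 2)  # squared error, with side info
print(err_emp, err_side)
```

For a true distribution close to the reference, the shrinkage estimate trades a small bias for a large variance reduction, so its squared error is typically well below that of the empirical estimate.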