A corpus of vector-embedded text documents has some empirical distribution. Given two corpora, we want to calculate a single distance metric between them (e.g., MAUVE or Fréchet Inception Distance). We describe an abstract quality of such metrics, which we call "distributionality". A non-distributional metric tends to rely on very local measurements, or uses global measurements in a way that does not fully reflect the true distance between the distributions. For example, if individual pairwise nearest-neighbor distances are low, it may judge the two corpora to be close even when their distributions are in fact far apart. A more distributional metric will, in contrast, better capture the overall distance between the distributions. We quantify this quality by constructing a Known-Similarity Corpora set from two paraphrase corpora and calculating the distance between pairs of corpora drawn from it. The shape of the trend in these distances, as the separation between set elements increases, should quantify the distributionality of the metric. We propose that Average Hausdorff Distance and energy distance between corpora are representative examples of non-distributional and distributional distance metrics, respectively, against which other metrics can be compared to evaluate how distributional they are.
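To make the contrast between the two reference metrics concrete, the following is a minimal sketch, not the authors' implementation. It assumes Euclidean distance between embeddings, a symmetric Average Hausdorff Distance defined as the mean of the two directed average nearest-neighbor distances, and the unsquared-root, V-statistic form of energy distance; other conventions exist. The function names and the toy corpora are hypothetical and chosen only for illustration.

```python
import numpy as np


def pairwise_distances(a, b):
    """Euclidean distances between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def average_hausdorff(a, b):
    """Assumed form: mean of the two directed average nearest-neighbor distances."""
    d = pairwise_distances(a, b)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())


def energy_distance(a, b):
    """Assumed form: 2 E|X-Y| - E|X-X'| - E|Y-Y'|, estimated over all pairs."""
    return (
        2.0 * pairwise_distances(a, b).mean()
        - pairwise_distances(a, a).mean()
        - pairwise_distances(b, b).mean()
    )


# Toy 1-D "embeddings": both corpora occupy the same two locations (0 and 10),
# but in different proportions (50/50 versus 90/10).
corpus_a = np.array([[0.0]] * 5 + [[10.0]] * 5)
corpus_b = np.array([[0.0]] * 9 + [[10.0]] * 1)

# Every point has an exact nearest neighbor in the other corpus, so the
# nearest-neighbor-based metric reports no difference at all.
print(average_hausdorff(corpus_a, corpus_b))  # 0.0

# Energy distance compares all pairs and registers the shift in proportions.
print(energy_distance(corpus_a, corpus_b))  # approximately 3.2
```

In this toy case the nearest-neighbor-based metric is blind to the change in mixture proportions, while energy distance is not, which is the kind of behavior the abstract attributes to non-distributional and distributional metrics.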