Understanding the geometric properties of the latent spaces of natural language processing models makes it possible to manipulate those properties to improve performance on downstream tasks. One such property is data spread: how fully a model uses its available latent space. In this work, we define data spread and demonstrate that two commonly used measures, Average Cosine Similarity and the partition function min/max ratio I(V), are not reliable for comparing the use of latent space across models. We propose and examine eight alternative measures of data spread, all but one of which improve on these existing measures when applied to seven synthetic data distributions. Of the proposed measures, we recommend one principal component-based measure and one entropy-based measure, both of which provide reliable, relative measures of spread and can be used to compare models of different sizes and dimensionalities.
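For readers unfamiliar with the two baseline measures named above, the following is a minimal sketch of how they are commonly computed. It is not taken from this work: it assumes that Average Cosine Similarity is the mean cosine over all distinct pairs of embeddings, and that I(V) follows the usual partition-function form Z(c) = Σ_i exp(c · v_i), with c ranging over the eigenvectors of VᵀV. The function names are hypothetical.

```python
import numpy as np


def average_cosine_similarity(V):
    """Mean cosine similarity over all distinct pairs of rows of V (assumed definition)."""
    V = np.asarray(V, dtype=float)
    U = V / np.linalg.norm(V, axis=1, keepdims=True)  # unit-normalize each row
    S = U @ U.T                                       # pairwise cosine similarities
    n = U.shape[0]
    # Exclude the diagonal (self-similarity is always 1).
    return (S.sum() - np.trace(S)) / (n * (n - 1))


def partition_ratio(V):
    """I(V) = min_c Z(c) / max_c Z(c), with Z(c) = sum_i exp(c . v_i) (assumed definition).

    c ranges over the unit eigenvectors of V^T V. Eigenvector signs are
    arbitrary, and Z(c) != Z(-c) in general, so this sketch simply uses
    the signs NumPy returns.
    """
    V = np.asarray(V, dtype=float)
    _, eigvecs = np.linalg.eigh(V.T @ V)    # columns are orthonormal eigenvectors
    Z = np.exp(V @ eigvecs).sum(axis=0)     # one partition value per eigenvector
    return Z.min() / Z.max()
```

Under these assumed definitions, an Average Cosine Similarity near 0 and an I(V) near 1 are usually read as signs that the points are spread evenly in all directions. Both values are computed from raw dot products, so they change with the scale and offset of the embeddings, which is one reason to be cautious when comparing them across models.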