The popularity of transformer-based text embeddings calls for better statistical tools for measuring distributions of such embeddings. One such tool would be a method for ranking texts within a corpus by centrality, i.e., assigning each text a number signifying how representative that text is of the corpus as a whole. However, defining an intrinsic center-outward ordering of high-dimensional text representations is not trivial. A statistical depth is a function for ranking k-dimensional objects by measuring centrality with respect to some observed k-dimensional distribution. We adopt a statistical depth for measuring distributions of transformer-based text embeddings, which we call transformer-based text embedding (TTE) depth, and introduce the practical use of this depth for both modeling and distributional inference in NLP pipelines. We first define TTE depth and an associated rank sum test for determining whether two corpora differ significantly in embedding space. We then use TTE depth for the task of in-context learning prompt selection, showing that this approach reliably improves performance over statistical baseline approaches across six text classification tasks. Finally, we use TTE depth and the associated rank sum test to characterize the distributions of synthesized and human-generated corpora, showing that five recent synthetic data augmentation processes cause a measurable distributional shift away from associated human-generated text.
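To make the idea of a depth-based ranking and an associated rank sum test concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not state how TTE depth is computed, so the sketch assumes a distance-based depth (2 minus the mean cosine distance from a text's embedding to every embedding in the corpus) and a two-sided Wilcoxon rank sum test with a normal approximation. All function names are hypothetical, and the embeddings are toy vectors standing in for real transformer outputs.

```python
import math
from statistics import mean


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def depth(x, corpus):
    """ASSUMED depth: 2 minus the mean cosine distance from x to the corpus.

    Higher values mean x is more central (more representative of the corpus).
    """
    return 2.0 - mean(cosine_distance(x, y) for y in corpus)


def rank_by_depth(corpus):
    """Return corpus indices ordered from most central to least central.

    Each embedding is scored against the full corpus, itself included.
    """
    depths = [depth(x, corpus) for x in corpus]
    return sorted(range(len(corpus)), key=lambda i: depths[i], reverse=True)


def rank_sum_test(a, b):
    """Two-sided Wilcoxon rank sum test on two samples of depth scores.

    Uses average ranks for ties and a normal approximation without a tie
    correction, so the p-value is only approximate for small or tied samples.
    Returns (z, p).
    """
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        i = j
    n1, n2 = len(a), len(b)
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    mu = n1 * (n1 + n2 + 1) / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Toy 2-dimensional "embeddings"; real TTEs have hundreds of dimensions.
human = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.7, 0.2]]
synthetic = [[0.1, 1.0], [0.0, 0.9], [0.2, 0.8], [0.3, 1.0]]

# ASSUMPTION: both corpora are scored against the human corpus as reference.
human_depths = [depth(x, human) for x in human]
synthetic_depths = [depth(x, synthetic_ref) for synthetic_ref in [human] for x in synthetic]
z_stat, p_value = rank_sum_test(human_depths, synthetic_depths)
```

In this sketch, a small `p_value` would suggest that the two corpora occupy different regions of embedding space relative to the reference corpus. With real data, a library routine such as SciPy's `scipy.stats.ranksums` would be a more robust choice than the hand-rolled test.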