Cosine similarity is an established metric for measuring association between vectors, and it is commonly used to identify related samples in biological perturbational data. The distribution of cosine similarity changes with the covariance of the data, which in turn affects the statistical power to identify related signals. The relationship between the mean and covariance of the data distribution and the distribution of cosine similarity is poorly understood. In this work, we derive the asymptotic moments of cosine similarity as a function of the data distribution and identify the conditions on the data covariance matrix that minimize the variance of cosine similarity. We find that, for centered data, the variance of cosine similarity is minimized when the eigenvalues of the covariance matrix are equal. One immediate application of this work is characterizing the null distribution of cosine similarity over a dataset with non-zero covariance structure. Furthermore, this result can be used to optimize over a set of transformations or representations of a dataset to maximize power, recall, or other discriminative metrics, with direct application to noisy biological data. While we consider the specific biological domain of perturbational data analysis, our result has potential application to any use of cosine similarity or Pearson's correlation on data with covariance structure.
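The claim about equal eigenvalues can be checked numerically. The following is a minimal Monte Carlo sketch, not the derivation used in this work: it assumes pairs of independent, centered Gaussian vectors with a diagonal covariance matrix (a choice made only for illustration, since cosine similarity is unchanged when both vectors are rotated together), and it compares an equal spectrum against a skewed one. The function name `cosine_similarity_variance`, the dimension, the sample count, and the skewed spectrum are all hypothetical choices.

```python
import numpy as np


def cosine_similarity_variance(eigenvalues, n_pairs=20000, seed=0):
    """Monte Carlo estimate of Var[cos(x, y)] for independent
    x, y ~ N(0, diag(eigenvalues)). Illustrative helper only."""
    rng = np.random.default_rng(seed)
    # Scaling standard normal draws by sqrt(eigenvalue) gives the
    # requested per-coordinate variance.
    scale = np.sqrt(np.asarray(eigenvalues, dtype=float))
    d = scale.size
    x = rng.standard_normal((n_pairs, d)) * scale
    y = rng.standard_normal((n_pairs, d)) * scale
    # Row-wise cosine similarity between paired samples.
    dots = np.sum(x * y, axis=1)
    norms = np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1)
    return np.var(dots / norms)


d = 50
equal_spectrum = np.ones(d)                   # all eigenvalues equal
skewed_spectrum = np.geomspace(1.0, 100.0, d)  # eigenvalues spread over two decades

var_equal = cosine_similarity_variance(equal_spectrum)
var_skewed = cosine_similarity_variance(skewed_spectrum)

print(f"equal eigenvalues:  variance = {var_equal:.4f} (1/d = {1 / d:.4f})")
print(f"skewed eigenvalues: variance = {var_skewed:.4f}")
```

Under these assumptions, the equal-eigenvalue estimate should sit close to 1/d, the known variance of the cosine between independent isotropic random vectors in d dimensions, while the skewed spectrum should give a visibly larger variance and therefore a wider null distribution.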