The magnitude of the Pearson correlation between two scalar random variables can be judged visually from a two-dimensional scatter plot of an independent and identically distributed sample drawn from their joint distribution: the closer the points lie to a slanted straight line, the greater the correlation. To the best of our knowledge, no comparable graphical representation or geometric quantification of tree correlation exists in the literature, although tree-shaped datasets are frequently encountered in various fields, for example academic genealogy trees and embryonic development trees. In this paper, we introduce a geometric statistic that both represents tree correlation intuitively and quantifies its magnitude precisely. We provide the theoretical properties of the geometric statistic. Large-scale simulations based on various data distributions demonstrate that the statistic measures tree correlation precisely. An application to mathematical genealogy trees also demonstrates its usefulness.