Geometric morphometrics (GMM) is widely used to quantify shape variation and has more recently served as input for machine learning (ML) analyses. Standard practice aligns all specimens via Generalized Procrustes Analysis (GPA) prior to splitting data into training and test sets, potentially introducing statistical dependence and contaminating downstream predictive models. Here, the effects of GPA-induced contamination are formally characterised using controlled 2D and 3D simulations across varying sample sizes, landmark densities, and allometric patterns. A novel realignment procedure is proposed, whereby test specimens are aligned to the training set prior to model fitting, eliminating cross-sample dependency. Simulations reveal a robust "diagonal" in sample-size versus landmark-number space, reflecting the scaling of RMSE under isotropic variation, with slopes analytically derived from the degrees of freedom in Procrustes tangent space. The importance of spatial autocorrelation among landmarks is further demonstrated using linear and convolutional regression models, which show performance degradation when landmark relationships are ignored. This work establishes the need for careful preprocessing in ML applications of GMM, provides practical guidelines for realignment, and clarifies fundamental statistical constraints inherent to Procrustes shape space.
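The realignment procedure described above can be sketched as follows. This is a minimal, hypothetical illustration (not the authors' implementation): a basic GPA is run on the training set only, and each held-out test specimen is then superimposed onto the training consensus via an ordinary Procrustes fit, so the test data never influence the alignment. Function names, iteration counts, and the toy data are assumptions for the sketch; reflections are not excluded in this minimal version.

```python
import numpy as np

def center_and_scale(X):
    """Translate landmarks to centroid origin and scale to unit centroid size."""
    X = X - X.mean(axis=0)
    return X / np.linalg.norm(X)

def opa_align(X, ref):
    """Ordinary Procrustes fit: rotate a centered, scaled configuration X onto ref."""
    U, _, Vt = np.linalg.svd(ref.T @ X)
    R = (U @ Vt).T  # optimal rotation (reflections not excluded in this sketch)
    return X @ R

def gpa(shapes, iters=10):
    """Minimal Generalized Procrustes Analysis on an (n, k, d) landmark array."""
    aligned = np.array([center_and_scale(s) for s in shapes])
    mean = aligned[0]
    for _ in range(iters):
        aligned = np.array([opa_align(s, mean) for s in aligned])
        mean = center_and_scale(aligned.mean(axis=0))
    return aligned, mean

# Toy data (assumed for illustration): 8 landmarks in 2D, isotropic noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 2))
train = base + 0.01 * rng.normal(size=(20, 8, 2))
test = base + 0.01 * rng.normal(size=(5, 8, 2))

# Only the training set enters GPA; test specimens are realigned afterwards
# to the training consensus, avoiding cross-sample dependence.
train_aligned, consensus = gpa(train)
test_aligned = np.array([opa_align(center_and_scale(s), consensus) for s in test])
```

Because each test specimen is fit to the fixed training consensus one at a time, no information flows from the test set into the superimposition, which is the dependency the abstract identifies in the standard all-specimens GPA.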