Biological networks are widely used in biomedical and healthcare domains to model the structure of complex biological systems, with interactions linking biological entities. However, because these networks are high-dimensional and sample sizes are small, directly applying deep learning models to them usually leads to severe overfitting. In this work, we propose R-MIXUP, a Mixup-based data augmentation technique that suits the symmetric positive definite (SPD) property of adjacency matrices from biological networks and offers optimized training efficiency. The interpolation process in R-MIXUP uses the log-Euclidean distance metric on the Riemannian manifold, which addresses the swelling effect and the arbitrarily incorrect label issue of vanilla Mixup. We demonstrate the effectiveness of R-MIXUP on five real-world biological network datasets, covering both regression and classification tasks. In addition, we derive a commonly ignored necessary condition for identifying the SPD matrices of biological networks and empirically study its influence on model performance. The code implementation can be found in Appendix E.
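The log-Euclidean interpolation mentioned in the abstract can be illustrated with a short sketch. It is not the implementation from Appendix E. It assumes the standard log-Euclidean geodesic, exp((1 - λ) log S_i + λ log S_j), computes matrix logarithms and exponentials by eigendecomposition, and mixes labels linearly as in vanilla Mixup. The function names (`spd_log`, `sym_exp`, `log_euclidean_mixup`, `random_spd`) and the toy matrices are hypothetical and chosen only for illustration.

```python
import numpy as np


def spd_log(S):
    """Matrix logarithm of an SPD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)          # eigenvalues w > 0 for SPD input
    return (V * np.log(w)) @ V.T


def sym_exp(A):
    """Matrix exponential of a symmetric matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.exp(w)) @ V.T      # result is SPD because exp(w) > 0


def log_euclidean_mixup(S_i, S_j, y_i, y_j, lam):
    """Mix two SPD matrices along the log-Euclidean geodesic.

    Assumption: labels are interpolated linearly, as in vanilla Mixup.
    """
    L = (1.0 - lam) * spd_log(S_i) + lam * spd_log(S_j)
    L = (L + L.T) / 2.0               # remove tiny asymmetry from rounding
    return sym_exp(L), (1.0 - lam) * y_i + lam * y_j


def random_spd(n, rng):
    """Hypothetical helper: a random, well-conditioned SPD matrix."""
    A = rng.standard_normal((n, 2 * n))
    return A @ A.T / (2 * n) + 1e-3 * np.eye(n)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S_i, S_j = random_spd(4, rng), random_spd(4, rng)
    lam = rng.beta(1.0, 1.0)          # Mixup usually draws lam from Beta(alpha, alpha)
    S_mix, y_mix = log_euclidean_mixup(S_i, S_j, 1.0, 0.0, lam)

    logdet = lambda X: np.linalg.slogdet(X)[1]
    print("log det, log-Euclidean mix:", logdet(S_mix))
    print("log det, linear mix:       ", logdet((1 - lam) * S_i + lam * S_j))
```

Under this construction the log-determinant of the mixed matrix equals the λ-weighted average of the endpoint log-determinants. Plain linear interpolation of the matrices can only give a log-determinant at least that large, which is one way to see the swelling effect that the abstract refers to.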