Machine learning techniques are becoming steadily more important in modern biology, where they are used to build predictive models, discover patterns, and investigate biological problems. However, models trained on one dataset often fail to generalize to datasets from other cohorts or laboratories because the statistical properties of these datasets differ. Such differences can stem from technical factors, such as the measurement technique used, or from relevant biological differences between the populations studied. Domain adaptation, a type of transfer learning, can alleviate this problem by aligning the statistical distributions of features and samples across datasets so that similar models can be applied to them. However, most state-of-the-art domain adaptation methods are designed for large-scale data, mostly text and images, whereas biological datasets often suffer from small sample sizes and possess complexities such as heterogeneity of the feature space. This Review synthesizes domain adaptation methods in the context of small-scale and highly heterogeneous biological data. We describe the benefits and challenges of domain adaptation in biological research and critically discuss some of its objectives, strengths, and weaknesses through key representative methodologies. We argue for incorporating domain adaptation techniques into the computational biologist's toolkit, together with further development of customized approaches.