Domain adaptation in small-scale and heterogeneous biological datasets

Machine learning techniques are becoming steadily more important in modern biology, where they are used to build predictive models, discover patterns, and investigate biological problems. However, models trained on one dataset often do not generalize to datasets from other cohorts or laboratories, because the statistical properties of these datasets differ. These differences can stem from technical factors, such as the measurement technique used, or from relevant biological differences between the populations studied. Domain adaptation, a type of transfer learning, can alleviate this problem by aligning the statistical distributions of features and samples across datasets so that similar models can be applied to all of them. However, most state-of-the-art domain adaptation methods are designed for large-scale data, mostly text and images, whereas biological datasets often have small sample sizes and additional complexities such as a heterogeneous feature space. This Review provides a synthesis of domain adaptation methods in the context of small-scale and highly heterogeneous biological data. We describe the benefits and challenges of domain adaptation in biological research and critically discuss some of its objectives, strengths, and weaknesses through key representative methodologies. We argue for incorporating domain adaptation techniques into the computational biologist's toolkit, together with further development of customized approaches.
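
To make "aligning the statistical distributions of features" concrete, the following is a minimal illustrative sketch of one simple, unsupervised alignment idea: a correlation-alignment (CORAL-style) transform that whitens the source features and re-colours them with the target covariance. It is not taken from the Review. The function names, the regularization value `reg`, and the extra mean-matching step are assumptions made for illustration, and rows are assumed to be samples and columns to be a shared set of features.

```python
import numpy as np


def _sym_matrix_power(mat, power):
    """Raise a symmetric positive-definite matrix to a real power."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    return (eigvecs * eigvals**power) @ eigvecs.T


def coral_align(source, target, reg=1e-3):
    """Transform source samples so their mean and covariance match target.

    source, target: arrays of shape (n_samples, n_features) that share
    the same feature columns. `reg` (an illustrative default) is added to
    the diagonal of each covariance so that small sample sizes do not
    produce a singular matrix.
    """
    n_features = source.shape[1]
    cov_s = np.cov(source, rowvar=False) + reg * np.eye(n_features)
    cov_t = np.cov(target, rowvar=False) + reg * np.eye(n_features)

    centered = source - source.mean(axis=0)
    whitened = centered @ _sym_matrix_power(cov_s, -0.5)   # remove source covariance
    recoloured = whitened @ _sym_matrix_power(cov_t, 0.5)  # apply target covariance
    return recoloured + target.mean(axis=0)                # mean matching (added step)
```

A model fitted on `coral_align(source, target)` together with the source labels can then be applied to `target` directly. This sketch only matches first- and second-order statistics and assumes both datasets measure the same features, so it does not address the heterogeneous feature spaces discussed above.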

