GV-Rep: A Large-Scale Dataset for Genetic Variant Representation Learning

Genetic variants (GVs) are differences in DNA sequence among individuals and play a crucial role in diagnosing and treating genetic diseases. The rapid decrease in next-generation sequencing costs has led to an exponential increase in patient-level GV data. This growth poses a challenge for clinicians, who must efficiently prioritize patient-specific GVs and integrate them with existing genomic databases to inform patient management. Genomic foundation models (GFMs) have emerged to address the interpretation of GVs. However, these models lack standardized performance assessments, leading to considerable variability in model evaluations. This raises the question: how effectively do deep learning methods classify unknown GVs and align them with clinically verified GVs? We argue that representation learning, which transforms raw data into meaningful feature spaces, is an effective approach for addressing both the indexing and classification challenges. We introduce GV-Rep, a large-scale genetic variant dataset featuring variable-length contexts and detailed annotations, designed for deep learning models to learn GV representations across various traits, diseases, tissue types, and experimental contexts. Our contributions are three-fold: (i) construction of a comprehensive dataset with 7 million records, each labeled with characteristics of the corresponding variant, alongside additional data from 17,548 gene knockout tests across 1,107 cell types, 1,808 variant combinations, and 156 unique clinically verified GVs from real-world patients; (ii) analysis of the structure and properties of the dataset; and (iii) experiments on the dataset with pre-trained GFMs. The results show a significant gap between GFMs' current capabilities and accurate GV representation. We hope this dataset will help advance genomic deep learning to bridge this gap.
