Next-generation sequencing (NGS) is a key technique for studying the DNA and RNA of organisms. However, identifying quality problems in NGS data across different experimental settings remains challenging. To develop automated quality-control tools, researchers need datasets whose features capture the characteristics of quality problems, yet existing NGS repositories offer only a limited number of quality-related features. To address this gap, we present a dataset derived from 37,491 NGS samples with two types of quality-related feature representations. The first type consists of 34 features derived from quality control tools (QC-34 features). The second type comprises between eight and 1,183 features derived from read counts in problematic genomic regions identified by the ENCODE blocklist (BL features). All features describe the same human and mouse samples from five genomic assays, allowing direct comparison of the feature representations. The dataset includes a binary quality label derived from automated quality control and domain experts. Among all samples, $3.2\%$ are of low quality. Supervised machine learning algorithms accurately predicted the quality labels from the features, confirming the relevance of the provided feature representations. These representations enable researchers to study how different feature types (QC-34 vs. BL features) and granularities (varying numbers of BL features) affect the detection of quality problems.
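To make the idea of BL features concrete, the following is a minimal sketch of how read counts in blocklisted regions could be turned into a feature vector. It is not the procedure used to build the dataset: the function names `count_reads_in_regions` and `to_features`, the toy reads and regions, the use of read start positions, and the reads-per-million scaling are all assumptions made for illustration.

```python
from bisect import bisect_left


def count_reads_in_regions(read_starts, regions):
    """Count reads whose start position falls inside each region.

    read_starts: dict mapping chromosome -> list of read start positions
    regions: list of (chrom, start, end) tuples, half-open [start, end)
    Returns one count per region, in the order the regions are given.
    """
    # Sort positions once per chromosome so each region is a binary search.
    sorted_starts = {chrom: sorted(pos) for chrom, pos in read_starts.items()}
    counts = []
    for chrom, start, end in regions:
        positions = sorted_starts.get(chrom, [])
        # Number of positions p with start <= p < end.
        counts.append(bisect_left(positions, end) - bisect_left(positions, start))
    return counts


def to_features(counts, total_reads):
    """Scale raw counts to reads per million (an assumed normalization)."""
    if total_reads == 0:
        return [0.0] * len(counts)
    return [c * 1_000_000 / total_reads for c in counts]


# Toy data, invented for this example (not taken from the dataset).
reads = {"chr1": [100, 150, 205, 990, 1500], "chr2": [10, 20, 3000]}
blocklist = [
    ("chr1", 100, 210),
    ("chr1", 900, 1000),
    ("chr2", 0, 25),
    ("chr3", 0, 100),  # chromosome with no reads in this sample
]

total = sum(len(positions) for positions in reads.values())
raw_counts = count_reads_in_regions(reads, blocklist)
bl_features = to_features(raw_counts, total)
print(raw_counts)
print(bl_features)
```

In this sketch, each blocklisted region contributes one feature, so the length of the feature vector follows the number of regions considered. Grouping or subsetting regions would be one way to obtain coarser or finer representations, although how the dataset varies the number of BL features is not specified here.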