Motivation: Despite significant advances in Third-Generation Sequencing (TGS) technologies, Next-Generation Sequencing (NGS) technologies remain dominant in the current sequencing market, owing to their lower error rates and richer analytical software compared with TGS. NGS technologies generate vast amounts of genomic data, including short reads, quality values, and read identifiers. Efficient compression of such data has therefore become a pressing need, prompting extensive research on FASTQ compressors. Previous research suggests that lossless compression of quality values is approaching its limits, but considerable room remains for compressing the reads themselves. Results: By investigating the characteristics of the sequencing process, we present AMGC, a new algorithm for compressing reads in FASTQ files that can be integrated into various genomic compression tools. We first reviewed the pipeline of reference-based algorithms and identified three key components that heavily impact storage: the matching positions of reads on the reference sequence (refpos), the positions of mismatched bases within reads (mispos), and the reads that fail to match (unmapseq). To reduce their sizes, we conducted a detailed analysis of the distribution of matching positions and sequencing errors, and then developed the three modules of AMGC. In our experiments, AMGC outperformed current state-of-the-art methods, achieving an average gain of 81.23% in compression ratio over the second-best-performing compressor.
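To make the three components concrete, the following is a minimal sketch of how a generic reference-based encoder could split reads into refpos, mispos, and unmapseq streams. It is not AMGC's method: the brute-force matcher, the fixed read length, the substitution-only error model, and the names `Encoded`, `encode_reads`, `decode_mapped`, and `max_mismatches` are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Tuple


class Encoded(NamedTuple):
    """Hypothetical container for the three streams named in the abstract."""
    refpos: List[int]                     # match position of each mapped read
    mispos: List[List[Tuple[int, str]]]   # per mapped read: (offset, read base)
    unmapseq: List[str]                   # reads stored verbatim


def mismatches(read: str, window: str) -> List[Tuple[int, str]]:
    """Return (offset, read base) for every substituted base."""
    return [(i, b) for i, (b, r) in enumerate(zip(read, window)) if b != r]


def encode_reads(reference: str, reads: List[str],
                 max_mismatches: int = 1) -> Encoded:
    """Brute-force placement of each read on the reference (toy matcher)."""
    refpos, mispos, unmapseq = [], [], []
    for read in reads:
        best = None
        for pos in range(len(reference) - len(read) + 1):
            mm = mismatches(read, reference[pos:pos + len(read)])
            # Keep the first position with the fewest mismatches.
            if len(mm) <= max_mismatches and (best is None or len(mm) < len(best[1])):
                best = (pos, mm)
        if best is None:
            unmapseq.append(read)         # matching failed: keep the raw read
        else:
            refpos.append(best[0])
            mispos.append(best[1])
    return Encoded(refpos, mispos, unmapseq)


def decode_mapped(reference: str, enc: Encoded, read_len: int) -> List[str]:
    """Rebuild mapped reads from the reference plus refpos and mispos."""
    reads = []
    for pos, mm in zip(enc.refpos, enc.mispos):
        bases = list(reference[pos:pos + read_len])
        for offset, base in mm:
            bases[offset] = base          # re-apply the substituted base
        reads.append("".join(bases))
    return reads
```

In this toy setting, a mapped read costs one integer plus a short mismatch list instead of its full sequence, while an unmapped read is stored whole, which is why the sizes of these three streams dominate the compressed output. A real compressor would also need to record read order and handle insertions, deletions, and variable read lengths.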