Dataset deduplication plays a crucial role in enhancing data quality, ultimately improving the training performance and efficiency of LLMs. A commonly used method for data deduplication is the MinHash LSH algorithm. Recently, NVIDIA introduced a GPU-based MinHash LSH deduplication method, but it remains suboptimal, leaving room for further improvement in processing efficiency. This paper proposes a GPU-accelerated deduplication framework \sys that optimizes MinHash LSH for GPU clusters and leverages computationally efficient, partially reusable non-cryptographic hash functions. When processing one million documents on a single node with four GPUs, \sys significantly outperforms the CPU-based deduplication tool included in SlimPajama by up to 58.3 times and the GPU-based deduplication tool included in NVIDIA NeMo Curator by up to 8.6 times. Deduplication of 1.2 trillion tokens completes in just 5.1 hours on a four-node, 16-GPU cluster. The related code is publicly available on GitHub (https://github.com/mcrl/FED).