With the rapid growth of textual content on the Internet, efficient large-scale semantic text retrieval has garnered increasing attention from both academia and industry. Text hashing, which projects original texts into compact binary hash codes, is a crucial method for this task. By using binary codes, the semantic similarity computation for text pairs is significantly accelerated via fast Hamming distance calculations, and storage costs are greatly reduced. With the advancement of deep learning, deep text hashing has demonstrated significant advantages over traditional, data-independent hashing techniques. By leveraging deep neural networks, these methods can learn compact and semantically rich binary representations directly from data, overcoming the performance limitations of earlier approaches. This survey investigates current deep text hashing methods by categorizing them based on their core components: semantic extraction, hash code quality preservation, and other key technologies. We then present a detailed evaluation schema with results on several popular datasets, followed by a discussion of practical applications and open-source tools for implementation. Finally, we conclude by discussing key challenges and future research directions, including the integration of deep text hashing with large language models to further advance the field. The project for this survey can be accessed at https://github.com/hly1998/DeepTextHashing.
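To make the efficiency claim concrete, here is a minimal sketch (not from the survey itself) of how Hamming distance between two integer-encoded binary hash codes reduces to a single XOR followed by a popcount, which is why comparisons over hash codes are so much cheaper than float-vector similarity. The specific codes below are hypothetical illustrations.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two integer-encoded hash codes.

    XOR leaves a 1 exactly where the codes disagree; counting those
    1-bits gives the Hamming distance in a handful of CPU operations.
    """
    return bin(a ^ b).count("1")

# Two hypothetical 8-bit hash codes for a pair of texts.
code_query = 0b10110100
code_doc = 0b10010110

print(hamming_distance(code_query, code_doc))  # differs in 2 bit positions
```

In practice, deep text hashing methods produce much longer codes (e.g. 32, 64, or 128 bits), but the same XOR-and-popcount comparison applies, and hardware `popcnt` instructions make it effectively constant time per pair.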