We present an ultra-fast, flexible algorithm that searches trillion-scale natural language corpora in under 0.3 seconds while handling semantic variations (substitution, insertion, and deletion). Our approach employs suffix-array-based string matching, which scales well with corpus size. To mitigate the combinatorial explosion induced by the semantic relaxation of queries, our method is built on two key algorithmic ideas: fast exact lookup enabled by a disk-aware design, and dynamic corpus-aware pruning. We theoretically show that the proposed method suppresses exponential growth of the search space with respect to query length by leveraging statistical properties of natural language. In experiments on FineWeb-Edu (Lozhkov et al., 2024) (1.4T tokens), our method achieves significantly lower search latency than existing methods: infini-gram (Liu et al., 2024), infini-gram mini (Xu et al., 2025), and SoftMatcha (Deguchi et al., 2025). As a practical application, we demonstrate that our method identifies benchmark contamination in training corpora that existing approaches miss. We also provide an online demo of fast, soft search across corpora in seven languages.
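To make the core primitive concrete, the following is a minimal sketch of suffix-array-based exact lookup, the standard technique the abstract builds on. This is an illustrative toy (in-memory, quadratic construction), not the paper's disk-aware implementation; the function names are hypothetical.

```python
# Sketch of exact-match lookup over a suffix array (standard technique,
# not the paper's actual disk-aware implementation).
from bisect import bisect_left, bisect_right


def build_suffix_array(text: str) -> list[int]:
    """Return start indices of all suffixes of `text`, sorted lexicographically.

    Naive O(n^2 log n) construction; real systems use linear-time algorithms.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text: str, sa: list[int], query: str) -> int:
    """Count exact occurrences of `query` via binary search over the suffix array.

    All suffixes sharing `query` as a prefix form one contiguous run in the
    sorted order, so two binary searches bound the match range.
    """
    prefixes = [text[i:i + len(query)] for i in sa]  # sorted, since sa is sorted
    lo = bisect_left(prefixes, query)
    hi = bisect_right(prefixes, query)
    return hi - lo


corpus = "the cat sat on the mat the cat"
sa = build_suffix_array(corpus)
print(count_occurrences(corpus, sa, "the cat"))  # → 2
```

Because matches occupy a contiguous interval of the suffix array, lookup cost grows only logarithmically in corpus length, which is why this family of indexes scales to very large corpora.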