A common way to speed up search over a set of text files is a trigram index: a map from each trigram (a sequence of three characters) to the set of files that contain it. To search for a pattern, we intersect the file sets of the trigrams occurring in the pattern to obtain candidate files, and then run the actual search only in those files. In a code repository, however, the trigram index changes from version to version. When a new version is checked out, the index is typically rebuilt from scratch, which is time-consuming, whereas we want the index to be available with almost zero startup time. We therefore explore a persistent version of the trigram index for full-text and keyword pattern search. Our approach reuses the trigram index of the current version and, during checkout, applies only the changes between versions, which significantly improves performance. Furthermore, we extend the data structure to support CamelHump search for class and function names.
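To make the idea concrete, the following is a minimal in-memory sketch of a trigram index that is updated from a version diff instead of being rebuilt. It is an illustration under stated assumptions, not the data structure developed in this work: the names `TrigramIndex`, `apply_diff`, `candidates`, and `search` are hypothetical, file contents are kept in memory so that old trigrams can be retracted, and a plain mutable dictionary stands in for the persistent structure.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    """Hypothetical in-memory trigram index (illustration only)."""

    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> set of file paths
        self.contents = {}                # path -> text, needed to retract old trigrams

    def add_file(self, path, text):
        self.contents[path] = text
        for gram in trigrams(text):
            self.postings[gram].add(path)

    def remove_file(self, path):
        text = self.contents.pop(path, None)
        if text is None:
            return  # file was not indexed
        for gram in trigrams(text):
            files = self.postings.get(gram)
            if files is None:
                continue
            files.discard(path)
            if not files:
                del self.postings[gram]  # drop empty posting sets

    def apply_diff(self, removed, changed):
        """Update the index for a checkout: touch only files that differ.

        removed: iterable of deleted paths
        changed: dict mapping added or modified paths to their new text
        """
        for path in removed:
            self.remove_file(path)
        for path, text in changed.items():
            self.remove_file(path)  # no-op for newly added files
            self.add_file(path, text)

    def candidates(self, pattern):
        """Intersect the posting sets of the pattern's trigrams."""
        grams = trigrams(pattern)
        if not grams:
            return set(self.contents)  # pattern too short to filter
        result = None
        for gram in grams:
            files = self.postings.get(gram, set())
            result = set(files) if result is None else result & files
            if not result:
                break  # intersection is already empty
        return result

    def search(self, pattern):
        """Verify candidates, since the intersection may contain false positives."""
        return sorted(p for p in self.candidates(pattern)
                      if pattern in self.contents[p])
```

In this sketch a checkout calls `apply_diff` with the deleted paths and the new text of added or modified files, so the update cost depends on the size of the diff and not on the size of the repository. The verification step in `search` matters because a file can contain every trigram of a pattern without containing the pattern itself.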