Overlaps between words are crucial in many areas of computer science, such as code design, stringology, and bioinformatics. A self-overlapping word is characterized by its periods and borders. A period of a word $u$ is the starting position of a suffix of $u$ that is also a prefix of $u$, and such a suffix is called a border. Each word of length $n>0$ has a set of periods, but not every combination of integers is a period set. Computing the period set of a word $u$ takes time linear in the length of $u$. We address the question of computing the set, denoted $\Gamma_n$, of all period sets of words of length $n$. Although period sets have been characterized, there is no formula for the cardinality of $\Gamma_n$ (which is exponential in $n$), and the known dynamic programming algorithm for enumerating $\Gamma_n$ suffers from its space complexity. We present an incremental approach that computes $\Gamma_n$ from $\Gamma_{n-1}$ and reduces the space complexity, followed by a constructive certification algorithm useful for verification purposes. The incremental approach defines a parental relation between sets in $\Gamma_{n-1}$ and $\Gamma_n$, which makes it possible to investigate the dynamics of period sets and their intriguing statistical properties. Moreover, the period set of a word $u$ is the key to computing the absence probability of $u$ in random texts. Thus, knowing $\Gamma_n$ is useful for assessing the significance of word statistics, such as the number of missing words in a random text.
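To make the definitions concrete, the following minimal Python sketch computes the period set of a single word in linear time using the classical border (failure-function) array, and then enumerates $\Gamma_n$ by brute force for very small $n$. It is an illustration only, not the incremental or certification algorithm described above. The function names `period_set` and `all_period_sets` are hypothetical, positions are assumed to be 0-indexed so that the trivial period 0 (the whole word as its own border) is always included, and the brute-force step assumes that a binary alphabet suffices to realize every period set.

```python
from itertools import product


def period_set(u: str) -> list[int]:
    """Return the sorted periods of u (0-indexed, trivial period 0 included)."""
    n = len(u)
    if n == 0:
        return []
    # border[i] = length of the longest proper border of u[:i + 1]
    border = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and u[i] != u[k]:
            k = border[k - 1]
        if u[i] == u[k]:
            k += 1
        border[i] = k
    # Walk the chain of borders of the full word, longest first.
    # A border of length b corresponds to the period n - b.
    periods = [0]
    b = border[n - 1]
    while b > 0:
        periods.append(n - b)
        b = border[b - 1]
    return periods


def all_period_sets(n: int, alphabet: str = "ab") -> set[tuple[int, ...]]:
    """Brute-force enumeration of the period sets of all words of length n.

    Exponential in n; intended only as a sanity check for small n.
    Assumption: the two-letter default alphabet realizes every period set.
    """
    return {tuple(period_set("".join(w))) for w in product(alphabet, repeat=n)}
```

For example, `period_set("abracadabra")` returns `[0, 7, 10]`, reflecting the borders `abra` and `a`, and `all_period_sets(4)` contains exactly four sets: `(0,)`, `(0, 2)`, `(0, 3)`, and `(0, 1, 2, 3)`. The set `{0, 2, 3}` is absent because periods 2 and 3 together force all letters to be equal, which illustrates that not every combination of integers is a period set.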