Several popular language models represent local contexts in an input text x as bags of words. Such representations are naturally encoded by a sequence graph whose vertices are the distinct words occurring in x, with edges representing the (ordered) co-occurrence of two words within a sliding window of size w. However, this compressed representation is not generally bijective, and may introduce some degree of ambiguity: some sequence graphs admit several realizations as a sequence, while others admit none. In this paper, we study the realizability and ambiguity of sequence graphs from a combinatorial and computational point of view. We consider the existence and enumeration of realizations of a sequence graph under multiple settings: window size w, presence/absence of graph orientation, and presence/absence of weights (multiplicities). When w = 2, we provide polynomial-time algorithms for realizability and enumeration in all cases except the undirected/weighted setting, where we show the #P-hardness of enumeration. For a window of size at least 3, we prove hardness of all variants, even when w is considered as a constant, with the notable exception of the undirected/unweighted case, for which we propose XP algorithms for both problems (realizability and enumeration), tight due to a corresponding W[1]-hardness result. We conclude with an integer program formulation to solve the realizability problem, and with dynamic programming to solve the enumeration problem. This work leaves open the membership in NP of both problems, a non-trivial question due to the existence of minimum realizations whose size is exponential in the size of the instance encoding.
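To make the construction concrete, the following is a minimal sketch of how a sequence graph could be built from a sequence, covering the four directed/undirected and weighted/unweighted settings. The function name `sequence_graph` and its parameters are hypothetical, and the sketch assumes that two positions i < j share a window exactly when j - i <= w - 1, so that w = 2 links consecutive words only. It illustrates the definition and is not one of the algorithms of the paper.

```python
from collections import Counter


def sequence_graph(x, w, directed=True, weighted=True):
    """Build the sequence graph of the word sequence x for window size w.

    Assumption (not stated in the abstract): positions i < j co-occur
    when j - i <= w - 1, so w = 2 links consecutive words only.
    Returns (vertices, edges). In the weighted case, edges maps each
    word pair to its multiplicity; otherwise edges is a set of pairs.
    """
    if w < 2:
        raise ValueError("window size w must be at least 2")
    counts = Counter()
    for i in range(len(x)):
        # Every later position that still fits in a window starting at i.
        for j in range(i + 1, min(i + w, len(x))):
            u, v = x[i], x[j]
            # Undirected edges are stored with their endpoints sorted.
            key = (u, v) if directed else tuple(sorted((u, v)))
            counts[key] += 1
    vertices = set(x)
    edges = dict(counts) if weighted else set(counts)
    return vertices, edges


# Two different sequences with the same unweighted directed graph (w = 2):
short = "a b a".split()
longer = "a b a b a".split()
same_graph = (
    sequence_graph(short, 2, weighted=False)
    == sequence_graph(longer, 2, weighted=False)
)
```

Here `same_graph` is true: both sequences yield the edges a→b and b→a, which is an instance of the ambiguity discussed above. Keeping the multiplicities separates the two sequences, since each edge has weight 1 for the first sequence and weight 2 for the second.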