Studying language models (LMs) in terms of well-understood formalisms allows us to precisely characterize their abilities and limitations. Previous work has investigated the representational capacity of recurrent neural network (RNN) LMs in terms of their ability to recognize unweighted formal languages. However, LMs do not describe unweighted formal languages; rather, they define probability distributions over strings. In this work, we study which classes of such probability distributions RNN LMs can represent, which allows us to make more direct statements about their capabilities. We show that simple RNNs are equivalent to a subclass of probabilistic finite-state automata and can thus model only a strict subset of the probability distributions expressible by finite-state models. Furthermore, we study the space complexity of representing finite-state LMs with RNNs. We show that, to represent an arbitrary deterministic finite-state LM with $N$ states over an alphabet $\Sigma$, an RNN requires $\Omega\left(N |\Sigma|\right)$ neurons. These results are a first step towards characterizing the classes of distributions RNN LMs can represent and thus help us understand their capabilities and limitations.
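To make the notion of a deterministic finite-state LM concrete, the following is a minimal sketch of a deterministic probabilistic finite-state automaton that assigns probabilities to strings. The two-state automaton, its weights, and the names `TRANSITIONS`, `FINAL`, `string_probability`, and `total_mass` are all hypothetical choices made for illustration; they are not taken from the paper. Note that the transition table holds up to $N |\Sigma|$ entries (here $2 \cdot 2 = 4$), the same quantity that appears in the stated bound.

```python
from itertools import product

# Hypothetical 2-state deterministic PFSA over the alphabet {"a", "b"}.
# TRANSITIONS[state][symbol] = (next_state, probability of that transition).
# Determinism: each (state, symbol) pair has at most one outgoing transition.
TRANSITIONS = {
    0: {"a": (0, 0.5), "b": (1, 0.25)},
    1: {"a": (0, 0.25), "b": (1, 0.25)},
}
# FINAL[state] = probability of ending the string in that state.
# In each state, transition and final probabilities sum to 1.
FINAL = {0: 0.25, 1: 0.5}
INITIAL_STATE = 0


def string_probability(string):
    """Return the probability the automaton assigns to `string`."""
    state, prob = INITIAL_STATE, 1.0
    for symbol in string:
        if symbol not in TRANSITIONS[state]:
            return 0.0  # symbol outside the alphabet: no accepting path
        state, weight = TRANSITIONS[state][symbol]
        prob *= weight
    return prob * FINAL[state]


def total_mass(max_length):
    """Sum the probabilities of all strings of length <= max_length."""
    return sum(
        string_probability("".join(chars))
        for n in range(max_length + 1)
        for chars in product("ab", repeat=n)
    )


if __name__ == "__main__":
    print(string_probability("ab"))
    print(total_mass(12))
```

Because the automaton is deterministic, each string follows a single path, so its probability is simply the product of the weights along that path times the final weight. Summing over all strings up to a growing length bound approaches 1, which is what makes this a distribution over strings rather than a recognizer of an unweighted language.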