Intrinsic evaluation metrics for conditional language models (CLMs), such as perplexity or bits-per-character, are widely used in both mono- and multilingual settings. These metrics are straightforward to use and compare in monolingual setups, but rest on a number of assumptions in multilingual setups. One such assumption is that comparing the perplexity of CLMs on parallel sentences is indicative of their quality, since the information content (here understood as the semantic meaning) is the same. However, these metrics inherently measure information content in the information-theoretic sense. We make this and other such assumptions explicit and discuss their implications. We perform experiments with six metrics on two multi-parallel corpora, with both mono- and multilingual models. Ultimately, we find that current metrics are not universally comparable. We look at the form-meaning debate to provide some explanation for this.
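As a minimal illustrative sketch (not the paper's evaluation code, and with made-up log-probabilities), the two metrics named above can be computed from per-token model scores as follows; note that perplexity normalizes by token count while bits-per-character normalizes by character count, which is one reason the two can rank languages differently when tokenization varies across languages:

```python
import math

def ppl_and_bpc(token_logprobs, num_chars):
    """Compute per-token perplexity and bits-per-character.

    token_logprobs: natural-log probabilities assigned by a CLM
                    to each token of a sentence (hypothetical values here).
    num_chars:      character length of the same sentence.
    """
    total_nll = -sum(token_logprobs)                  # total negative log-likelihood, in nats
    ppl = math.exp(total_nll / len(token_logprobs))   # perplexity: exp of mean per-token NLL
    bpc = total_nll / math.log(2) / num_chars         # convert nats to bits, normalize by characters
    return ppl, bpc

# Toy example: a 4-token sentence of 20 characters.
logprobs = [-2.1, -0.8, -3.0, -1.5]
ppl, bpc = ppl_and_bpc(logprobs, 20)
```

Because the per-token and per-character denominators differ, two models scoring parallel sentences in different languages can agree on total negative log-likelihood yet disagree on either metric, which is the kind of comparability issue examined above.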