As the capabilities of large language models (LLMs) increase at a staggering pace, creating future-proof evaluation sets to assess their understanding becomes increasingly challenging. In this paper, we propose a novel paradigm for evaluating LLMs that builds on the idea that correct world understanding should be consistent across different (Fregean) senses of the same meaning. Accordingly, we measure understanding not in terms of correctness but in terms of consistency across multiple senses that are generated by the model itself. We showcase our approach by instantiating a test in which the different senses are different languages, thus using multilingual self-consistency as a litmus test for the model's understanding while also addressing the important topic of multilingualism. Taking one of the latest versions of ChatGPT as our object of study, we evaluate multilingual consistency on two different tasks across three different languages. We show that its multilingual consistency is still lacking, and that its task and world understanding are thus not language-independent. Because our approach does not require static evaluation corpora in any language other than English, it can be extended easily and cheaply to other languages and tasks, and could become an integral part of future benchmarking efforts.
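The comparison step behind multilingual self-consistency can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the model's answers have already been collected per language and mapped to a shared label space, and the function name `pairwise_consistency`, the language codes, and the toy answers are all hypothetical. The step in which the model itself generates the other-language versions of each item is not shown.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def pairwise_consistency(answers: Dict[str, List[str]]) -> Dict[Tuple[str, str], float]:
    """Return, for each pair of languages, the fraction of items with identical answers.

    `answers` maps a language code to the model's answers for the same items,
    in the same order. Correctness is never consulted; only agreement is.
    """
    scores: Dict[Tuple[str, str], float] = {}
    for lang_a, lang_b in combinations(sorted(answers), 2):
        xs, ys = answers[lang_a], answers[lang_b]
        if len(xs) != len(ys) or not xs:
            raise ValueError("each language needs the same non-zero number of answers")
        agreements = sum(x == y for x, y in zip(xs, ys))
        scores[(lang_a, lang_b)] = agreements / len(xs)
    return scores


# Hypothetical toy data: four items answered in three languages.
toy_answers = {
    "en": ["yes", "no", "yes", "yes"],
    "de": ["yes", "no", "no", "yes"],
    "zh": ["yes", "yes", "no", "yes"],
}

scores = pairwise_consistency(toy_answers)
for pair, score in scores.items():
    print(pair, score)
```

A score of 1.0 for a pair would mean the model answered every item identically in both languages. Lower scores indicate that its answers depend on the language of the prompt, regardless of which answer is correct.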