As the capabilities of large language models (LLMs) increase at a staggering pace, creating future-proof evaluation sets to assess their understanding becomes increasingly challenging. In this paper, we propose a novel paradigm for evaluating LLMs that leverages the idea that correct world understanding should be consistent across different (Fregean) senses of the same meaning. Accordingly, we measure understanding not in terms of correctness but by evaluating consistency across multiple senses that are generated by the model itself. We showcase our approach by instantiating a test in which the different senses are different languages, thereby using multilingual self-consistency as a litmus test for the model's understanding while also addressing the important topic of multilinguality. Taking one of the latest versions of ChatGPT as our object of study, we evaluate multilingual consistency for two different tasks across three different languages. We show that its multilingual consistency is still lacking, and that its task and world understanding are thus not language-independent. As our approach does not require any static evaluation corpora in languages other than English, it can easily and cheaply be extended to different languages and tasks and could become an integral part of future benchmarking efforts.
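To make the idea of multilingual self-consistency concrete, the following is a minimal sketch and not the evaluation code used in this work. It assumes a task whose answers come from a fixed label set, so that answers to an English prompt and to its translation can be compared directly. The names `self_consistency`, `normalize`, `toy_model`, and `toy_translate` are hypothetical; in the setup described above, the translation step would be carried out by the model under evaluation rather than by a separate system.

```python
from typing import Callable, Sequence


def normalize(answer: str) -> str:
    """Strip whitespace and case so trivially different labels compare equal."""
    return answer.strip().lower()


def self_consistency(
    model: Callable[[str], str],
    translate: Callable[[str, str], str],
    prompts_en: Sequence[str],
    target_lang: str,
) -> float:
    """Fraction of prompts answered identically in English and in target_lang.

    Correctness is never checked: only agreement between the two answers counts.
    """
    if not prompts_en:
        raise ValueError("need at least one prompt")
    agreements = 0
    for prompt in prompts_en:
        answer_en = model(prompt)
        # Assumption: translate() wraps the same model, so no external corpus is needed.
        answer_tgt = model(translate(prompt, target_lang))
        agreements += normalize(answer_en) == normalize(answer_tgt)
    return agreements / len(prompts_en)


# Hypothetical stand-ins so the sketch runs without any API access.
def toy_translate(prompt: str, lang: str) -> str:
    return f"[{lang}] {prompt}"


def toy_model(prompt: str) -> str:
    if "sky" in prompt:
        return " Yes"  # same answer in both languages
    return "yes" if prompt.startswith("[") else "no"  # answer flips with language


if __name__ == "__main__":
    prompts = ["Is the sky blue?", "Is fire cold?"]
    print(self_consistency(toy_model, toy_translate, prompts, "de"))  # prints 0.5
```

A score of 1.0 would mean the model gives the same answer regardless of the language of the prompt, while lower scores indicate that its behaviour depends on the language. Only English prompts are needed as input, which is why such a measure is cheap to extend to further languages.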