Large Language Models (LLMs) such as ChatGPT and Bard have transformed information retrieval and captured public attention with their ability to generate tailored responses in very little time, whatever the topic. In this article, we assess how well various LLMs produce reliable, comprehensive, and sufficiently relevant answers to questions about historical facts in French. To this end, we constructed a testbed of numerous history-related questions of varying types, themes, and levels of difficulty. Our evaluation of responses from ten selected LLMs reveals numerous shortcomings in both substance and form. Beyond an overall insufficient accuracy rate, we highlight uneven treatment of the French language, as well as verbosity and inconsistency in the responses the LLMs provide.