Chatbots put to the test in math and logic problems: A preliminary comparison and assessment of ChatGPT-3.5, ChatGPT-4, and Google Bard

This paper compares three chatbots based on large language models, namely ChatGPT-3.5, ChatGPT-4, and Google Bard, focusing on their ability to give correct answers to mathematics and logic problems. In particular, we check their ability to (a) understand the problem at hand; (b) apply appropriate algorithms or methods for its solution; and (c) generate a coherent response and a correct answer. We use 30 questions that are clear and unambiguous, fully described in plain text only, and have a unique, well-defined correct answer. The questions are divided into two sets of 15 each: Set A contains 15 "Original" problems that cannot be found online, while Set B contains 15 "Published" problems that can be found online, usually with their solutions. Each question is posed three times to each chatbot. The answers are recorded and discussed, highlighting the chatbots' strengths and weaknesses. We find that for straightforward arithmetic, algebraic expressions, or basic logic puzzles, chatbots may provide accurate solutions, although not in every attempt. For more complex mathematical problems or advanced logic tasks, however, their answers, although usually written in a "convincing" way, may not be reliable. Consistency is also an issue: a chatbot often gives conflicting answers when asked the same question more than once. A comparative quantitative evaluation of the three chatbots is made by scoring their final answers for correctness. ChatGPT-4 outperforms ChatGPT-3.5 in both sets of questions. Bard comes third in the original questions of Set A, behind the other two chatbots, but has the best performance (first place) in the published questions of Set B. This is probably because Bard has direct access to the internet, whereas the ChatGPT chatbots have no communication with the outside world.
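The repeated-attempt protocol described above (each question posed three times, final answers scored for correctness, consistency noted) can be illustrated with a minimal Python sketch. The abstract does not specify the scoring rule, so the scheme below is an assumption: each question earns the fraction of attempts whose final answer matches the reference, and a question counts as consistent when all attempts give the same final answer. The function names `score_attempts` and `score_set` are hypothetical, and the demo data is made up for illustration, not taken from the study.

```python
def score_attempts(answers, correct):
    """Score one question's repeated attempts.

    Assumed scheme (not from the paper): score is the fraction of attempts
    whose final answer equals the reference answer; the question is
    'consistent' if every attempt produced the same final answer.
    """
    n_correct = sum(1 for a in answers if a == correct)
    consistent = len(set(answers)) == 1
    return n_correct / len(answers), consistent


def score_set(records):
    """Aggregate over a question set.

    records: list of (answers, correct) pairs, one pair per question.
    Returns (total score, number of consistent questions).
    """
    results = [score_attempts(answers, correct) for answers, correct in records]
    total = sum(score for score, _ in results)
    n_consistent = sum(1 for _, ok in results if ok)
    return total, n_consistent


# Made-up illustrative data: three questions, three attempts each.
demo = [
    (["12", "12", "12"], "12"),  # always correct, consistent
    (["7", "9", "7"], "7"),      # correct in two of three attempts
    (["3", "4", "5"], "6"),      # never correct, inconsistent
]

total, n_consistent = score_set(demo)
print(f"total score: {total:.2f} out of {len(demo)}")
print(f"consistent questions: {n_consistent} of {len(demo)}")
```

Reporting consistency separately from correctness keeps apart two failure modes the abstract distinguishes: answers that are wrong, and answers that change between attempts.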
