We evaluate the reliability of two chatbots, ChatGPT (in its 4o and o1-preview versions) and Gemini Advanced, in providing references to the finance literature, employing novel methodologies to do so. Alongside the conventional binary approach common in the literature, we develop a nonbinary approach and a recency measure to assess how hallucination rates vary with how recent a topic is. Across 150 citations, ChatGPT-4o had a hallucination rate of 20.0% (95% CI, 13.6%-26.4%) and o1-preview a rate of 21.3% (95% CI, 14.8%-27.9%). In contrast, Gemini Advanced exhibited a much higher hallucination rate of 76.7% (95% CI, 69.9%-83.4%). While hallucination rates rose for more recent topics, this trend was not statistically significant for Gemini Advanced. These findings underscore the importance of verifying chatbot-provided references, particularly in rapidly evolving fields.
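The abstract does not state which interval method underlies the reported confidence bounds, but they are consistent with a standard normal-approximation (Wald) interval for a proportion on n = 150. A minimal sketch, assuming the ChatGPT-4o rate of 20.0% corresponds to 30 hallucinated citations out of 150:

```python
import math

def wald_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a proportion.

    z = 1.96 gives the conventional 95% level.
    """
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

# ChatGPT-4o: assumed 30 of 150 citations hallucinated (20.0%)
lo, hi = wald_ci(30 / 150, 150)
print(f"{lo:.1%} - {hi:.1%}")  # prints "13.6% - 26.4%"
```

Plugging in 32/150 (21.3%) and 115/150 (76.7%) reproduces the o1-preview and Gemini Advanced intervals in the same way, which suggests all three bounds come from this approximation rather than an exact binomial method.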