Large language models are increasingly used for financial analysis and investment research, yet systematic evaluation of their financial reasoning capabilities remains limited. In this work, we introduce the AI Financial Intelligence Benchmark (AFIB), a multi-dimensional evaluation framework that assesses financial analysis capability along five dimensions: factual accuracy, analytical completeness, data recency, model consistency, and failure patterns. Using a dataset of more than 95 structured financial analysis questions derived from real-world equity research tasks, we evaluate five AI systems: GPT, Gemini, Perplexity, Claude, and SuperInvesting. The results reveal substantial performance differences across models. Within this benchmark setting, SuperInvesting achieves the highest aggregate performance, with an average factual accuracy score of 8.96/10 and the highest completeness score of 56.65/70, while also demonstrating the lowest hallucination rate among the evaluated systems. Retrieval-oriented systems such as Perplexity perform strongly on data recency tasks thanks to live information access, but exhibit weaker analytical synthesis and consistency. Overall, these results underscore that financial intelligence in large language models is inherently multi-dimensional, and that systems combining structured financial data access with analytical reasoning capabilities deliver the most reliable performance for complex investment research workflows.