This study provides the first comprehensive assessment of consistency and reproducibility in Large Language Model (LLM) outputs in finance and accounting research. We evaluate how consistently LLMs produce outputs given identical inputs through extensive experimentation with 50 independent runs across five common tasks: classification, sentiment analysis, summarization, text generation, and prediction. Using three OpenAI models (GPT-3.5-turbo, GPT-4o-mini, and GPT-4o), we generate over 3.4 million outputs from diverse financial source texts and data, covering MD&As, FOMC statements, finance news articles, earnings call transcripts, and financial statements. Our findings reveal substantial but task-dependent consistency: binary classification and sentiment analysis achieve near-perfect reproducibility, while more complex tasks show greater variability. More advanced models do not uniformly deliver greater consistency and reproducibility; instead, task-specific patterns emerge. LLMs substantially outperform expert human annotators in consistency and maintain high agreement even on items where human experts strongly disagree. We further find that simple aggregation strategies across 3-5 runs dramatically improve consistency. Simulation analysis reveals that, despite measurable inconsistency in LLM outputs, downstream statistical inferences remain remarkably robust. These findings address concerns about what we term "G-hacking," the selective reporting of favorable outcomes from multiple Generative AI runs, by demonstrating that such risks are relatively low for finance and accounting tasks.
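As a rough illustration of the repeated-run aggregation idea described above, the sketch below queries the same prompt several times and takes a majority vote over the returned labels. The prompt wording, model choice, and helper names are assumptions for illustration only, not the paper's actual implementation.

```python
# Hypothetical sketch of repeated-run aggregation (majority vote); not the authors' code.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_sentiment(text: str, model: str = "gpt-4o-mini") -> str:
    """Single run: ask the model for a one-word sentiment label."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Classify the sentiment of the text as "
                                          "'positive', 'negative', or 'neutral'. "
                                          "Reply with one word only."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip().lower()

def aggregate_runs(text: str, n_runs: int = 5) -> str:
    """Repeat the classification n_runs times and return the majority label."""
    labels = [classify_sentiment(text) for _ in range(n_runs)]
    return Counter(labels).most_common(1)[0][0]

if __name__ == "__main__":
    excerpt = "Revenue grew 12% year over year, exceeding analyst expectations."
    print(aggregate_runs(excerpt, n_runs=5))
```

Aggregating over a handful of runs in this way is one simple realization of the 3-5 run strategy the abstract reports as dramatically improving consistency.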