Large language models (LLMs) are increasingly used to support the analysis of complex financial disclosures, yet their reliability, behavioral consistency, and transparency remain insufficiently understood in high-stakes settings. This paper presents a controlled evaluation of five transformer-based LLMs applied to question answering over the Business sections of U.S. 10-K filings. To capture complementary aspects of model behavior, we combine human evaluation, automated similarity metrics, and behavioral diagnostics under standardized and context-controlled prompting conditions. Human assessments indicate that models differ in their average performance across qualitative dimensions such as relevance, completeness, clarity, conciseness, and factual accuracy, though inter-rater agreement is modest, reflecting the subjective nature of these criteria. Automated metrics reveal systematic differences in lexical overlap and semantic similarity across models, while behavioral diagnostics highlight variation in response stability and cross-prompt alignment. Importantly, no single model consistently dominates across all evaluation perspectives. Together, these findings suggest that apparent performance differences should be interpreted as relative tendencies under the tested conditions rather than definitive indicators of general reliability. The results underscore the need for evaluation frameworks that account for human disagreement, behavioral variability, and interpretability when deploying LLMs in financially consequential applications.