Large Language Models (LLMs) are increasingly explored, via zero-shot prompting, as flexible alternatives to classical machine learning models for classification tasks. However, their suitability for structured tabular data remains underexplored, especially in high-stakes applications such as financial risk assessment. This study systematically compares zero-shot LLM-based classifiers with LightGBM, a state-of-the-art gradient-boosting model, on a real-world loan default prediction task. We evaluate their predictive performance, analyze feature attributions using SHAP, and assess the reliability of LLM-generated self-explanations. While LLMs can identify key financial risk indicators, their feature importance rankings diverge notably from LightGBM's, and their self-explanations often fail to align with empirical SHAP attributions. These findings highlight the limitations of LLMs as standalone models for structured financial risk prediction and raise concerns about the trustworthiness of their self-generated explanations. Our results underscore the need for explainability audits, baseline comparisons with interpretable models, and human-in-the-loop oversight when deploying LLMs in risk-sensitive financial environments.
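To make the comparison setup concrete, below is a minimal Python sketch of the kind of pipeline the abstract describes: a LightGBM baseline whose feature importance is measured with SHAP, alongside serialization of a tabular record into a zero-shot classification prompt. The feature names, synthetic data, and prompt wording are illustrative assumptions, not the study's actual configuration.

```python
# Minimal sketch (assumed setup, not the paper's pipeline): a LightGBM
# baseline with empirical SHAP attributions, plus a zero-shot prompt
# built from the same tabular record for an LLM classifier.
import numpy as np
import lightgbm as lgb
import shap

rng = np.random.default_rng(0)
# Hypothetical loan-application features; the real dataset differs.
features = ["income", "debt_to_income", "credit_history_len", "loan_amount"]
X = rng.normal(size=(500, len(features)))
# Synthetic default labels loosely driven by the debt-to-income column.
y = (X[:, 1] + 0.3 * rng.normal(size=500) > 0.5).astype(int)

# Gradient-boosting baseline and its SHAP feature attributions.
model = lgb.LGBMClassifier(n_estimators=100).fit(X, y)
sv = shap.TreeExplainer(model).shap_values(X)
if isinstance(sv, list):   # older shap returns one array per class
    sv = sv[1]
sv = np.asarray(sv)
if sv.ndim == 3:           # some shap versions stack classes on the last axis
    sv = sv[..., -1]
mean_abs = np.abs(sv).mean(axis=0)
for name, score in sorted(zip(features, mean_abs), key=lambda p: -p[1]):
    print(f"{name}: {score:.3f}")   # empirical importance ranking

def zero_shot_prompt(row):
    """Serialize one applicant's tabular features into a zero-shot prompt."""
    profile = ", ".join(f"{n}={v:.2f}" for n, v in zip(features, row))
    return (
        "You are a credit risk analyst. Given the applicant profile below, "
        "answer 'default' or 'no default' and briefly explain which "
        f"features drove your decision.\nApplicant: {profile}"
    )

print(zero_shot_prompt(X[0]))  # prompt that would be sent to the LLM
```

Comparing the LLM's stated reasons against the `mean_abs` ranking above is one simple way to audit whether its self-explanations track empirical attributions.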