In this paper, we present ECL, a novel multi-modal dataset containing the textual and numerical data from corporate 10-K filings and associated binary bankruptcy labels. Furthermore, we develop and critically evaluate several classical and neural bankruptcy prediction models using this dataset. Our findings suggest that the information contained in each data modality is complementary for bankruptcy prediction. We also see that the binary bankruptcy prediction target does not enable our models to distinguish next-year bankruptcy from an unhealthy financial situation that results in bankruptcy in later years. Finally, we explore the use of LLMs in the context of our task. We show that GPT-based models can extract meaningful summaries from the textual data, but that their zero-shot bankruptcy prediction results are poor. All resources required to access and update the dataset or replicate our experiments are available at github.com/henriarnoUG/ECL.