We present a methodology for extracting structured risk factors from corporate 10-K filings while adhering to a predefined hierarchical taxonomy. Our three-stage pipeline combines LLM extraction with supporting quotes, embedding-based semantic mapping to taxonomy categories, and LLM-as-a-judge validation that filters spurious assignments. To evaluate the approach, we extract 10,688 risk factors from S&P 500 companies and examine risk profile similarity across industry clusters. Beyond extraction, we introduce autonomous taxonomy maintenance, in which an AI agent analyzes evaluation feedback to identify problematic categories, diagnose failure patterns, and propose refinements; in a case study this achieves a 104.7% improvement in embedding separation. External validation confirms that the taxonomy captures economically meaningful structure: same-industry companies exhibit 63% higher risk profile similarity than cross-industry pairs (Cohen's d = 1.06, AUC 0.82, p < 0.001). The methodology generalizes to any domain requiring taxonomy-aligned extraction from unstructured text, and autonomous improvement enables continuous quality maintenance and enhancement as systems process more documents.
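As a rough illustration of the external validation described above, the following sketch compares risk profile similarity for same-industry and cross-industry company pairs and reports an effect size. It rests on assumptions not stated in the abstract: each company's risk profile is taken to be a vector of counts over taxonomy categories, similarity is taken to be cosine similarity, and Cohen's d uses the standard pooled standard deviation. The company names, industry labels, and numbers are toy placeholders, not data from the study.

```python
import math
from itertools import combinations
from statistics import mean, variance


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cohens_d(xs, ys):
    """Cohen's d with a pooled sample standard deviation."""
    nx, ny = len(xs), len(ys)
    pooled_var = ((nx - 1) * variance(xs) + (ny - 1) * variance(ys)) / (nx + ny - 2)
    return (mean(xs) - mean(ys)) / math.sqrt(pooled_var)


def split_pair_similarities(profiles, industry):
    """Return (same_industry_sims, cross_industry_sims) over all company pairs."""
    same, cross = [], []
    for a, b in combinations(sorted(profiles), 2):
        sim = cosine(profiles[a], profiles[b])
        (same if industry[a] == industry[b] else cross).append(sim)
    return same, cross


# Toy risk profiles: counts of extracted risk factors per (hypothetical) taxonomy category.
profiles = {
    "A": [3, 1, 0, 0],
    "B": [2, 1, 0, 1],
    "C": [4, 0, 1, 0],
    "D": [0, 0, 3, 2],
    "E": [0, 1, 2, 3],
    "F": [1, 0, 4, 2],
}
industry = {"A": "tech", "B": "tech", "C": "tech",
            "D": "energy", "E": "energy", "F": "energy"}

same, cross = split_pair_similarities(profiles, industry)
effect_size = cohens_d(same, cross)
relative_lift = mean(same) / mean(cross) - 1.0  # fractional increase of same over cross

print(f"same-industry mean similarity:  {mean(same):.3f}")
print(f"cross-industry mean similarity: {mean(cross):.3f}")
print(f"Cohen's d: {effect_size:.2f}, relative lift: {relative_lift:.0%}")
```

A positive effect size here indicates that same-industry pairs are more similar on average than cross-industry pairs, which is the qualitative pattern the abstract reports; the magnitudes on this toy data carry no meaning.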