Natural language processing (NLP) has been widely used in quantitative finance, but traditional methods often struggle to capture the rich narratives in corporate disclosures, leaving potentially informative signals under-explored. Large language models (LLMs) offer a promising alternative due to their ability to extract nuanced semantics. In this paper, we ask whether semantic signals extracted by LLMs from corporate disclosures predict alpha, defined as abnormal returns beyond broad market movements and common risk factors. We introduce a simple framework, LLM as extractor, embedding as ruler, which extracts context-aware, metric-focused textual spans and quantifies semantic changes across consecutive disclosure periods using embedding-based similarity. This allows us to measure the degree of metric shifting: how much firms move away from the metrics they previously emphasized, a behavior we refer to as moving targets. In portfolio and cross-sectional regression tests against a recent NER-based baseline, our method achieves more than twice the risk-adjusted alpha and shows significantly stronger predictive power. Qualitative analysis suggests that these gains stem from preserving contextual qualifiers and filtering out non-metric terms that keyword-based approaches often miss.
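The moving-targets score described above can be sketched as one minus the embedding similarity between the metric-focused spans of two consecutive disclosures. The sketch below is illustrative only: it substitutes a bag-of-words vector for the LLM-based extraction and neural sentence embedding the paper assumes, and the function names are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    # Placeholder embedding: a sparse bag-of-words vector. The actual method
    # would embed an LLM-extracted, metric-focused span with a sentence
    # encoder; this stand-in just makes the similarity logic runnable.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse vectors (token -> count).
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def metric_shift(span_prev, span_curr):
    # "Moving target" score: 1 - similarity of metric-focused spans from
    # consecutive disclosure periods. Higher means the firm has moved
    # further away from the metrics it previously emphasized.
    return 1.0 - cosine(embed(span_prev), embed(span_curr))

# Toy example: the firm stops guiding on revenue growth and pivots
# to an adjusted-profitability metric.
prev = "we target 15% revenue growth driven by cloud subscriptions"
curr = "we now emphasize adjusted EBITDA margin expansion"
score = metric_shift(prev, curr)  # in [0, 1]; larger = bigger shift
```

A firm repeating the same metric language yields a score near zero, while a wholesale change of emphasized metrics pushes the score toward one; cross-sectionally, these scores would then feed the portfolio sorts and regressions.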