Named Entity Recognition (NER) has seen significant progress in recent years, with numerous state-of-the-art (SOTA) models achieving high performance. However, very few studies have focused on generating the context of entities. In this paper, we introduce CONTEXT-NER, a task that aims to generate the relevant context for entities in a sentence, where the context is a phrase that describes the entity but is not necessarily present in the sentence. To facilitate research on this task, we also present the EDGAR10-Q dataset, which consists of annual and quarterly reports from the top 1500 publicly traded companies. The dataset is the largest of its kind, containing 1M sentences and 2.8M entities, with an average of 35 tokens per sentence, which makes it a challenging dataset. We propose a baseline approach that combines a phrase generation algorithm with inference using a 220M-parameter language model, achieving a ROUGE-L score of 27% on the test split. Additionally, we perform one-shot inference with ChatGPT, which obtains a ROUGE-L score of 30%, highlighting the difficulty of the dataset. We also evaluate models such as T5 and BART, which achieve a maximum ROUGE-L score of 49% after supervised finetuning on EDGAR10-Q. We further find that T5-large, when pre-finetuned on EDGAR10-Q, achieves SOTA results on downstream finance tasks such as Headline, FPB, and FiQA SA, outperforming the vanilla version by 10.81 points. To our surprise, this 66x smaller pre-finetuned model also surpasses the finance-specific LLM BloombergGPT-50B by 15 points. We hope that our dataset and generated artifacts will encourage further research in this direction, leading to the development of more sophisticated language models for financial text analysis.
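For readers unfamiliar with the metric reported above, ROUGE-L scores a generated phrase against a reference phrase by the length of their longest common subsequence (LCS) of tokens. The following is a minimal sketch, assuming whitespace tokenization, lowercasing, and the F1 variant of the score; the abstract does not state which ROUGE implementation or settings were used, and the function names and example phrases are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Dynamic programming over one row at a time to keep memory small.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between two strings (assumed: lowercase, split on whitespace)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference context and generated context for one entity.
score = rouge_l_f1("total revenue for the quarter",
                   "revenue for the third quarter")
print(f"ROUGE-L F1: {score:.2f}")
```

In this example the two phrases share the four-token subsequence "revenue for the quarter", so precision and recall are both 4/5. Published scores typically come from a standard package that may also apply stemming or a different tokenizer, so numbers from this sketch are not directly comparable to those reported above.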