Although Large Language Models (LLMs) exhibit remarkable adaptability across domains, these models often fall short in structured knowledge extraction tasks such as named entity recognition (NER). This paper explores an innovative, cost-efficient strategy to harness LLMs with modest NER capabilities for producing superior NER datasets. Our approach diverges from the basic class-conditional prompts by instructing LLMs to self-reflect on the specific domain, thereby generating domain-relevant attributes (such as category and emotions for movie reviews), which are utilized for creating attribute-rich training data. Furthermore, we preemptively generate entity terms and then develop NER context data around these entities, effectively bypassing the LLMs' challenges with complex structures. Our experiments across both general and niche domains reveal significant performance enhancements over conventional data generation methods while being more cost-effective than existing alternatives.
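The pipeline described above (domain attributes first, then entity terms, then context written around those entities) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the attribute pools, the prompt wording, the helper names (`attribute_combinations`, `build_prompt`, `bio_tag`, `generate_example`), and the stand-in `fake_llm` are all hypothetical, and whitespace tokenization is used only to keep the example short.

```python
import itertools
from typing import Callable, Dict, List, Tuple

Entity = Tuple[str, str]  # (surface text, label)

# Hypothetical attribute pools. In the approach described above the LLM
# proposes these itself; they are hard-coded here for illustration.
ATTRIBUTE_POOLS: Dict[str, List[str]] = {
    "category": ["action", "comedy"],
    "emotion": ["excited", "disappointed"],
}


def attribute_combinations(pools: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Enumerate every combination of attribute values."""
    keys = list(pools)
    return [
        dict(zip(keys, values))
        for values in itertools.product(*(pools[k] for k in keys))
    ]


def build_prompt(domain: str, attrs: Dict[str, str], entities: List[Entity]) -> str:
    """Build an entity-first prompt: entities are fixed before the context is written."""
    attr_text = ", ".join(f"{k}={v}" for k, v in attrs.items())
    ent_text = ", ".join(f"{text} ({label})" for text, label in entities)
    return (
        f"Write one {domain} sentence with attributes [{attr_text}] "
        f"that mentions exactly these entities: {ent_text}."
    )


def bio_tag(tokens: List[str], entities: List[Entity]) -> List[str]:
    """Label the first exact token match of each known entity with BIO tags."""
    tags = ["O"] * len(tokens)
    for text, label in entities:
        ent_tokens = text.split()
        n = len(ent_tokens)
        for i in range(len(tokens) - n + 1):
            window_free = all(t == "O" for t in tags[i:i + n])
            if tokens[i:i + n] == ent_tokens and window_free:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"
                break
    return tags


def generate_example(
    llm: Callable[[str], str],
    domain: str,
    attrs: Dict[str, str],
    entities: List[Entity],
) -> Tuple[List[str], List[str]]:
    """Ask the model for context only; labels come from the known entities."""
    sentence = llm(build_prompt(domain, attrs, entities))
    tokens = sentence.split()  # naive tokenization, an assumption of this sketch
    return tokens, bio_tag(tokens, entities)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "I loved Christopher Nolan directing Inception"


entities = [("Christopher Nolan", "DIRECTOR"), ("Inception", "TITLE")]
attrs = attribute_combinations(ATTRIBUTE_POOLS)[0]
tokens, tags = generate_example(fake_llm, "movie review", attrs, entities)
```

Because the entity terms are chosen before the sentence is generated, the model only has to write free text, and the labels are recovered by string matching rather than by asking the model to emit a structured annotation. A real pipeline would need a more robust tokenizer and a fallback for sentences in which the model paraphrases or omits an entity.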