This paper introduces GELATO (Government, Executive, Legislative, and Treaty Ontology), a dataset of U.S. House and Senate bills from the 118th Congress annotated with a novel two-level named entity recognition (NER) ontology designed for U.S. legislative texts. We fine-tune transformer-based models (BERT, RoBERTa) of different architectures and sizes on this dataset for first-level prediction, then use LLMs with optimized prompts to complete the second-level prediction. The strong performance of RoBERTa, the relatively weak performance of the BERT models, and the effectiveness of LLMs as second-level predictors support future research in legislative NER and downstream tasks using these model combinations as extraction tools.