Municipal meeting minutes are official documents of local governance whose formats and writing styles vary widely. Effective information retrieval (IR) over these documents requires identifying metadata such as meeting number, date, location, participants, and start and end times, elements that are rarely standardized and are hard to extract automatically. Existing named entity recognition (NER) models are ill-suited to this task because they are not adapted to such domain-specific categories. In this paper, we propose a two-stage pipeline for metadata extraction from municipal minutes. First, a question answering (QA) model identifies the opening and closing text segments that contain the metadata. Transformer-based models (BERTimbau and XLM-RoBERTa, each with and without a CRF layer) are then applied for fine-grained entity extraction and enhanced through delexicalization. To evaluate the proposed pipeline, we benchmark it against both open-weight (Phi) and closed-weight (Gemini) LLMs, assessing predictive performance, inference cost, and carbon footprint. Our results demonstrate strong in-domain performance, exceeding that of larger general-purpose LLMs. However, cross-municipality evaluation reveals reduced generalization, reflecting the variability and linguistic complexity of municipal records. This work establishes the first benchmark for metadata extraction from municipal meeting minutes, providing a solid foundation for future research in this domain.
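The delexicalization step mentioned above can be illustrated with a minimal sketch. The abstract does not specify the scheme, so everything here is an assumption made for illustration: the placeholder tokens `<NUM>` and `<MONTH>`, the helper name `delexicalize`, and the choice of Portuguese month names (BERTimbau is a Portuguese-language model). The idea shown is only the general one: surface forms that vary from document to document are replaced with generic placeholders, one for one, so that token-level entity labels stay aligned.

```python
import re

# Hypothetical placeholder tokens; the paper's actual scheme is not given.
NUM = "<NUM>"
MONTH = "<MONTH>"

# Assumption: Portuguese-language minutes, hence Portuguese month names.
MONTHS_PT = {
    "janeiro", "fevereiro", "março", "abril", "maio", "junho",
    "julho", "agosto", "setembro", "outubro", "novembro", "dezembro",
}


def delexicalize(tokens):
    """Replace digit-only tokens and month names with placeholders.

    The output has the same length as the input, so per-token
    labels (e.g. BIO tags) remain aligned with the tokens.
    """
    out = []
    for tok in tokens:
        if re.fullmatch(r"\d+", tok):
            out.append(NUM)
        elif tok.lower() in MONTHS_PT:
            out.append(MONTH)
        else:
            out.append(tok)
    return out


# Toy example sentence, invented for illustration only.
tokens = "Ata n.º 12 de 3 de Março de 2021".split()
delexicalized = delexicalize(tokens)
print(delexicalized)
```

Because the replacement is one token for one token, the delexicalized sequence could be fed to a token classifier in place of the original without re-aligning any annotations.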