Municipal meeting minutes are formal records documenting the discussions and decisions of local government, yet they are often lengthy, dense, and difficult for citizens to navigate. Automatic summarization can help address this challenge by producing a concise summary for each discussion subject. Despite this potential, the summarization of discussion subjects in municipal meeting minutes remains largely unexplored, especially in low-resource languages, where the inherent complexity of these documents adds further challenges. A major bottleneck is the scarcity of datasets containing high-quality, manually written summaries, which limits the development and evaluation of effective summarization models for this domain. In this paper, we present CitiLink-Summ, a new corpus of European Portuguese municipal meeting minutes comprising 100 documents and 2,322 manually written summaries, each corresponding to a distinct discussion subject. Using this dataset, we establish baseline results for automatic summarization in this domain with state-of-the-art generative models (e.g., BART, PRIMERA) and large language models (LLMs), evaluated with both lexical and semantic metrics such as ROUGE, BLEU, METEOR, and BERTScore. CitiLink-Summ provides the first benchmark for municipal-domain summarization in European Portuguese, offering a valuable resource for advancing NLP research on complex administrative texts.