LLMs are ubiquitous in modern NLP, and while they can be applied to texts produced in democratic activities such as online deliberations or large-scale citizen consultations, their use as analysis tools raises ethical questions. We continue this line of research with two main goals: (a) to develop resources that help standardize citizen contributions in public forums at the pragmatic level, making them easier to use in topic modeling and political analysis; and (b) to study how reliably this standardization can be performed by small, open-weights LLMs, i.e., models that can be run locally and transparently with limited resources. Accordingly, we introduce Corpus Clarification, a preprocessing framework for large-scale consultation data that transforms noisy, multi-topic contributions into structured, self-contained argumentative units ready for downstream analysis. We present GDN-CC, a manually curated dataset of 1,231 contributions to the French Grand Débat National, comprising 2,285 argumentative units annotated for argumentative structure and manually clarified. We then show that fine-tuned Small Language Models match or outperform LLMs at reproducing these annotations, and we measure their usability on an opinion clustering task. Finally, we release GDN-CC-large, an automatically annotated corpus of 240k contributions, the largest annotated democratic consultation dataset to date.