This paper introduces a computational framework designed to delineate gender distribution biases across the topics covered by French TV and radio news. We transcribe a dataset of 11.7k hours of content broadcast in 2023 on 21 French channels. A Large Language Model (LLM) is used in few-shot conversation mode to classify these transcriptions by topic. Using the LLM-generated annotations, we explore fine-tuning a smaller, specialized classification model to reduce the computational cost. To evaluate the performance of these models, we construct and annotate a dataset of 804 dialogues, which is made available free of charge for research purposes. We show that women are notably underrepresented in subjects such as sports, politics and conflicts. Conversely, on topics such as weather, commercials and health, women have more speaking time than their overall average across all subjects. We also observe differences in representation between private and public service channels.