"Citizen queries" are questions asked by individuals about government policies, guidance, and services relevant to their circumstances, spanning topics such as benefits, taxes, immigration, employment, and public health. This represents a compelling use case for Large Language Models (LLMs): responding to citizen queries with information that is adapted to a user's context and communicated according to their needs. However, in this use case, any misinformation could have severe, negative, and often invisible ramifications for an individual placing their trust in a model's response. To this end, we introduce CitizenQuery-UK, a benchmark dataset of 22,000 pairs of citizen queries and responses that have been synthetically generated from the swathes of public information about UK government on gov.uk. We present the curation methodology behind CitizenQuery-UK and an overview of its contents. We also introduce a methodology for benchmarking LLMs with the dataset, using an adaptation of FActScore to evaluate 11 models for factuality, abstention frequency, and verbosity. We document these results and interpret them in the context of the public sector, finding that: (i) there are distinct performance profiles across model families, but each is competitive; (ii) high variance undermines utility; (iii) abstention is low and verbosity is high, with implications for reliability; and (iv) more trustworthy AI requires acknowledged "fallibility" in the way it interacts with users. The contribution of our research lies in assessing the trustworthiness of LLMs on citizen query tasks; as AI becomes increasingly integrated into day-to-day life, our benchmark, built entirely on open data, lays the foundations for better-evidenced decision-making regarding AI and the public sector.