Ask the experts: sourcing high-quality datasets for nutritional counselling through Human-AI collaboration

Large Language Models (LLMs), with their flexible generation abilities, can be powerful data sources in domains with few or no available corpora. However, problems like hallucinations and biases limit such applications. In this case study, we pick nutrition counselling, a domain lacking any public resource, and show that high-quality datasets can be gathered by combining LLMs, crowd workers and nutrition experts. We first crowd-source and cluster a novel dataset of diet-related issues, then work with experts to prompt ChatGPT into producing related supportive text. Finally, we let the experts evaluate the safety of the generated text. We release HAI-coaching, the first expert-annotated nutrition counselling dataset, containing ~2.4K dietary struggles from crowd workers and ~97K related supportive texts generated by ChatGPT. Extensive analysis shows that ChatGPT, while producing highly fluent and human-like text, also manifests harmful behaviours, especially on sensitive topics like mental health, making it unsuitable for unsupervised use.

