Contaminated or adulterated food poses a substantial risk to human health. Given sets of labeled web texts for training, Machine Learning and Natural Language Processing can be applied to detect such risks automatically. We publish a dataset of 7,546 short texts describing public food recall announcements. Each text is manually labeled, at two levels of granularity (coarse and fine), with the food product and the hazard that the recall concerns. We describe the dataset and benchmark naive, traditional, and Transformer models. Our analysis shows that Logistic Regression on a tf-idf representation outperforms RoBERTa and XLM-R on classes with low support. Finally, we discuss different prompting strategies and present an LLM-in-the-loop framework, based on Conformal Prediction, which boosts the performance of the base classifier while consuming less energy than standard prompting.
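To illustrate how Conformal Prediction can decide when an LLM needs to be consulted, the following is a minimal sketch using split conformal prediction. It is not the implementation behind the framework described above. The function names, the `base_model` and `ask_llm` callables, the nonconformity score (one minus the probability of the true class), and the handling of empty prediction sets are all assumptions made for illustration. The idea is that the base classifier answers alone whenever its conformal prediction set contains a single label, and the LLM is prompted only for the remaining ambiguous texts, which is one plausible way to reduce the number of LLM calls.

```python
import math
from typing import Callable, Dict, List, Sequence

Probs = Dict[str, float]  # class label -> probability from the base classifier


def conformal_threshold(cal_probs: Sequence[Probs],
                        cal_labels: Sequence[str],
                        alpha: float = 0.1) -> float:
    """Split-conformal threshold on the scores 1 - p(true label)."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank of the (1 - alpha) quantile.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: keep every label.
        return 1.0
    return scores[rank - 1]


def prediction_set(probs: Probs, qhat: float) -> List[str]:
    """All labels whose nonconformity score does not exceed the threshold."""
    return sorted(label for label, p in probs.items() if 1.0 - p <= qhat)


def classify(text: str,
             base_model: Callable[[str], Probs],
             ask_llm: Callable[[str, List[str]], str],
             qhat: float) -> str:
    """Use the base classifier when it is confident, otherwise ask the LLM.

    `base_model` and `ask_llm` are hypothetical callables supplied by the
    caller; `ask_llm` receives the text and the candidate labels to pick from.
    """
    probs = base_model(text)
    # Assumption: an empty set means "no label is plausible", so the LLM
    # is offered the full label inventory instead.
    candidates = prediction_set(probs, qhat) or sorted(probs)
    if len(candidates) == 1:
        return candidates[0]  # no LLM call needed
    return ask_llm(text, candidates)
```

In use, `conformal_threshold` would be computed once on a held-out calibration split, and `classify` would then be applied to each new recall text. A smaller `alpha` yields larger prediction sets and therefore more LLM calls.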