Foodborne illnesses significantly impact public health. Deep learning surveillance applications that use social media data aim to detect early warning signals. However, labeling foodborne illness-related tweets for model training requires extensive human resources, which makes it difficult to collect enough high-quality labels within a limited budget. Severe class imbalance compounds the problem: foodborne illness-related tweets are scarce among the vast volume of social media posts, and classifiers trained on a class-imbalanced dataset are biased towards the majority class, making accurate detection difficult. To overcome these challenges, we propose EGAL, a deep learning framework for foodborne illness detection that uses a small set of expert-labeled tweets augmented by crowdsourced-labeled tweets and massive unlabeled data. Specifically, by leveraging the expert-labeled tweets as a reward set, EGAL learns to assign a weight of zero to incorrectly labeled tweets to mitigate their negative influence. The remaining tweets receive proportionate weights that counterbalance the imbalanced class distribution. Extensive experiments on the real-world \textit{TWEET-FID} data show that EGAL outperforms strong baseline models across different settings, including varying expert-labeled set sizes and class imbalance ratios. A case study on a multistate outbreak of Salmonella Typhimurium infection linked to packaged salad greens demonstrates how the trained model captures relevant tweets that offer valuable outbreak insights. EGAL, funded by the U.S. Department of Agriculture (USDA), has the potential to be deployed for real-time analysis of tweet streams, contributing to foodborne illness outbreak surveillance efforts.
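The weighting idea described in the abstract (zero weight for incorrectly labeled tweets, proportionate weights for the rest to offset class imbalance) can be illustrated with a small sketch. This is not the EGAL implementation: EGAL learns the weights using the expert-labeled reward set, whereas here a hand-supplied `suspected_noisy` flag stands in for that learned decision, and the inverse-class-frequency rule is an assumed, commonly used "balanced" heuristic. The function names `example_weights` and `weighted_cross_entropy` are hypothetical.

```python
import math
from collections import Counter


def example_weights(labels, suspected_noisy):
    """Per-example weights: 0 for flagged examples, inverse class frequency otherwise.

    ASSUMPTION: `suspected_noisy` is given by hand here; in EGAL the decision
    to zero out a tweet is learned from the expert-labeled reward set.
    """
    kept = [y for y, noisy in zip(labels, suspected_noisy) if not noisy]
    counts = Counter(kept)
    n_classes = len(counts)
    total = len(kept)
    weights = []
    for y, noisy in zip(labels, suspected_noisy):
        if noisy:
            weights.append(0.0)  # a flagged example contributes nothing to the loss
        else:
            # Rarer classes receive proportionally larger weights.
            weights.append(total / (n_classes * counts[y]))
    return weights


def weighted_cross_entropy(probs_positive, labels, weights):
    """Binary cross-entropy where each example's term is scaled by its weight."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, y, w in zip(probs_positive, labels, weights):
        p_true = p if y == 1 else 1.0 - p
        total += -w * math.log(max(p_true, eps))
    norm = sum(weights)
    return total / norm if norm > 0 else 0.0


# Toy data: 1 = foodborne-illness tweet (minority), 0 = unrelated tweet.
labels = [1, 0, 0, 0, 1]
suspected_noisy = [False, False, False, False, True]  # last label assumed wrong
weights = example_weights(labels, suspected_noisy)
loss = weighted_cross_entropy([0.9, 0.2, 0.1, 0.3, 0.05], labels, weights)
```

In this toy setting the single trusted positive tweet carries as much total weight as the three negative tweets combined, and the flagged tweet has no effect on the loss regardless of the model's prediction for it.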