We present Reddit Health Online Talk (RedHOT), a corpus of 22,000 richly annotated social media posts from Reddit spanning 24 health conditions. Annotations include demarcations of spans corresponding to medical claims, personal experiences, and questions. We collect additional granular annotations on identified claims. Specifically, we mark snippets that describe patient Populations, Interventions, and Outcomes (PIO elements) within them. Using this corpus, we introduce the task of retrieving trustworthy evidence relevant to a given claim made on social media. We propose a new method to automatically derive (noisy) supervision for this task, which we use to train a dense retrieval model; the resulting model outperforms baseline models. Manual evaluation of retrieval results performed by medical doctors indicates that while our system's performance is promising, there is considerable room for improvement. Collected annotations (and scripts to assemble the dataset) are available at https://github.com/sominw/redhot.