ODD: A Benchmark Dataset for the NLP-based Opioid Related Aberrant Behavior Detection

Opioid-related aberrant behaviors (ORAB) present novel risk factors for opioid overdose. Previously, ORAB have mainly been assessed through survey results and by monitoring drug administration. Such methods, however, cannot scale up and do not cover the entire spectrum of aberrant behaviors. On the other hand, ORAB are widely documented in electronic health record (EHR) notes. This paper introduces a novel biomedical natural language processing benchmark dataset named ODD, for ORAB Detection Dataset. ODD is an expert-annotated dataset comprising more than 750 publicly available EHR notes. ODD has been designed to identify ORAB from patients' EHR notes and classify them into nine categories: 1) Confirmed Aberrant Behavior, 2) Suggested Aberrant Behavior, 3) Opioids, 4) Indication, 5) Diagnosed Opioid Dependency, 6) Benzodiazepines, 7) Medication Changes, 8) Central Nervous System-related, and 9) Social Determinants of Health. We explored two state-of-the-art natural language processing (NLP) approaches, fine-tuning pretrained language models and prompt-tuning, to identify ORAB. Experimental results show that the prompt-tuning models outperformed the fine-tuning models in most categories, and the gains were especially large in the uncommon categories (Suggested Aberrant Behavior, Diagnosed Opioid Dependency, and Medication Changes). Although the best model achieved an area under the precision-recall curve of 83.92%, these uncommon categories still leave substantial room for performance improvement.
