Extracting drug use information from unstructured Electronic Health Records remains a major challenge in clinical Natural Language Processing (NLP). Although Large Language Models have brought advances, their use in clinical NLP is limited by concerns over trust, control, and efficiency. To address this, we present the NOWJ submission to the ToxHabits Shared Task at BioCreative IX. The task targets the detection of toxic substance use and its contextual attributes in Spanish clinical texts, a domain-specific, low-resource setting. We propose a multi-output ensemble system that tackles both Subtask 1 (ToxNER) and Subtask 2 (ToxUse). Our system combines BETO with a CRF layer for sequence labeling, employs diverse training strategies, and uses sentence filtering to boost precision. Our top run achieved 0.94 F1 and 0.97 precision for Trigger Detection, and 0.91 F1 for Argument Detection.
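
The abstract does not say how the ensemble combines its members' outputs or how sentences are filtered, so the following is only a minimal sketch under two assumptions: span predictions are merged by voting, and a separate per-sentence score (hypothetical here) decides whether a sentence's spans are kept. The names `Span`, `vote`, `filter_sentences`, and the labels and scores in the example are illustrative and not taken from the system itself.

```python
from collections import Counter
from typing import NamedTuple


class Span(NamedTuple):
    sent_id: int  # index of the sentence containing the span
    start: int    # character offset, inclusive
    end: int      # character offset, exclusive
    label: str    # entity label (illustrative values below)


def vote(runs: list[list[Span]], min_votes: int) -> list[Span]:
    """Keep spans predicted by at least `min_votes` ensemble members."""
    # set(run) ensures each member votes at most once per span.
    counts = Counter(span for run in runs for span in set(run))
    return sorted(s for s, n in counts.items() if n >= min_votes)


def filter_sentences(
    spans: list[Span], sent_scores: dict[int, float], threshold: float
) -> list[Span]:
    """Drop spans from sentences scoring below `threshold` (assumed filter)."""
    # Sentences without a score are treated as scoring 0.0 and are dropped.
    return [s for s in spans if sent_scores.get(s.sent_id, 0.0) >= threshold]


# Three hypothetical model runs over the same document.
runs = [
    [Span(0, 10, 16, "Tobacco"), Span(2, 5, 12, "Alcohol")],
    [Span(0, 10, 16, "Tobacco"), Span(2, 5, 12, "Alcohol")],
    [Span(0, 10, 16, "Tobacco"), Span(1, 0, 4, "Drug")],
]
voted = vote(runs, min_votes=2)

# Hypothetical sentence-level relevance scores.
sent_scores = {0: 0.93, 1: 0.40, 2: 0.20}
kept = filter_sentences(voted, sent_scores, threshold=0.5)
print(kept)
```

In this sketch, voting discards the span that only one run proposed, and the sentence filter then removes the span in the low-scoring sentence. Raising `min_votes` or `threshold` trades recall for precision, which is the general effect the abstract attributes to sentence filtering.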