Validation of a Small Language Model for DSM-5 Substance Category Classification in Child Welfare Records

Background: Recent studies have demonstrated that large language models (LLMs) can perform binary classification tasks on child welfare narratives, detecting the presence or absence of constructs such as substance-related problems, domestic violence, and firearms involvement. Whether smaller, locally deployable models can move beyond binary detection to classify specific substance types from these narratives remains untested.

Objective: To validate a locally hosted LLM classifier for identifying specific substance types aligned with DSM-5 categories in child welfare investigation narratives.

Methods: A locally hosted 20-billion-parameter LLM classified child maltreatment investigation narratives from a Midwestern U.S. state. Records previously identified as containing substance-related problems were passed to a second classification stage targeting seven DSM-5 substance categories. Expert human review of 900 stratified cases assessed classification precision, recall, and inter-method reliability (Cohen's kappa). Test-retest stability was evaluated using approximately 15,000 independently classified records.

Results: Five substance categories achieved almost perfect inter-method agreement (kappa = 0.94-1.00): alcohol, cannabis, opioid, stimulant, and sedative/hypnotic/anxiolytic. Classification precision ranged from 92% to 100% for these categories. Two low-prevalence categories (hallucinogen, inhalant) performed poorly. Test-retest agreement ranged from 92.1% to 99.1% across the seven categories.

Conclusions: A small, locally hosted LLM can reliably classify substance types from child welfare administrative text, extending prior work on binary classification to multi-label substance identification.
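For readers less familiar with the validation metrics named in the Methods, the following is a minimal sketch of how per-category precision, recall, Cohen's kappa, and test-retest percent agreement can be computed from paired binary labels. It is not the study's analysis code: the function names (`agreement_metrics`, `percent_agreement`) and the toy labels are hypothetical, and the example treats a single substance category as a yes/no decision compared against expert review.

```python
import math
from typing import Dict, Sequence


def agreement_metrics(model: Sequence[int], expert: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, and Cohen's kappa for one binary category.

    `model` and `expert` are parallel lists of 0/1 labels (hypothetical data);
    the expert labels are treated as the reference standard.
    """
    if len(model) != len(expert) or not model:
        raise ValueError("label lists must be non-empty and the same length")

    # Confusion-matrix cells.
    tp = sum(1 for m, e in zip(model, expert) if m == 1 and e == 1)
    fp = sum(1 for m, e in zip(model, expert) if m == 1 and e == 0)
    fn = sum(1 for m, e in zip(model, expert) if m == 0 and e == 1)
    tn = sum(1 for m, e in zip(model, expert) if m == 0 and e == 0)
    n = tp + fp + fn + tn

    precision = tp / (tp + fp) if (tp + fp) else math.nan
    recall = tp / (tp + fn) if (tp + fn) else math.nan

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = (tp + tn) / n
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (
        (p_observed - p_expected) / (1 - p_expected)
        if p_expected != 1
        else math.nan
    )
    return {"precision": precision, "recall": recall, "kappa": kappa}


def percent_agreement(run_a: Sequence[int], run_b: Sequence[int]) -> float:
    """Share of records given the same label in two independent runs."""
    if len(run_a) != len(run_b) or not run_a:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(1 for a, b in zip(run_a, run_b) if a == b) / len(run_a)


# Toy labels for one category (illustrative only, not study data).
model_labels = [1, 1, 1, 0, 0, 0, 1, 0]
expert_labels = [1, 1, 0, 0, 0, 0, 1, 1]
rerun_labels = [1, 1, 1, 0, 0, 0, 1, 1]

metrics = agreement_metrics(model_labels, expert_labels)
retest = percent_agreement(model_labels, rerun_labels)
```

Because kappa corrects for chance agreement, it can be unstable for low-prevalence categories even when raw agreement is high, which is one reason to report it alongside precision and recall rather than in place of them.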
