Objective: This study develops a systematic benchmarking framework for testing whether language models can accurately identify constructs of interest in child welfare records. We assess how different model sizes and architectures perform on four validated benchmarks for classifying critical risk factors among child welfare-involved families: domestic violence, firearms, substance-related problems generally, and opioids specifically.

Method: We constructed four benchmarks for identifying risk factors in child welfare investigation summaries: domestic violence, substance-related problems, firearms, and opioids (n=500 each). We evaluated seven model sizes (0.6B-32B parameters) in standard and extended reasoning modes, plus a mixture-of-experts variant. Cohen's kappa measured agreement with gold-standard classifications established by human experts.

Results: The benchmarking revealed a critical finding: bigger models are not better. A small 4B-parameter model with extended reasoning proved most effective, outperforming models up to eight times larger and achieving "substantial" to "almost perfect" agreement across all four benchmark categories: "almost perfect" agreement (κ = 0.93-0.96) on three benchmarks (substance-related problems, firearms, and opioids) and "substantial" agreement (κ = 0.74) on the most complex task (domestic violence). Small models with extended reasoning rivaled the largest models while being more resource-efficient.

Conclusions: Small reasoning-enabled models achieve accuracy levels that historically required larger architectures, yielding substantial savings in time and computation. The benchmarking framework provides a method for evidence-based model selection that balances accuracy with practical resource constraints before operational deployment in social work research.
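For readers unfamiliar with the agreement metric, Cohen's kappa corrects raw accuracy for chance agreement: kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance. The sketch below illustrates how a kappa score against an expert gold standard might be computed and mapped to the "substantial" / "almost perfect" labels used above (the Landis & Koch bands). It is a minimal illustration, not the authors' evaluation code; the label arrays and the use of scikit-learn are assumptions for the example.

```python
# Minimal sketch of the agreement computation described in the abstract.
# The data and helper names here are illustrative, not the study's pipeline.
from sklearn.metrics import cohen_kappa_score


def landis_koch_band(kappa: float) -> str:
    """Map a kappa value to the Landis & Koch (1977) agreement labels."""
    if kappa >= 0.81:
        return "almost perfect"
    elif kappa >= 0.61:
        return "substantial"
    elif kappa >= 0.41:
        return "moderate"
    elif kappa >= 0.21:
        return "fair"
    elif kappa >= 0.0:
        return "slight"
    return "poor"


# Hypothetical binary labels: 1 = risk factor present, 0 = absent.
gold = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]  # expert gold-standard classifications
pred = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]  # model classifications

kappa = cohen_kappa_score(gold, pred)
print(f"kappa = {kappa:.2f} ({landis_koch_band(kappa)})")
```

Running this toy example yields kappa = 0.80 ("substantial"): the model agrees with the gold standard on 9 of 10 records (p_o = 0.9), but because roughly half that agreement would be expected by chance (p_e = 0.5), kappa discounts it accordingly. This is why the abstract reports kappa rather than raw accuracy.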