Fine-Tuning Large Language Models for Automated Depression Screening in Nigerian Pidgin English: GENSCORE Pilot Study

Depression is a major contributor to the mental-health burden in Nigeria, yet screening coverage remains limited by low access to clinicians, stigma, and language barriers. Traditional tools such as the Patient Health Questionnaire-9 (PHQ-9) were validated in high-income countries and may be linguistically or culturally inaccessible in low- and middle-income settings such as Nigeria, where people communicate in Nigerian Pidgin and more than 520 local languages. This study presents a novel approach to automated depression screening using fine-tuned large language models (LLMs) adapted for conversational Nigerian Pidgin. We collected a dataset of 432 Pidgin-language audio responses from Nigerian young adults aged 18-40 to prompts assessing psychological experiences aligned with PHQ-9 items. The responses were transcribed, rigorously preprocessed, and annotated, with annotation covering semantic labeling, slang and idiom interpretation, and PHQ-9 severity scoring. Three LLMs (Phi-3-mini-4k-instruct, Gemma-3-4B-it, and GPT-4.1) were fine-tuned on this annotated dataset, and their performance was evaluated quantitatively (accuracy, precision, and semantic alignment) and qualitatively (clarity, relevance, and cultural appropriateness). GPT-4.1 achieved the highest quantitative performance, with 94.5% accuracy in PHQ-9 severity score prediction, outperforming Gemma-3-4B-it and Phi-3-mini-4k-instruct. Qualitatively, GPT-4.1 also produced the most culturally appropriate, clear, and contextually relevant responses. This work provides a foundation for AI-mediated depression screening in underserved Nigerian communities and for deploying conversational mental-health tools in linguistically diverse, resource-constrained environments.
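
To make the evaluation target concrete, the following is a minimal sketch of how PHQ-9 severity labels and the reported quantitative metrics might be computed. It assumes the standard PHQ-9 severity bands (total score 0-27) and macro-averaged precision; the abstract does not state which banding or averaging the study used, and the function names (phq9_severity, accuracy, macro_precision) are hypothetical.

```python
# Standard PHQ-9 severity bands over the total score (0-27).
# Assumption: the study's severity labels follow these bands.
PHQ9_BANDS = [
    (0, 4, "minimal"),
    (5, 9, "mild"),
    (10, 14, "moderate"),
    (15, 19, "moderately severe"),
    (20, 27, "severe"),
]


def phq9_severity(item_scores):
    """Map nine PHQ-9 item scores (each 0-3) to a severity band."""
    if len(item_scores) != 9 or any(s not in (0, 1, 2, 3) for s in item_scores):
        raise ValueError("expected nine item scores, each in 0..3")
    total = sum(item_scores)
    for low, high, label in PHQ9_BANDS:
        if low <= total <= high:
            return label
    raise AssertionError("unreachable: total is always within 0..27")


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the annotated severity label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_precision(y_true, y_pred):
    """Unweighted mean of per-label precision (an assumed averaging choice)."""
    labels = set(y_true) | set(y_pred)
    per_label = []
    for label in labels:
        predicted = sum(p == label for p in y_pred)
        correct = sum(t == p == label for t, p in zip(y_true, y_pred))
        # A label that is never predicted contributes a precision of 0.
        per_label.append(correct / predicted if predicted else 0.0)
    return sum(per_label) / len(per_label)
```

Under these assumptions, a model's predicted severity band for each transcribed response would be compared against the annotated band, and accuracy(annotated, predicted) corresponds to the kind of figure reported for GPT-4.1.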
