Automatic pathological speech detection approaches have shown promising results, gaining attention as potential complements to costly traditional diagnostic methods. While these approaches can achieve high accuracy, their lack of interpretability limits their applicability in clinical practice. In this paper, we investigate the use of multimodal Large Language Models (LLMs), specifically ChatGPT-4o, for automatic pathological speech detection in a few-shot in-context learning setting. Experimental results show that this approach not only delivers promising performance but also provides explanations for its decisions, enhancing model interpretability. To further understand its effectiveness, we conduct an ablation study analyzing the impact of different factors, such as input type and system prompts, on the final results. Our findings highlight the potential of multimodal LLMs for further exploration and advancement in automatic pathological speech detection.
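The few-shot in-context learning setup described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the OpenAI chat-completions message format with base64-encoded WAV audio parts, and the labels, prompt text, and helper names are hypothetical.

```python
import base64


def encode_audio(path):
    # Read a WAV file and base64-encode it for an audio message part.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def audio_part(b64_wav):
    # Audio content part in the chat-completions message format (assumed).
    return {"type": "input_audio", "input_audio": {"data": b64_wav, "format": "wav"}}


def build_fewshot_messages(system_prompt, examples, test_b64):
    """Assemble a few-shot prompt for pathological speech detection.

    examples: list of (b64_wav, label) pairs used as in-context demonstrations,
              with label in {"healthy", "pathological"} (hypothetical labels).
    test_b64: base64-encoded WAV of the recording to classify.
    """
    messages = [{"role": "system", "content": system_prompt}]
    # Each demonstration is a user turn (audio) followed by the assistant's label.
    for b64_wav, label in examples:
        messages.append({"role": "user", "content": [audio_part(b64_wav)]})
        messages.append({"role": "assistant", "content": label})
    # Final query: ask for a label plus an explanation, for interpretability.
    messages.append({
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Classify this recording as 'healthy' or 'pathological' "
                     "and briefly explain your decision."},
            audio_part(test_b64),
        ],
    })
    return messages
```

Under these assumptions, the resulting message list would be passed to an audio-capable chat model (e.g. via `client.chat.completions.create`); varying the system prompt and the input type here corresponds to the ablation factors the abstract mentions.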