Objective: To improve the efficiency of medical question answering (MedQA) with large language models (LLMs) by avoiding unnecessary reasoning while maintaining accuracy. Methods: We propose Selective Chain-of-Thought (Selective CoT), an inference-time strategy that first predicts whether a question requires reasoning and generates a rationale only when needed. Two open-source LLMs (Llama-3.1-8B and Qwen-2.5-7B) were evaluated on four biomedical QA benchmarks: HeadQA, MedQA-USMLE, MedMCQA, and PubMedQA. Metrics included accuracy, total generated tokens, and inference time. Results: Selective CoT reduced inference time by 13-45% and token usage by 8-47% with minimal accuracy loss ($\leq$4\%). In some model-task pairs, it achieved both higher accuracy and greater efficiency than standard CoT. Compared with fixed-length CoT, Selective CoT reached similar or superior accuracy at substantially lower computational cost. Discussion: Selective CoT dynamically balances reasoning depth and efficiency by invoking explicit reasoning only when beneficial, reducing redundancy on recall-type questions while preserving interpretability. Conclusion: Selective CoT provides a simple, model-agnostic, and cost-effective approach for medical QA, aligning reasoning effort with question complexity to enhance real-world deployability of LLM-based clinical systems.
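The two-stage strategy described in the Methods can be sketched as a simple routing loop. This is a minimal illustration, not the paper's implementation: the gate and answer functions are hypothetical stubs standing in for LLM calls, and the paper does not specify the prompts or decision mechanism used.

```python
def needs_reasoning(question: str) -> bool:
    """Stage 1 gate: predict whether explicit reasoning would help.
    Stub heuristic standing in for an LLM-based classifier; in practice
    this would be a model call with a routing prompt."""
    recall_cues = ("what is the definition", "which term describes")
    return not any(cue in question.lower() for cue in recall_cues)


def answer_direct(question: str) -> str:
    """Stage 2a (cheap path): answer without generating a rationale.
    Placeholder for a direct-answer LLM call."""
    return f"[direct answer to: {question}]"


def answer_with_cot(question: str) -> str:
    """Stage 2b (expensive path): generate a rationale, then the answer.
    Placeholder for a chain-of-thought LLM call."""
    rationale = f"[step-by-step rationale for: {question}]"
    return rationale + "\n[final answer]"


def selective_cot(question: str) -> str:
    """Route each question to the direct or chain-of-thought path,
    spending reasoning tokens only when the gate predicts they help."""
    if needs_reasoning(question):
        return answer_with_cot(question)
    return answer_direct(question)
```

The efficiency gains reported in the Results come from the cheap path: recall-type questions skip rationale generation entirely, saving the tokens and latency that a fixed chain-of-thought policy would spend on every question.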