Large Language Models (LLMs) are increasingly widely used, yet their performance has been observed to vary with prompting style and tone. In this study, we investigate whether and how tonal variations in prompts lead to differences in LLM accuracy on objective multiple-choice questions. We use two datasets: one of 50 base questions with five tone variants, and an MMLU subset of 570 base questions spanning 57 subjects with seven tone variants. We evaluate four cost-efficient, popular LLMs: ChatGPT-4o, ChatGPT-5-nano, Gemini 2.5 Flash, and Gemini 2.5 Flash Lite. Across models, tonal effects are systematic but highly model-dependent: some models show small yet statistically significant shifts, while others exhibit large accuracy swings across tones. We further identify subject-level differences in tone sensitivity and present a routing framework to explain how tone may modulate internal reasoning modes. Our findings caution users against assuming that LLM deployments are reliably robust to tone.