An Empirical Study of Gemini 3 for Detecting Natural Language Test Smells in Manual Test Cases

Manual testing, in which testers follow natural language instructions to validate system behavior, remains essential for uncovering issues that are difficult to capture with automation. However, manual test cases often contain test smells: quality issues such as ambiguity, redundancy, or missing checks that reduce reliability, maintainability, and reproducibility. Existing detection approaches largely depend on manually engineered rules and thus struggle to generalize and scale across heterogeneous test suites. In our previous work, we assessed the feasibility of using Small Language Models (SLMs) for test smell detection by evaluating GEMMA-3-4B, LLAMA-3.2-3B, and PHI-4-14B on test steps from 143 real-world Ubuntu test cases, covering seven smell types; PHI-4-14B achieved the best performance. In this article, we investigate whether a Large Language Model (LLM) that was current at the time of the study, GEMINI-3-PRO-PREVIEW, can identify test smells in natural language manual test cases using a prompt-based, whole-test-case analysis strategy. Unlike approaches that analyze individual test steps in isolation, our approach evaluates complete test cases, enabling the model to consider relationships and dependencies among test steps. We evaluate the approach on 100 Ubuntu test cases covering seven test smell types and compare its performance against the three previously evaluated SLMs. Our results show that GEMINI-3-PRO-PREVIEW outperforms the SLMs while producing actionable explanations that can help practitioners revise manual test cases for greater clarity and consistency. We also find that test smells are pervasive in practice, with nearly one detected test smell per step on average, highlighting the need for scalable, automated quality support for manual testing artifacts.
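
To make the contrast between step-level and whole-test-case analysis concrete, the following is a minimal Python sketch of how the two kinds of prompt might be assembled. It is an illustration under stated assumptions, not the prompt used in the study: the `ManualTestCase` type, both function names, and the prompt wording are hypothetical, and `SMELL_TYPES` contains only the three example smells named above rather than the full seven-type catalogue. No model call is made.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ManualTestCase:
    """A manual test case: a title plus ordered natural language steps."""
    title: str
    steps: list[str]


# Assumption: placeholder labels taken from the examples in the abstract.
# The study's full list of seven smell types is not reproduced here.
SMELL_TYPES = ("ambiguity", "redundancy", "missing check")


def build_whole_case_prompt(case: ManualTestCase, smell_types=SMELL_TYPES) -> str:
    """Put every step into one prompt so cross-step dependencies are visible."""
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(case.steps, start=1))
    return (
        "Review the complete manual test case below.\n"
        f"Smell types to look for: {', '.join(smell_types)}.\n"
        "For each smell, report the step number, the smell type, and a short explanation.\n\n"
        f"Test case: {case.title}\n{numbered}\n"
    )


def build_per_step_prompts(case: ManualTestCase, smell_types=SMELL_TYPES) -> list[str]:
    """Baseline for comparison: one prompt per step, with no surrounding context."""
    return [
        f"Smell types to look for: {', '.join(smell_types)}.\n"
        f"Review this single test step in isolation: {step}\n"
        for step in case.steps
    ]


if __name__ == "__main__":
    example = ManualTestCase(
        title="Install a package",
        steps=["Open the terminal", "Run the installer", "Check it works"],
    )
    print(build_whole_case_prompt(example))
```

In the per-step variant, a step such as "Check it works" is judged without the steps that precede it, whereas the whole-case prompt exposes the earlier steps that determine what "it" refers to.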
