With the rapid improvement of LLMs' coding capabilities, the bottleneck of LLM-based automated software development is shifting from generating correct code to eliciting users' requirements. Despite growing interest, the interview competence of LLMs in conversational requirements elicitation remains largely underexplored. Existing evaluations often rely on a handful of scenarios, real user interaction, and subjective human scoring, which hinders systematic and quantitative comparison. To address these challenges, we propose ReqElicitGym, an interactive and automatic evaluation environment for assessing interview competence in conversational requirements elicitation. Specifically, ReqElicitGym introduces a new evaluation dataset and designs both an interactive oracle user and a task evaluator. The dataset contains 101 website requirements elicitation scenarios spanning 10 application types, and both the oracle user and the task evaluator achieve high agreement with real users and expert judgment. With ReqElicitGym, any automated conversational requirements elicitation approach (e.g., an LLM-based agent) can be evaluated reproducibly and quantitatively through interaction with the environment. Based on ReqElicitGym, we conduct a systematic empirical study of seven representative LLMs; the results show that current LLMs still exhibit limited interview competence in uncovering implicit requirements. In particular, they elicit less than half of users' implicit requirements, and their effective elicitation questions tend to emerge only in later turns of the dialogue. Moreover, we find that LLMs can elicit interaction- and content-related implicit requirements but consistently struggle with style-related ones. We believe ReqElicitGym will facilitate the evaluation and development of automated conversational requirements elicitation.
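To make the evaluation protocol concrete, the following is a minimal sketch of the interaction loop the abstract describes: an interviewer agent questions a simulated oracle user over a fixed number of turns, and a task evaluator scores the fraction of implicit requirements uncovered. All names here (Scenario, OracleUser, run_episode, and the toy relevance matching) are hypothetical illustrations, not ReqElicitGym's actual interface.

```python
# Hypothetical sketch of the interactive evaluation protocol described above.
# None of these names come from ReqElicitGym itself; they only illustrate the
# loop: an interviewer agent questions an oracle user, and a task evaluator
# scores how many implicit requirements were elicited.

from dataclasses import dataclass, field


@dataclass
class Scenario:
    """One elicitation scenario (the dataset holds 101 of these)."""
    app_type: str                      # one of the 10 application types
    explicit_requirements: list[str]   # stated up front by the user
    implicit_requirements: list[str]   # must be uncovered by interviewing


@dataclass
class OracleUser:
    """Simulated user that only reveals needs it is directly asked about."""
    scenario: Scenario
    revealed: set[str] = field(default_factory=set)

    def answer(self, question: str) -> str:
        # Toy relevance check: reveal an implicit requirement if the question
        # mentions it (a real oracle would use an LLM to judge relevance).
        for req in self.scenario.implicit_requirements:
            if req not in self.revealed and any(
                word in question.lower() for word in req.lower().split()
            ):
                self.revealed.add(req)
                return f"Good question. Actually, I also need: {req}"
        return "Nothing specific comes to mind for that."


def run_episode(agent, scenario: Scenario, max_turns: int = 10) -> float:
    """Run one interview; return the elicitation rate (task evaluator)."""
    user = OracleUser(scenario)
    transcript: list[tuple[str, str]] = []
    for _ in range(max_turns):
        question = agent(transcript, scenario.explicit_requirements)
        transcript.append(("interviewer", question))
        transcript.append(("user", user.answer(question)))
    # Score: fraction of implicit requirements the interviewer uncovered.
    return len(user.revealed) / max(1, len(scenario.implicit_requirements))
```

Under this sketch, any interviewer (an LLM-based agent, a scripted baseline) plugs in as the `agent` callable, which is what makes the comparison reproducible and quantitative.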