More than a Judge: An Empirical Study of Agent-Human Interaction in Crowdsourced Testing Assessment

Agentic AI is increasingly integrated into software engineering workflows. In crowdsourced testing, however, the large volume and uneven quality of submitted reports still impose a substantial review burden on developers. In prior work, we developed and validated a multi-agent assessment backbone based on the LLM-as-a-Judge paradigm. That backbone assesses reports along three dimensions (textuality, adequacy, and competitiveness) and was shown to align well with human consensus while substantially reducing assessment effort. Yet reliable automated judging does not by itself show whether agent outputs improve human work once they are embedded in the workflow. This paper addresses that open question in the context of crowdsourced testing. We investigate whether actionable feedback derived from the assessment can improve how testers revise reports, how they perform on later tasks, and whether they transfer reporting practices across applications. To this end, we conducted a controlled four-stage human-subject study with 20 testers across three real-world applications. The results show that agent-generated feedback supports immediate improvements in revised reports and better first submissions on a new task after prior feedback exposure, and they provide evidence of partial but meaningful transfer to a later application. A post-task questionnaire completed by 17 participants complements these artifact-based findings: it suggests that the feedback was generally understandable, acted upon in revision, and carried into later tasks, while also revealing remaining friction in specificity and execution. Overall, the study provides empirical evidence that, in the studied crowdsourced testing setting, assessment agents can serve not only as post-hoc judges but also as workflow-integrated feedback providers that support upstream report-quality improvement.
