For safety-critical applications, it is crucial to audit 3D human pose estimators before deployment. Will the system break down if the weather or the clothing changes? Is it robust to differences in gender and age? To answer these questions and more, we need controlled studies with image pairs that differ in a single attribute, but benchmarks of real images cannot provide such pairs. We thus present STAGE, a GenAI data toolkit for auditing 3D human pose estimators. For STAGE, we develop the first GenAI image creator with accurate 3D pose control and propose a novel evaluation strategy to isolate and quantify the effects of single factors such as gender, ethnicity, age, clothing, location, and weather. Enabled by STAGE, we generate a series of benchmarks to audit, for the first time, the sensitivity of popular pose estimators to such factors. Our results show that natural variations can severely degrade pose estimator performance, raising doubts about the estimators' readiness for open-world deployment. We aim to highlight these robustness issues and establish STAGE as a benchmark to quantify them.