Text-guided human pose editing has gained significant traction in AIGC applications. However, it remains plagued by structural anomalies and generative artifacts. Existing evaluation metrics often isolate authenticity detection from quality assessment and fail to provide fine-grained insight into pose-specific inconsistencies. To address these limitations, we introduce HPE-Bench, a specialized benchmark of 1,700 standardized samples from 17 state-of-the-art editing models, annotated with both authenticity labels and multi-dimensional quality scores. We further propose a unified framework built on layer-selective multimodal large language models (MLLMs). By combining contrastive LoRA tuning with a novel layer sensitivity analysis (LSA) mechanism, we identify the feature layer best suited to pose evaluation. Our framework achieves superior performance in both authenticity detection and multi-dimensional quality regression, effectively bridging the gap between forensic detection and quality assessment.