Recent advances in diffusion models have significantly elevated the visual fidelity of Virtual Try-On (VTON) systems, yet reliable evaluation remains a persistent bottleneck. Traditional metrics struggle to quantify fine-grained texture details and semantic consistency, while existing datasets fail to meet commercial standards in scale and diversity. We present OpenVTON-Bench, a large-scale benchmark comprising approximately 100K high-resolution image pairs (up to $1536 \times 1536$). The dataset is constructed using DINOv3-based hierarchical clustering for semantically balanced sampling and Gemini-powered dense captioning, ensuring a uniform distribution across 20 fine-grained garment categories. To support reliable evaluation, we propose a multi-modal protocol that measures VTON quality along five interpretable dimensions: background consistency, identity fidelity, texture fidelity, shape plausibility, and overall realism. The protocol integrates VLM-based semantic reasoning with a novel Multi-Scale Representation Metric based on SAM3 segmentation and morphological erosion, enabling the separation of boundary alignment errors from internal texture artifacts. Experimental results show strong agreement with human judgments (Kendall's $\tau$ of 0.833 vs. 0.611 for SSIM), establishing a robust benchmark for VTON evaluation.
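The separation of boundary alignment errors from internal texture artifacts can be illustrated with a minimal sketch: morphological erosion of a garment segmentation mask yields an interior core, and the set difference between the full mask and that core yields a boundary band, so pixel-level errors can be scored on each region independently. The function names, the erosion radius, and the mean-absolute-error scoring below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from scipy.ndimage import binary_erosion

def split_mask_regions(mask: np.ndarray, erosion_radius: int = 5):
    """Split a binary garment mask into an interior core and a boundary band
    via morphological erosion (hypothetical sketch of the described metric)."""
    size = 2 * erosion_radius + 1
    structure = np.ones((size, size), dtype=bool)
    interior = binary_erosion(mask, structure=structure)
    boundary = mask & ~interior  # pixels near the mask edge
    return interior, boundary

def region_error(pred: np.ndarray, ref: np.ndarray, region: np.ndarray) -> float:
    """Mean absolute pixel error restricted to one region of the mask
    (an assumed stand-in for the per-region texture/alignment score)."""
    if not region.any():
        return 0.0
    return float(np.abs(pred.astype(np.float64) - ref.astype(np.float64))[region].mean())
```

Scoring `region_error` on `boundary` then isolates misalignment near garment contours, while the same score on `interior` reflects texture artifacts away from the edges.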
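The reported agreement statistic, Kendall's $\tau$, counts concordant versus discordant pairs between two score lists. A minimal pure-Python version of the tie-free ($\tau$-a) variant, written here only to make the statistic concrete (the paper does not specify which variant it uses):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs.
    Assumes no ties; a sketch of the agreement statistic, not the
    benchmark's exact evaluation code."""
    assert len(x) == len(y) and len(x) >= 2
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1   # pair ordered the same way in both lists
        elif s < 0:
            discordant += 1   # pair ordered oppositely
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs
```

A metric that ranks systems identically to human raters scores $\tau = 1$, opposite rankings score $-1$, and the paper's 0.833 versus 0.611 gap means the proposed protocol flips far fewer pairwise orderings relative to human judgment than SSIM does.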