Text-to-image models now generate graphic design at production scale, yet their supervision still comes primarily from photo-style preference datasets that record a single overall verdict per comparison. Designers, however, evaluate a design along several distinct axes (e.g., typography, layout, color harmony), and a single preference label collapses these axes into one. We release \emph{TASTE} \textit{(Typography, Aesthetics, Spatial, Tone, Etc.)}, a multi-dimensional preference dataset in which two disjoint cohorts, each of five professional designers, ranked outputs from four current text-to-image models on nine criteria and recorded per-image hallucination flags. We pair the dataset with two contributions. First, we present a criterion-agnostic signal-validation framework built on Kendall's $\tau$, majority-vote probability, and Condorcet cycles, each tested against an exact i.i.d.-uniform null; the analysis reveals significant but moderate agreement among designers, with every TASTE criterion rejecting the random-rater null. Second, we benchmark preference models on TASTE and find that off-the-shelf VLM judges and dedicated T2I scorers fail to reach majority agreement with the designer panel, whereas a small MLP head trained directly on TASTE substantially narrows the gap to the single-rater ceiling, setting a baseline for future TASTE-trained preference models.
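As an illustration of what an exact i.i.d.-uniform null for Kendall's $\tau$ can look like, the following is a minimal sketch, not the paper's implementation. It assumes strict rankings of four items (one per model) with no ties and considers a single pair of raters; the function names `kendall_tau`, `exact_tau_null`, and `upper_tail_p` are hypothetical, and the panel-level statistics used in the actual framework may differ.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations, permutations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two strict rankings (rank_x[i] = rank of item i)."""
    n = len(rank_a)
    score = 0
    for i, j in combinations(range(n), 2):
        # Positive product: the pair is ordered the same way by both raters.
        product = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        score += 1 if product > 0 else -1
    return Fraction(score, n * (n - 1) // 2)


def exact_tau_null(n_items):
    """Exact distribution of tau when both rankings are iid uniform permutations.

    By symmetry, one ranking can be fixed to the identity while the other
    ranges over all n_items! permutations.
    """
    identity = tuple(range(n_items))
    counts = Counter(
        kendall_tau(identity, perm) for perm in permutations(range(n_items))
    )
    total = sum(counts.values())
    return {tau: Fraction(c, total) for tau, c in sorted(counts.items())}


def upper_tail_p(null, observed):
    """One-sided exact p-value: P(tau >= observed) under the null."""
    return sum(p for tau, p in null.items() if tau >= observed)


if __name__ == "__main__":
    null = exact_tau_null(4)  # assumption: four ranked items per comparison
    for tau, prob in null.items():
        print(f"tau = {str(tau):>5}  P = {prob}")
    print("P(tau >= 2/3) =", upper_tail_p(null, Fraction(2, 3)))
```

Because the null is enumerated rather than simulated, the resulting p-values are exact rational numbers; with only four items the support of $\tau$ has just seven values, so a single pair of raters carries limited power, which is one reason to aggregate over a panel.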