Collecting large, aligned cross-modal datasets for music-flavor research is difficult because perceptual experiments are costly and therefore small in scale. We address this bottleneck through two complementary experiments. The first tests whether audio-flavor correlations, feature-importance rankings, and latent-factor structure transfer from an experimental soundtracks collection (257~tracks with human annotations) to a large FMA-derived corpus ($\sim$49,300 segments with synthetic labels). The second validates computational flavor targets, derived from food chemistry via a reproducible pipeline, against human perception in an online listener study (49~participants, 20~tracks). Results from both experiments converge: the quantitative transfer analysis confirms that cross-modal structure is preserved across supervision regimes, and the perceptual evaluation shows significant alignment between computational targets and listener ratings (permutation $p<0.0001$, Mantel $r=0.45$, Procrustes $m^2=0.51$). Together, these findings support the conclusion that sonic seasoning effects are present in synthetic FMA annotations. We release the datasets and companion code to support reproducible cross-modal AI research.