Variational Autoencoders (VAEs) have proven to be effective models for producing latent representations of cognitive and semantic value. We assess the degree to which VAEs trained on a prototypical tonal music corpus of 371 Bach chorales define latent spaces that represent the circle of fifths and the hierarchical relations among the component pitches of each key, as described in music cognition. Specifically, we compare the latent spaces of VAEs trained on six corpus encodings (piano roll, MIDI, ABC, Tonnetz, DFT of pitch, and pitch class distributions) in terms of how well each provides a pitch space whose key relations align with cognitive distances. We evaluate model performance for these encodings using objective metrics: accuracy, mean squared error (MSE), KL divergence, and computational cost. The ABC encoding performs best at reconstructing the original data, while the Pitch DFT encoding appears to capture more information in the latent space. Furthermore, we adopt an objective evaluation over 12 major or minor transpositions per piece to quantify the alignment of 1) intra- and inter-segment distances per key and 2) key distances with cognitive pitch spaces. Our results show that Pitch DFT VAE latent spaces align best with cognitive spaces and provide a common-tone space in which overlapping objects within a key form fuzzy clusters that impose a well-defined order of structural significance or stability, i.e., a tonal hierarchy. The tonal hierarchies of different keys can be used to measure key distances and the relationships among their in-key components at multiple hierarchical levels (e.g., notes and chords). Our VAE implementation and the encoding framework are available online.
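To make the Pitch DFT encoding named above more concrete, the following is a minimal sketch of one common way to compute it: build a 12-bin pitch class distribution from MIDI note numbers and take its discrete Fourier transform. This is an illustration under stated assumptions, not the authors' implementation. The function names (`pitch_class_distribution`, `pitch_dft`, `transpose`), the count-based weighting, and the choice to keep coefficients 0 through 6 are all hypothetical; the paper's framework may segment, weight, or normalize the data differently.

```python
import numpy as np


def pitch_class_distribution(midi_pitches):
    """Return a normalized 12-bin histogram of pitch classes (C=0, ..., B=11).

    Assumption: every note counts equally; durations are ignored.
    """
    pitch_classes = np.asarray(midi_pitches, dtype=int) % 12
    counts = np.bincount(pitch_classes, minlength=12).astype(float)
    total = counts.sum()
    return counts / total if total > 0 else counts


def pitch_dft(pcd):
    """Return DFT coefficients 0..6 of a 12-bin pitch class distribution.

    For real input, coefficients 7..11 are complex conjugates of 5..1,
    so they carry no additional information.
    """
    return np.fft.fft(np.asarray(pcd, dtype=float))[:7]


def transpose(midi_pitches, semitones):
    """Shift every MIDI pitch by a fixed number of semitones."""
    return [p + semitones for p in midi_pitches]


if __name__ == "__main__":
    # Hypothetical toy segment: one octave of the C major scale.
    c_major = [60, 62, 64, 65, 67, 69, 71]
    g_major = transpose(c_major, 7)

    for name, notes in (("C major", c_major), ("G major", g_major)):
        coeffs = pitch_dft(pitch_class_distribution(notes))
        print(name)
        print("  magnitudes:", np.round(np.abs(coeffs), 3))
        print("  phases    :", np.round(np.angle(coeffs), 3))
```

Transposition circularly shifts the pitch class distribution, so it changes only the phases of the coefficients and leaves their magnitudes intact. For a diatonic scale, the fifth coefficient has the largest magnitude, and its phase moves as the scale is transposed around the circle of fifths, which is one intuition for why a DFT-based encoding could align with key distances.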