PitchBench: Measuring Pitch Hearing in Audio-Language Models

Audio-language models (ALMs) are increasingly used in real-world applications that require understanding music, from music tutoring and transcription to captioning, recommendation systems, and music production. More broadly, they are becoming an important component of multimodal AI systems that must reason from sensory input rather than text alone. This makes reliable musical perception a critical prerequisite: if a model cannot accurately hear the structure of sound, it cannot be trusted to reason about, teach, transcribe, or act on audio in the real world. Yet existing benchmarks rarely assess one of the most fundamental musical abilities underlying such perception: pitch hearing. Current evaluations tend to probe pitch hearing only indirectly, through higher-level tasks and often in multiple-choice formats, leaving open how reliably ALMs identify fine-grained pitch across instruments, acoustic conditions, and response formats. We introduce PitchBench, an evaluation suite that systematically measures pitch hearing in ALMs. PitchBench comprises 28 experiments spanning absolute and relative pitch perception within sequences and chords, while varying loudness, note duration, sound source, time stretching, background noise, and other acoustic conditions. Tasks range from identifying individual pitches in isolation to tracking a melodic line within a four-part musical texture. Evaluating frontier ALMs, we find that pitch hearing remains highly unreliable: models perform consistently poorly across settings, with accuracy varying sharply by sound source, note duration, and notation format. Current ALMs do not yet possess stable pitch perception, even for controlled synthetic and instrumental stimuli. Alongside the benchmark, we release PitchBench as a Python package containing the evaluation data and data generation tools to support future work on pitch-aware audio-language modeling.
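As a rough illustration of the kind of controlled stimulus described above, the sketch below synthesizes a single sine tone at a given pitch, with duration and loudness as parameters, and scores free-form note-name responses by exact match. This is not the PitchBench package API: every function name and default value here (`midi_to_hz`, `midi_to_name`, `sine_tone`, `exact_match_accuracy`, the 16 kHz sample rate) is a hypothetical choice made for illustration, and the sketch assumes twelve-tone equal temperament with A4 = 440 Hz.

```python
import math

# Sharp-based pitch-class names; one of several possible notation formats.
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def midi_to_hz(midi_note: int, a4_hz: float = 440.0) -> float:
    """Equal-tempered frequency for a MIDI note number (A4 is MIDI 69)."""
    return a4_hz * 2.0 ** ((midi_note - 69) / 12)


def midi_to_name(midi_note: int) -> str:
    """Scientific pitch notation for a MIDI note number (60 maps to C4)."""
    octave = midi_note // 12 - 1
    return f"{NOTE_NAMES[midi_note % 12]}{octave}"


def sine_tone(midi_note: int, duration_s: float = 0.5,
              amplitude: float = 0.5, sample_rate: int = 16000) -> list[float]:
    """Synthesize a pure sine tone as a list of float samples in [-1, 1].

    duration_s and amplitude stand in for the note-duration and loudness
    conditions; a real benchmark would also vary the sound source.
    """
    freq = midi_to_hz(midi_note)
    n_samples = int(duration_s * sample_rate)
    return [amplitude * math.sin(2 * math.pi * freq * i / sample_rate)
            for i in range(n_samples)]


def exact_match_accuracy(predictions: list[str], targets: list[str]) -> float:
    """Fraction of responses that match the target note name exactly,
    ignoring surrounding whitespace and letter case."""
    if not targets:
        return 0.0
    hits = sum(p.strip().upper() == t.strip().upper()
               for p, t in zip(predictions, targets))
    return hits / len(targets)


if __name__ == "__main__":
    target_notes = [60, 69]                      # C4 and A4
    stimuli = [sine_tone(m) for m in target_notes]
    targets = [midi_to_name(m) for m in target_notes]
    # Placeholder responses standing in for a model's answers.
    responses = ["C4", "G4"]
    print(targets, exact_match_accuracy(responses, targets))
```

Exact-match scoring is the strictest option; a more forgiving variant could report error in semitones, or accept enharmonic spellings such as flats, neither of which this sketch handles.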
