Omnimodal Notation Processing (ONP) is a distinctive frontier for omnimodal AI because it demands rigorous, multi-dimensional alignment across the auditory, visual, and symbolic domains. Current research remains fragmented, focusing on isolated transcription tasks that fail to bridge the gap between superficial pattern recognition and the underlying musical logic. The landscape is further complicated by a severe bias toward Western staff notation and by the inherent unreliability of "LLM-as-a-judge" metrics, which often mask structural reasoning failures with systemic hallucinations. To establish a more rigorous standard, we introduce ONOTE, a multi-format benchmark whose deterministic pipeline, grounded in canonical pitch projection, eliminates subjective scoring bias across diverse notation systems. Our evaluation of leading omnimodal models exposes a fundamental disconnect between perceptual accuracy and music-theoretic comprehension, and the benchmark provides a framework for diagnosing reasoning vulnerabilities in complex, rule-constrained domains.