Symbolic music evaluation for large language models remains fragmented across representations, datasets, and metrics. We introduce LilyBench, a LilyPond-based benchmark that jointly evaluates symbolic music generation and music understanding on the same family of open-weight LLMs. The benchmark includes a 200-prompt generation suite and ten understanding tasks adapted from ABC-Eval, covering syntax, metadata prediction, structural sequencing, and music recognition. Generation quality is evaluated using compile rate, Jensen-Shannon similarity between MusPy descriptor distributions, and LilyBERT-based Fréchet Music Distance (FMD). Experiments on four open-weight models show that executable LilyPond generation is achievable in zero-shot settings, while structural understanding tasks remain challenging despite strong performance on composer and genre recognition. The results also reveal systematic disagreements between descriptor-based and embedding-based metrics, suggesting that symbolic music evaluation benefits from metric triangulation rather than single-score ranking. We release the benchmark, prompt bank, and evaluation code at https://github.com/CSCPadova/lilybench to support future research in symbolic music generation and understanding.
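To make two of the generation metrics concrete, the following is a minimal sketch of compile rate and a Jensen-Shannon similarity between two descriptor histograms. It is not taken from the released evaluation code. The function names `compile_rate` and `js_similarity` are hypothetical, and defining the similarity as one minus the base-2 Jensen-Shannon divergence is an assumption, since the abstract does not state the exact formula or binning.

```python
import math
from typing import Sequence


def compile_rate(compiled_ok: Sequence[bool]) -> float:
    """Fraction of generated scores that compiled (hypothetical helper)."""
    if not compiled_ok:
        raise ValueError("need at least one generation result")
    return sum(1 for ok in compiled_ok if ok) / len(compiled_ok)


def _normalize(hist: Sequence[float]) -> list[float]:
    """Turn non-negative bin counts into a probability distribution."""
    total = sum(hist)
    if total <= 0:
        raise ValueError("histogram must have positive total mass")
    return [h / total for h in hist]


def _kl_bits(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in bits; bins where p is zero contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_similarity(hist_a: Sequence[float], hist_b: Sequence[float]) -> float:
    """Assumed definition: 1 - JSD (base 2), so 1 means identical, 0 disjoint."""
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must share the same bins")
    p, q = _normalize(hist_a), _normalize(hist_b)
    # The mixture m is positive wherever p or q is, so the KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * _kl_bits(p, m) + 0.5 * _kl_bits(q, m)
    return 1.0 - jsd


if __name__ == "__main__":
    # Illustrative values only, not results from the benchmark.
    print(compile_rate([True, True, False, True]))
    print(js_similarity([4, 3, 2, 1], [1, 2, 3, 4]))
```

With base-2 logarithms the divergence lies in [0, 1], so the similarity is bounded to the same range, which makes scores comparable across descriptors with different numbers of bins.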