FMBench: Adaptive Large Language Model Output Formatting

Producing outputs that satisfy both semantic intent and format constraints is essential for deploying large language models in user-facing and system-integrated workflows. In this work, we focus on Markdown formatting, which is ubiquitous in assistants, documentation, and tool-augmented pipelines but still prone to subtle, hard-to-detect errors (e.g., broken lists, malformed tables, inconsistent headings, and invalid code blocks) that can significantly degrade downstream usability. We present FMBench, a benchmark for adaptive Markdown output formatting that evaluates models under a wide range of instruction-following scenarios with diverse structural requirements. FMBench emphasizes real-world formatting behaviors such as multi-level organization, mixed content (natural language interleaved with lists/tables/code), and strict adherence to user-specified layout constraints. To improve Markdown compliance without relying on hard decoding constraints, we propose a lightweight alignment pipeline that combines supervised fine-tuning (SFT) with reinforcement learning fine-tuning. Starting from a base model, we first perform SFT on instruction-response pairs, and then optimize a composite objective that balances semantic fidelity with structural correctness. Experiments on two model families (OpenPangu and Qwen) show that SFT consistently improves semantic alignment, while reinforcement learning provides additional gains in robustness to challenging Markdown instructions when initialized from a strong SFT policy. Our results also reveal an inherent trade-off between semantic and structural objectives, highlighting the importance of carefully designed rewards for reliable formatted generation. Code is available at: https://github.com/FudanCVL/FMBench.
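The abstract does not specify the form of the composite objective, so the following is only an illustrative sketch of how a reward might trade off semantic fidelity against structural correctness. The function names, the three structural checks, the token-overlap proxy for semantics, and the weight `alpha` are all assumptions made for illustration and are not taken from the FMBench code.

```python
import re

FENCE = "`" * 3                      # Markdown code-fence marker
HEADING = re.compile(r"^#{1,6} \S")  # ATX heading: 1-6 hashes, a space, then text


def structural_score(markdown: str) -> float:
    """Fraction of three simple (hypothetical) structural checks that pass."""
    in_code = False
    fence_count = 0
    headings_ok = True
    table_widths = set()
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            fence_count += 1
            in_code = not in_code
            continue
        if in_code:
            continue  # do not inspect lines inside code blocks
        if stripped.startswith("#") and not HEADING.match(stripped):
            headings_ok = False
        if stripped.startswith("|"):
            # Simplification: all table rows are treated as one table.
            table_widths.add(stripped.count("|"))
    checks = [
        fence_count % 2 == 0,    # every code fence is closed
        headings_ok,             # headings have a space after the hashes
        len(table_widths) <= 1,  # table rows have a consistent column count
    ]
    return sum(checks) / len(checks)


def semantic_score(response: str, reference: str) -> float:
    """Token-overlap F1, a crude stand-in for a real semantic metric."""
    resp = set(re.findall(r"\w+", response.lower()))
    ref = set(re.findall(r"\w+", reference.lower()))
    overlap = len(resp & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def composite_reward(response: str, reference: str, alpha: float = 0.5) -> float:
    """Weighted sum: alpha on semantics, (1 - alpha) on structure (assumed form)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * semantic_score(response, reference) + (
        1.0 - alpha
    ) * structural_score(response)
```

Under this assumed form, raising `alpha` favors semantic fidelity and lowering it favors structural correctness, which gives one simple way to picture the trade-off between the two objectives that the abstract reports.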
