Dialogue Act (DA) annotation typically treats communicative or pedagogical intent as localized to individual utterances or turns. As a result, annotators often agree on the underlying action while disagreeing on segment boundaries, reducing apparent reliability. We propose codebook-injected segmentation, which conditions boundary decisions on downstream annotation criteria, and evaluate LLM-based segmenters against standard and retrieval-augmented baselines. To assess these segmenters without gold labels, we introduce evaluation metrics for span consistency, distinctiveness, and human-AI distributional agreement. We find that DA-awareness produces segments that are internally more consistent than those from text-only baselines. While LLMs excel at creating construct-consistent spans, coherence-based baselines remain superior at detecting global shifts in dialogue flow. Across two datasets, no single segmenter dominates: improvements in within-segment coherence frequently trade off against boundary distinctiveness and human-AI distributional agreement. These results highlight segmentation as a consequential design choice that should be optimized for downstream objectives rather than for a single performance score.