Early-stage design ideation often relies on rough sketches created under time pressure, leaving much of the designer's intent implicit. In practice, designers frequently speak while sketching, verbally articulating functional goals and ideas that are difficult to express visually. We introduce TalkSketchD, a sketch-while-speaking dataset that captures spontaneous speech temporally aligned with freehand sketches during early-stage toaster ideation. To examine the dataset's value, we conduct a sketch-to-image generation study with multimodal large language models (MLLMs), comparing sketch-only inputs against sketches augmented by concurrent speech transcripts. Generated images are evaluated against designers' self-reported intent using a reasoning MLLM as a judge. Quantitative results show that incorporating spontaneous speech significantly improves the judged intent alignment of generated design images across form, function, experience, and overall intent. These findings demonstrate that temporally aligned sketch-and-speech data can enhance MLLMs' ability to interpret user intent in early-stage design ideation.