Most digital music tools emphasize precision and control, but often lack support for tactile, improvisational workflows grounded in environmental interaction. Lumia addresses this by enabling users to "compose through looking": transforming visual scenes into musical phrases using a handheld, camera-based interface and large multimodal models. A vision-language model (GPT-4V) analyzes captured imagery to generate structured prompts, which, combined with user-selected instrumentation, guide a text-to-music pipeline (Stable Audio). This real-time process allows users to frame, capture, and layer audio interactively, producing loopable musical segments through embodied interaction. The system supports a co-creative workflow where human intent and model inference shape the musical outcome. By embedding generative AI within a physical device, Lumia bridges perception and composition, introducing a new modality for creative exploration that merges vision, language, and sound. It repositions generative music not as a task of parameter tuning, but as an improvisational practice driven by contextual, sensory engagement.
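The abstract describes the capture-to-loop pipeline only at a high level. The sketch below illustrates one plausible realization of that flow, assuming the OpenAI Python SDK for the GPT-4V step and a hypothetical `text_to_music` wrapper standing in for the Stable Audio call; it is not taken from the Lumia implementation.

```python
# Minimal sketch (not from the Lumia codebase) of the capture-to-loop pipeline:
# photo -> vision-language prompt -> prompt + instrumentation -> loopable audio.
# Assumptions: OpenAI Python SDK for the GPT-4V step; `text_to_music` is a
# hypothetical stand-in for the Stable Audio generation backend.
import base64
from openai import OpenAI

client = OpenAI()


def describe_scene(image_path: str) -> str:
    """Turn a captured photo into a structured music-generation prompt."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # GPT-4V as named above; exact model identifier is an assumption
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this scene as a short text-to-music prompt: "
                         "mood, tempo, and sonic texture."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


def text_to_music(prompt: str, seconds: int) -> bytes:
    """Hypothetical stand-in for the Stable Audio text-to-music call."""
    raise NotImplementedError("Wire this to a Stable Audio inference endpoint.")


def compose_loop(image_path: str, instrumentation: str, seconds: int = 8) -> bytes:
    """Combine the scene prompt with user-selected instrumentation and
    generate one loopable audio segment for layering."""
    prompt = f"{describe_scene(image_path)} Instrumentation: {instrumentation}."
    return text_to_music(prompt, seconds)
```

In this reading, each press of the device's capture control would invoke something like `compose_loop("frame.jpg", "marimba and soft pads")`, and successive segments would be layered into the running loop.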