Controlling video and audio generation requires diverse modalities, from depth and pose to camera trajectories and audio transformations, yet existing approaches either train a single monolithic model for a fixed set of controls or introduce costly architectural changes for each new modality. We introduce AVControl, a lightweight, extendable framework built on LTX-2, a joint audio-visual foundation model, where each control modality is trained as a separate LoRA on a parallel canvas that provides the reference signal as additional tokens in the attention layers, requiring no architectural changes beyond the LoRA adapters themselves. We show that simply extending image-based in-context methods to video fails for structural control, and that our parallel canvas approach resolves this. On the VACE Benchmark, we outperform all evaluated baselines on depth- and pose-guided generation, inpainting, and outpainting, and show competitive results on camera control and audio-visual benchmarks. Our framework supports a diverse set of independently trained modalities: spatially-aligned controls such as depth, pose, and edges, camera trajectory with intrinsics, sparse motion control, video editing, and, to our knowledge, the first modular audio-visual controls for a joint generation model. Our method is both compute- and data-efficient: each modality requires only a small dataset and converges within a few hundred to a few thousand training steps, a fraction of the budget of monolithic alternatives. We publicly release our code and trained LoRA checkpoints.
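The mechanism described above, a per-modality LoRA plus a reference signal supplied as extra tokens in attention, can be illustrated with a minimal sketch. This is not the released AVControl code. It assumes a single attention head, no positional encoding, and plain NumPy arrays, and every name in it (`lora_linear`, `attention_with_canvas`, `make_demo`) is hypothetical. How the actual parallel canvas is laid out and which layers carry adapters are not specified in the abstract.

```python
import numpy as np

def lora_linear(x, w, a, b, scale=1.0):
    """Apply a frozen weight w plus a low-rank update: x @ w + scale * (x @ a) @ b."""
    return x @ w + scale * (x @ a) @ b

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_canvas(target, canvas, wq, wk, wv, lora):
    """Single-head self-attention over the concatenation [target; canvas].

    target: (n_t, d) tokens being generated
    canvas: (n_c, d) reference-signal tokens (e.g. an encoded depth map)
    lora:   dict mapping "q", "k", "v" to (a, b) low-rank pairs
    Only the rows belonging to the target tokens are returned.
    """
    tokens = np.concatenate([target, canvas], axis=0)
    q = lora_linear(tokens, wq, *lora["q"])
    k = lora_linear(tokens, wk, *lora["k"])
    v = lora_linear(tokens, wv, *lora["v"])
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (weights @ v)[: target.shape[0]]

def make_demo(seed=0, n_t=4, n_c=4, d=8, rank=2):
    """Build random toy inputs; all sizes are arbitrary illustration values."""
    rng = np.random.default_rng(seed)
    target = rng.normal(size=(n_t, d))
    canvas = rng.normal(size=(n_c, d))
    wq, wk, wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    # Common LoRA initialisation: a is small random, b is zero,
    # so the adapter is a no-op before any training.
    lora = {
        name: (rng.normal(size=(d, rank)) * 0.01, np.zeros((rank, d)))
        for name in "qkv"
    }
    return target, canvas, wq, wk, wv, lora

target, canvas, wq, wk, wv, lora = make_demo()
out = attention_with_canvas(target, canvas, wq, wk, wv, lora)
```

In this toy setup the base weights `wq`, `wk`, `wv` stay fixed and only the `(a, b)` pairs would be trained, so swapping the `lora` dictionary swaps the control modality without touching the base model.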