Video generation has recently made striking visual progress, but maintaining coherent object motion and interactions remains difficult. We trace this difficulty to two practical bottlenecks: (i) human-provided motion hints (e.g., small 2D maps) often collapse to too few effective tokens after encoding, weakening guidance; and (ii) optimizing for appearance and motion in a single head can favor texture over temporal consistency. We present STANCE, an image-to-video framework that addresses both issues with two simple components. First, we introduce Instance Cues -- a pixel-aligned control signal that turns sparse, user-editable hints into a dense 2.5D (camera-relative) motion field by averaging per-instance flow and augmenting it with monocular depth over the instance mask. This reduces depth ambiguity compared to 2D arrow inputs while remaining easy to use. Second, we preserve the salience of these cues in token space with Dense RoPE, which tags a small set of motion tokens (anchored on the first frame) with spatially addressable rotary embeddings. Paired with joint RGB + auxiliary-map prediction (segmentation or depth), our model anchors structure through the auxiliary map while RGB handles appearance, stabilizing optimization and improving temporal coherence without requiring per-frame trajectory scripts.
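A minimal sketch of the Instance Cues construction described above, assuming boolean per-instance masks, a 2D flow field, and a monocular depth map as inputs; the tensor layout and function name are illustrative, not the paper's implementation.

```python
import torch

def instance_cues(flow: torch.Tensor,   # (2, H, W) optical flow (dx, dy)
                  depth: torch.Tensor,  # (H, W) monocular depth estimate
                  masks: torch.Tensor   # (N, H, W) boolean instance masks
                  ) -> torch.Tensor:
    """Turn sparse per-instance hints into a dense 2.5D (camera-relative)
    motion field: each instance's pixels carry that instance's mean 2D flow,
    and per-pixel monocular depth is appended as a third channel."""
    H, W = depth.shape
    cue = torch.zeros(3, H, W, dtype=flow.dtype)
    for m in masks:                                       # one mask per instance
        area = m.sum().clamp(min=1)
        mean_flow = (flow * m).flatten(1).sum(1) / area   # (2,) average flow over the mask
        cue[:2, m] = mean_flow[:, None]                   # broadcast mean flow inside the mask
        cue[2, m] = depth[m]                              # keep per-pixel depth inside the mask
    return cue
```

Averaging flow per instance keeps the signal user-editable (one vector per object can be dragged or rescaled), while the depth channel supplies the camera-relative component that a flat 2D arrow cannot express.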
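A minimal sketch of spatially addressable rotary embeddings for the motion tokens ("Dense RoPE" in the abstract), assuming each motion token is anchored to a first-frame (x, y) coordinate and that half the channels are rotated by x and half by y; the function names and channel split are assumptions for illustration.

```python
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Standard rotary embedding over the last dim of x, indexed by positions pos (T,)."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)   # (half,) inverse frequencies
    angles = pos[:, None] * freqs[None, :]                        # (T, half) rotation angles
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def dense_rope(motion_tokens: torch.Tensor,  # (T, D) motion tokens
               coords: torch.Tensor          # (T, 2) first-frame (x, y) anchor positions
               ) -> torch.Tensor:
    """Rotate the first half of the channels by each token's x coordinate and the
    second half by its y coordinate, so every motion token stays addressable at
    its pixel location instead of collapsing after encoding."""
    D = motion_tokens.shape[-1]
    x_part = rope_1d(motion_tokens[:, : D // 2], coords[:, 0])
    y_part = rope_1d(motion_tokens[:, D // 2 :], coords[:, 1])
    return torch.cat([x_part, y_part], dim=-1)
```

Tying the rotary phase to first-frame pixel coordinates is what keeps a small set of motion tokens distinguishable in attention, which is the salience-preservation property the abstract attributes to Dense RoPE.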