BioVid: Autoregressive Video Generation with Biological Behavior Semantic Comprehension

Existing video generation frameworks treat sequence duration as an externally prescribed parameter, set by a fixed frame count or a text prompt, so the temporal boundaries of the generated clips are decoupled from the statistical structure of real behavioral data. This assumption is misaligned with biological behavior, where action duration varies naturally across individuals and instances and is encoded in the data itself. We present BioVid, a data-driven autoregressive video generation framework that learns the temporal structure of biological behaviors, including their natural length distributions, directly from training data. In the first stage, a Finite Scalar Quantization GAN tokenizer (FSQ-R3GAN) encodes each video frame into a compact discrete representation, combining the stabilized relativistic training objective of R3GAN with FSQ's guaranteed codebook utilization to achieve high-fidelity spatial reconstruction without codebook collapse. In the second stage, a causal Transformer models the resulting token sequences autoregressively and learns to emit an End-of-Sequence (EOS) token when the behavioral event reaches semantic closure, so that the termination distribution emerges from the training data rather than from any human-specified constraint. Experiments on a human drinking behavior dataset (NTU RGB+D, A001, n=94) show that the length distribution of BioVid's generated videos closely matches that of held-out test data: the Wasserstein-1 distance to the ground truth is 1.24, compared with 6.05 for a fixed-length baseline and 15.48 for VideoGPT, while spatial fidelity remains competitive.
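To make the length-distribution metric concrete, the following is a minimal sketch of the Wasserstein-1 distance between two empirical samples of sequence lengths, computed as the area between their empirical CDFs. It is not the evaluation code behind the reported numbers: the function name `wasserstein1` and the toy length lists are assumptions introduced purely for illustration.

```python
from bisect import bisect_right


def wasserstein1(a, b):
    """Wasserstein-1 distance between two 1-D empirical samples.

    Computed as the integral of |F_a(x) - F_b(x)| over x, where F_a and
    F_b are the empirical CDFs. The samples may differ in size.
    """
    a, b = sorted(a), sorted(b)
    grid = sorted(a + b)  # every point where either CDF can jump
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        cdf_a = bisect_right(a, left) / len(a)  # fraction of a <= left
        cdf_b = bisect_right(b, left) / len(b)  # fraction of b <= left
        total += abs(cdf_a - cdf_b) * (right - left)
    return total


# Toy lengths in frames, invented for illustration only.
real_lengths = [70, 80, 90]
fixed_lengths = [80, 80, 80]  # a fixed-length generator has no spread
print(wasserstein1(fixed_lengths, real_lengths))
```

Because a fixed-length generator places all of its mass on a single length, its distance to the real distribution can never fall below the mean absolute deviation of the real lengths around that value. A model that samples its own EOS position has no such floor.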
