TAGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment - Zhuanzhi Papers

Video · Alignment · Video Generation · Policy Optimization · Integration

Jin Wang, Jianxiang Lu, Guangzheng Xu, Comi Chen, Haoyu Yang, Linqing Wang, Peng Chen, Mingtao Chen, Zhichao Hu, Longhuang Wu, Shuai Shao, Qinglin Lu, Ping Luo

From arXiv; 12 pages, 6 figures.

Recent studies have demonstrated the efficacy of integrating Group Relative Policy Optimization (GRPO) into flow matching models, particularly for text-to-image and text-to-video generation. However, we find that directly applying these techniques to image-to-video (I2V) models often fails to yield consistent reward improvements. To address this limitation, we present TAGRPO, a robust post-training framework for I2V models inspired by contrastive learning. Our approach is grounded in the observation that rollout videos generated from identical initial noise provide superior guidance for optimization. Leveraging this insight, we propose a novel GRPO loss applied to intermediate latents, encouraging direct alignment with high-reward trajectories while maximizing distance from low-reward counterparts. Furthermore, we introduce a memory bank for rollout videos to enhance diversity and reduce computational overhead. Despite its simplicity, TAGRPO achieves significant improvements over DanceGRPO in I2V generation.
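The abstract describes the loss and the memory bank only at a high level, so the following is a minimal sketch under stated assumptions, not the authors' implementation. It uses the standard GRPO group-normalized advantage (each reward standardized against its group's mean and standard deviation) and, as a swapped-in stand-in for TAGRPO's actual loss, a margin-based contrastive objective on a single intermediate latent: positive-advantage rollouts pull the prediction toward them, and negative-advantage rollouts push it away until a margin is reached. The flat list-of-floats latent, the names `trajectory_alignment_loss` and `RolloutMemoryBank`, the `margin` hinge, and the FIFO eviction policy are all hypothetical. A group is assumed to hold rollouts that share the same conditioning image and initial noise, following the observation in the abstract.

```python
import math
import random
from collections import deque
from typing import Deque, List, Sequence, Tuple

Latent = List[float]  # hypothetical: a flattened intermediate latent


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: standardize each reward against its group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def squared_distance(a: Latent, b: Latent) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def trajectory_alignment_loss(
    pred: Latent,
    rollout_latents: Sequence[Latent],
    rewards: Sequence[float],
    margin: float = 1.0,  # hypothetical hinge margin on squared distance
) -> float:
    """Hypothetical contrastive-style loss on one intermediate latent.

    Positive-advantage rollouts attract `pred` (advantage-weighted squared
    distance); negative-advantage rollouts repel it only while the squared
    distance is below `margin`, which keeps the push term bounded.
    """
    advantages = group_relative_advantages(rewards)
    total = 0.0
    for z, adv in zip(rollout_latents, advantages):
        d = squared_distance(pred, z)
        if adv > 0:
            total += adv * d                      # pull toward high-reward latents
        elif adv < 0:
            total += -adv * max(0.0, margin - d)  # push away from low-reward latents
    return total / len(rollout_latents)


class RolloutMemoryBank:
    """Hypothetical FIFO store of (latent, reward) pairs from earlier rollouts."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self.items: Deque[Tuple[Latent, float]] = deque(maxlen=capacity)
        self.rng = random.Random(seed)  # seeded for reproducible sampling

    def add(self, latent: Latent, reward: float) -> None:
        self.items.append((latent, reward))  # oldest entry is evicted when full

    def sample(self, k: int) -> List[Tuple[Latent, float]]:
        # Sample without replacement; return everything if k exceeds the size.
        return self.rng.sample(list(self.items), min(k, len(self.items)))

    def __len__(self) -> int:
        return len(self.items)


if __name__ == "__main__":
    bank = RolloutMemoryBank(capacity=4)
    bank.add([0.0, 0.0], 1.0)  # high-reward rollout latent
    bank.add([5.0, 5.0], 0.0)  # low-reward rollout latent
    past = bank.sample(2)
    latents = [z for z, _ in past]
    rewards = [r for _, r in past]
    # pred sits on the high-reward latent and outside the low-reward margin.
    print(trajectory_alignment_loss([0.0, 0.0], latents, rewards))  # 0.0
```

In this sketch, rollouts drawn from the bank are scored alongside the group, which is one plausible reading of how a memory bank could add diversity without generating extra videos. The hinge is a deliberate design choice: a plain advantage-weighted distance would let negative-advantage terms drive the loss toward negative infinity.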

Related Content

Video

KG-BERT: BERT for Knowledge Graph Completion

Zhuanzhi Member Services

195+ reads · May 31, 2020

CURL: Contrastive Unsupervised Representations for Reinforcement Learning

Zhuanzhi Member Services

42+ reads · April 11, 2020

[CVPR 2020 Oral, Oxford & Facebook] SynSin: End-to-End View Synthesis from a Single Image

Zhuanzhi Member Services

29+ reads · March 26, 2020

[AAAI 2020] Context-Transformer: Tackling Object Confusion for Few-Shot Detection

Zhuanzhi Member Services

51+ reads · March 17, 2020

[Facebook AI] Adversarial NLI: A New Benchmark for Natural Language Understanding

Zhuanzhi Member Services

11+ reads · November 2, 2019

[CVPR 2020] BEDSR-Net: A Deep Network for Shadow Removal from a Single Document Image

Zhuanzhi

12+ reads · September 30, 2020

Graph Machine Learning 2.2-2.4: Properties of Networks, Random Graph

Graphs and Recommendation

10+ reads · March 28, 2020

[CVPR 2019] Weakly Supervised Image Classification Modeling

Deep Learning Lecture Hall

38+ reads · July 25, 2019

Error Backpropagation: CNN

Statistical Learning and Visual Computing Group

30+ reads · July 12, 2018

Joint Representation Learning for Images and Text: Text2Image and Image2Text

Zhuanzhi

125+ reads · June 11, 2018

RGB-D Human Action Recognition Based on Deep Feature Learning

National Natural Science Foundation of China

4+ reads · December 31, 2015

Sequential Decision-Making for Ad hoc Agents Based on Autonomous Learning

National Natural Science Foundation of China

46+ reads · December 31, 2015

3D ISAR Tomographic Imaging of Micro-Motion Targets Based on Hierarchical Sparse Representation

National Natural Science Foundation of China

1+ reads · December 31, 2015

Portfolio Management Combining Forward-Looking and Backward-Looking Approaches

National Natural Science Foundation of China

1+ reads · December 31, 2014

Image and Video Quality Assessment Based on Combinatorial Hodge Theory

National Natural Science Foundation of China

0+ reads · December 31, 2014

Mon3tr: Monocular 3D Telepresence with Pre-built Gaussian Avatars as Amortization

arXiv

0+ reads · Jan 12

L-RAG: Balancing Context and Retrieval with Entropy-Based Lazy Loading

arXiv

0+ reads · Jan 10

PixRec: Leveraging Visual Context for Next-Item Prediction in Sequential Recommendation

arXiv

0+ reads · Jan 10

Stephanie2: Thinking, Waiting, and Making Decisions Like Humans in Step-by-Step AI Social Chat

arXiv

0+ reads · Jan 9

TF-Mamba: Text-enhanced Fusion Mamba with Missing Modalities for Robust Multimodal Sentiment Analysis

arXiv

0+ reads · Jan 9
