Tongyi Text-to-Video Generation Large Model (English, General Domain), v1.0


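This model is built on a multi-stage text-to-video diffusion model. Given a text description as input, it returns a video that matches the description. Only English input is supported.

Some text-to-video examples follow; in each case the input text is shown above and the corresponding generated video below: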


  • Mountain and water, Chinese Painting
  • Landscape with bridge and waterfall, Chinese Painting
  • fireworks
  • Mountain river
  • campfire at night in a snowy forest with starry sky in the background
  • View of a castle with fantastically high towers reaching into the clouds in a hilly forest at dawn
  • A litter of puppies running through the yard
  • Clown fish swimming through the coral reef
  • A panda playing on a swing set
  • a panda bear is eating bamboo on a rock
  • A musk ox grazing on beautiful wildflowers
  • A shark swimming in clear Carribean ocean
  • An orange cat wearing a leather jacket and sunglasses sings in a metal band on stage
  • Monkey learning to play the piano
  • Two kangaroos are busy cooking dinner in a kitchen
  • A knight riding on a horse
  • a couple in formal evening wear dancing in the street in a heavy rain,oil painting
  • An astronaut riding a horse on fire

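Model description

The text-to-video diffusion model consists of five sub-networks: text feature extraction, a text-feature-to-image-feature diffusion model, an image-feature-to-video-pixel generation model, a video frame-interpolation diffusion model, and a video super-resolution diffusion model. The full model has about 6 billion parameters and supports English input. The diffusion models use a UNet3D architecture and perform video generation, frame interpolation, or super-resolution by iteratively denoising a video of pure Gaussian noise. The overall structure is shown in the figure below.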

[Figure: model framework]
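The iterative denoising described above can be illustrated with a minimal DDPM-style sampling loop. This is only a sketch under stated assumptions: the source does not give the noise schedule, the number of steps, or the sampler, so the linear schedule, the 50 steps, and the names `linear_betas`, `zero_eps_model`, and `sample` are all hypothetical. The noise predictor is a stub standing in for the UNet3D, and the video is a flattened list of floats at a toy size.

```python
import math
import random
from typing import Callable, List

Tensor = List[float]  # a flattened video tensor, kept as a plain list for simplicity


def linear_betas(num_steps: int, start: float = 1e-4, end: float = 0.02) -> List[float]:
    """Linear noise schedule (assumed; the source does not specify one)."""
    if num_steps == 1:
        return [start]
    step = (end - start) / (num_steps - 1)
    return [start + i * step for i in range(num_steps)]


def zero_eps_model(x: Tensor, t: int) -> Tensor:
    """Hypothetical stand-in for the UNet3D noise predictor: always predicts zero noise."""
    return [0.0] * len(x)


def sample(num_values: int, num_steps: int = 50,
           eps_model: Callable[[Tensor, int], Tensor] = zero_eps_model,
           seed: int = 0) -> Tensor:
    """Run DDPM ancestral sampling, starting from pure Gaussian noise."""
    rng = random.Random(seed)
    betas = linear_betas(num_steps)

    # Cumulative products of (1 - beta_t), usually written alpha-bar_t.
    alpha_bars = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)

    x = [rng.gauss(0.0, 1.0) for _ in range(num_values)]
    for t in reversed(range(num_steps)):
        eps = eps_model(x, t)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        inv_sqrt_alpha = 1.0 / math.sqrt(1.0 - betas[t])
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no noise is added on the final step
        x = [inv_sqrt_alpha * (xi - coef * ei) + sigma * rng.gauss(0.0, 1.0)
             for xi, ei in zip(x, eps)]
    return x


# Toy size only: 16 frames, 3 channels, 8x8 pixels (the real base stage is 16x64x64).
frames, channels, height, width = 16, 3, 8, 8
video = sample(frames * channels * height * width)
print(len(video))  # 3072
```

With a trained noise predictor in place of the stub, the same loop would be conditioned on the embeddings described in the list that follows; that conditioning is left out here to keep the sketch short.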

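  • Text feature extraction: the text encoder of the image-text pre-trained model CLIP ViT-L/14@336px is used to extract text features.
  • Text-to-image-feature diffusion model: the diffusion prior, which is conditioned on the CLIP text embedding and outputs a CLIP image embedding.
  • Image-feature-to-64x64 video generation model: also a diffusion model. Its UNet3D structure is adapted from the UNet in GLIDE, the image embedding is injected via cross attention, and the output is a 16x64x64 video.
  • Video frame-interpolation diffusion model (16x64x64 to 64x64x64): a diffusion-based frame-interpolation model. Its inputs are the 16x64x64 video and the image embedding, and its output is a 64x64x64 video. The 16x64x64 video is replicated 4 times to 64x64x64 and fed in by concatenation; the image embedding is again injected via cross attention.
  • Video super-resolution diffusion model (64x64x64 to 64x256x256): a diffusion-based video super-resolution model, also with a UNet3D structure. At inference time it takes a 64x64x64 video as input and outputs a 64x256x256 video.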
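To make the shapes in the list above concrete, the sketch below traces the video size through the three video-producing stages and shows how a 16-frame clip could be replicated to 64 frames and concatenated with a noisy input for the interpolation stage. It is an illustration under assumptions, not the model's code: the helper names are hypothetical, frames are represented as lists of channel labels rather than pixel arrays, and the source does not say whether each frame is repeated in place or the whole clip is tiled (per-frame repetition is assumed here).

```python
from typing import List, Sequence, Tuple

Shape = Tuple[int, int, int]  # (frames, height, width)


def repeat_frames(frames: Sequence[list], factor: int) -> List[list]:
    """Repeat each frame `factor` times: [a, b] -> [a, a, b, b] for factor 2.

    Per-frame repetition is an assumption; the source only says the clip
    is replicated 4 times.
    """
    return [frame for frame in frames for _ in range(factor)]


def concat_channels(noisy: Sequence[list], cond: Sequence[list]) -> List[list]:
    """Concatenate two frame sequences along the channel axis, frame by frame."""
    if len(noisy) != len(cond):
        raise ValueError("frame counts must match before concatenation")
    return [n + c for n, c in zip(noisy, cond)]


def stage_shapes() -> List[Tuple[str, Shape]]:
    """Output shape after each video-producing stage, as listed in the text."""
    base: Shape = (16, 64, 64)
    interpolated: Shape = (base[0] * 4, base[1], base[2])
    upscaled: Shape = (interpolated[0], interpolated[1] * 4, interpolated[2] * 4)
    return [("base", base), ("interpolation", interpolated), ("super-resolution", upscaled)]


# Each frame is a list of channel labels standing in for real pixel data.
low_fps = [[f"frame{i}_rgb"] for i in range(16)]
cond = repeat_frames(low_fps, 4)               # 16 frames -> 64 frames
noisy = [[f"noise{i}"] for i in range(64)]     # the sample being denoised
unet_input = concat_channels(noisy, cond)      # channels: noise + condition

for name, shape in stage_shapes():
    print(name, shape)
```

Concatenating the replicated clip along the channel axis gives the denoiser a per-frame reference to the nearest low-frame-rate frame, which is one common way to condition an interpolation model.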

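Intended use and scope

The model has a broad range of applications: it can run inference on arbitrary English text descriptions and generate videos from them.

How to use

The model is not yet available for download. Later iterations will be released for download, so stay tuned.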

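Model limitations and possible biases

  • The model is trained on public datasets such as Webvid, so generated results may show biases related to the distribution of the training data.
  • The model cannot achieve perfect film- or television-quality generation.
  • The model cannot generate legible text.
  • The model was trained mainly on English corpora and does not currently support other languages.
  • The model's performance on complex compositional generation tasks still needs improvement.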

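Misuse, malicious use, and out-of-scope use

  • The model was not trained to realistically represent people or events, so using it to generate such content is beyond its capabilities.
  • It must not be used to generate content that demeans or harms people or their environment, culture, religion, and so on.
  • It must not be used to generate pornographic, violent, or gory content.
  • It must not be used to generate erroneous or false information.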

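Training data

The training data includes public datasets such as LAION5B, ImageNet, and Webvid. Images and videos are filtered before training using aesthetic scores, watermark scores, deduplication, and similar steps.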
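A filtering pass of this kind might look like the sketch below. Everything specific in it is an assumption for illustration: the source gives no thresholds, score ranges, or deduplication method, so the `Sample` fields, the cut-off values, and the exact-hash deduplication (real pipelines often use near-duplicate detection instead) are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Sample:
    content: bytes          # raw image or video bytes
    aesthetic_score: float  # higher means more pleasing (assumed convention)
    watermark_score: float  # higher means more likely watermarked (assumed convention)


def filter_samples(samples: Iterable[Sample],
                   min_aesthetic: float = 5.0,   # hypothetical threshold
                   max_watermark: float = 0.5    # hypothetical threshold
                   ) -> Iterator[Sample]:
    """Yield samples that pass the aesthetic, watermark, and exact-duplicate checks."""
    seen = set()
    for sample in samples:
        if sample.aesthetic_score < min_aesthetic:
            continue
        if sample.watermark_score > max_watermark:
            continue
        digest = hashlib.sha256(sample.content).hexdigest()
        if digest in seen:  # exact duplicate of an already accepted sample
            continue
        seen.add(digest)
        yield sample


samples = [
    Sample(b"clip-a", 6.1, 0.1),  # kept
    Sample(b"clip-a", 6.1, 0.1),  # dropped: duplicate
    Sample(b"clip-b", 3.0, 0.1),  # dropped: low aesthetic score
    Sample(b"clip-c", 7.0, 0.9),  # dropped: likely watermarked
    Sample(b"clip-d", 5.5, 0.2),  # kept
]
kept = list(filter_samples(samples))
print([s.content for s in kept])  # [b'clip-a', b'clip-d']
```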

Related papers and citation information

@inproceedings{luo2023videofusion,
  title={VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation},
  author={Luo, Zhengxiong and Chen, Dayou and Zhang, Yingya and Huang, Yan and Wang, Liang and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Tan, Tieniu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}
@article{ramesh2022hierarchical,
  title={Hierarchical text-conditional image generation with clip latents},
  author={Ramesh, Aditya and Dhariwal, Prafulla and Nichol, Alex and Chu, Casey and Chen, Mark},
  journal={arXiv preprint arXiv:2204.06125},
  year={2022}
}
@inproceedings{radford2021learning,
  title={Learning transferable visual models from natural language supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and others},
  booktitle={International Conference on Machine Learning},
  pages={8748--8763},
  year={2021},
  organization={PMLR}
}
@article{nichol2021glide,
  title={Glide: Towards photorealistic image generation and editing with text-guided diffusion models},
  author={Nichol, Alex and Dhariwal, Prafulla and Ramesh, Aditya and Shyam, Pranav and Mishkin, Pamela and McGrew, Bob and Sutskever, Ilya and Chen, Mark},
  journal={arXiv preprint arXiv:2112.10741},
  year={2021}
}