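This model is built on a multi-stage text-to-video generation diffusion model. It takes a text description as input and returns a video that matches the description. Only English input is supported.
Some text-to-video examples are shown below, with the input text on top and the corresponding generated video underneath: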
Mountain and water, Chinese Painting
Landscape with bridge and waterfall, Chinese Painting
fireworks
Mountain river
with starry sky in the background
reaching into the clouds in a hilly forest at dawn
A litter of puppies running through the yard
Clown fish swimming through the coral reef
A panda playing on a swing set
a panda bear is eating bamboo on a rock
A musk ox grazing on beautiful wildflowers
A shark swimming in clear Caribbean ocean
sunglasses sings in a metal band on stage
Monkey learning to play the piano
Two kangaroos are busy cooking dinner in a kitchen
A knight riding on a horse
in the street in a heavy rain, oil painting
An astronaut riding a horse on fire
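The text-to-video generation diffusion model consists of five sub-networks: text feature extraction, a diffusion model that generates image features from text features, a model that generates video pixels from image features, a video frame-interpolation diffusion model, and a video super-resolution diffusion model. The full model has about 6 billion parameters and accepts English input. The diffusion models use a Unet3D architecture and perform video generation, frame interpolation, or super-resolution by iteratively denoising a video that starts as pure Gaussian noise. The structure is shown in the figure below.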
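The iterative denoising described above can be illustrated with a minimal sketch. It assumes a DDPM-style ancestral sampler with a linear noise schedule, which this model card does not confirm, and it replaces the text-conditioned Unet3D with a hypothetical `oracle_noise_predictor` that already knows the clean target. All function names, the schedule values, and the flat-list "video" are illustrative assumptions, not the model's actual implementation.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (assumed schedule, not from the model card)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def oracle_noise_predictor(x_t, alpha_bar_t, target):
    """Hypothetical stand-in for the Unet3D noise predictor.

    A real network would predict the noise from x_t, the step index, and the
    text features. This toy version computes the exact noise for a known target.
    """
    scale = math.sqrt(alpha_bar_t)
    denom = math.sqrt(1.0 - alpha_bar_t)
    return [(x - scale * x0) / denom for x, x0 in zip(x_t, target)]


def sample(target, num_steps=50, seed=0):
    """Run DDPM-style ancestral sampling from pure Gaussian noise."""
    rng = random.Random(seed)
    betas = linear_beta_schedule(num_steps)
    alphas = [1.0 - b for b in betas]

    # Cumulative products of alphas, one per step.
    alpha_bars = []
    running = 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)

    # Start from pure Gaussian noise with the same number of elements as the target.
    x = [rng.gauss(0.0, 1.0) for _ in target]

    for t in reversed(range(num_steps)):
        eps = oracle_noise_predictor(x, alpha_bars[t], target)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * ei) / math.sqrt(alphas[t]) for xi, ei in zip(x, eps)]
        if t > 0:
            # Add fresh noise on every step except the last one.
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean
    return x
```

Calling `sample([0.5] * (2 * 4 * 4))` treats a flat list of 32 values as a tiny 2-frame, 4x4 "video". Because the stand-in predictor is exact, the loop recovers the target; with a learned predictor the result would instead depend on the text condition and the trained weights.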
The model is broadly applicable: it can run inference on any English text description and generate a video from it.
The model is not available for download at this time. Later iterations will be released for download, so stay tuned.
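The training data includes public datasets such as LAION5B, ImageNet, and Webvid. Before pre-training, images and videos are filtered by aesthetic score, watermark score, and deduplication.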
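As a rough illustration of this kind of filtering, the sketch below keeps a sample only if it passes an aesthetic threshold and a watermark threshold and has not been seen before. The `Sample` type, the threshold values, the score directions, and the use of exact SHA-256 hashing for deduplication are all assumptions made for the example; the model card does not describe how the actual filtering was implemented.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical record for one image or video clip."""
    uri: str
    content: bytes
    aesthetic_score: float  # assumed: higher means more visually pleasing
    watermark_score: float  # assumed: higher means more likely watermarked


def filter_samples(samples, min_aesthetic=5.0, max_watermark=0.5):
    """Keep samples that pass both score thresholds and are not exact duplicates."""
    seen_digests = set()
    kept = []
    for sample in samples:
        if sample.aesthetic_score < min_aesthetic:
            continue  # drop low-aesthetic content
        if sample.watermark_score > max_watermark:
            continue  # drop content that is likely watermarked
        digest = hashlib.sha256(sample.content).hexdigest()
        if digest in seen_digests:
            continue  # drop byte-identical duplicates
        seen_digests.add(digest)
        kept.append(sample)
    return kept
```

Exact hashing only catches byte-identical copies. A production pipeline would more plausibly use perceptual or embedding-based near-duplicate detection, which is outside the scope of this sketch.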
@inproceedings{luo2023videofusion,
title={VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation},
author={Luo, Zhengxiong and Chen, Dayou and Zhang, Yingya and Huang, Yan and Wang, Liang and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Tan, Tieniu},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2023}
}
@article{ramesh2022hierarchical,
title={Hierarchical text-conditional image generation with clip latents},
author={Ramesh, Aditya and Dhariwal, Prafulla and Nichol, Alex and Chu, Casey and Chen, Mark},
journal={arXiv preprint arXiv:2204.06125},
year={2022}
}
@inproceedings{radford2021learning,
title={Learning transferable visual models from natural language supervision},
author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and others},
booktitle={International Conference on Machine Learning},
pages={8748--8763},
year={2021},
organization={PMLR}
}
@article{nichol2021glide,
title={Glide: Towards photorealistic image generation and editing with text-guided diffusion models},
author={Nichol, Alex and Dhariwal, Prafulla and Ramesh, Aditya and Shyam, Pranav and Mishkin, Pamela and McGrew, Bob and Sutskever, Ilya and Chen, Mark},
journal={arXiv preprint arXiv:2112.10741},
year={2021}
}