This model is based on a multi-stage text-to-video diffusion model. It takes a text description as input and returns a video that matches the description. Only English input is supported.
The text-to-video diffusion model consists of three sub-networks: a text feature extractor, a diffusion model that maps text features to the video latent space, and a network that maps the video latent space to the video visual space. The model has about 1.7 billion parameters in total. The diffusion model adopts a Unet3D structure and generates video by iteratively denoising a video of pure Gaussian noise.
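The iterative denoising idea can be illustrated with a toy sampling loop. This is a minimal sketch and not the model's actual sampler: it assumes a DDPM-style update with a linear noise schedule, and `toy_noise_predictor` is a hypothetical stand-in for the trained Unet3D (it ignores the text features and pretends the clean latent is all zeros). The step count, schedule values, and latent size are also illustrative assumptions.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule; the values are illustrative, not the model's."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta  # cumulative product of (1 - beta)
        alpha_bars.append(running)
    return betas, alpha_bars


def toy_noise_predictor(x_t, alpha_bar_t, text_features):
    """Hypothetical stand-in for the Unet3D denoiser.

    Assumes the clean latent is all zeros, so the noise in x_t is
    x_t / sqrt(1 - alpha_bar_t). A real model would use text_features.
    """
    scale = math.sqrt(1.0 - alpha_bar_t)
    return [v / scale for v in x_t]


def sample_video_latent(num_values, text_features, num_steps=50, seed=0):
    """Start from pure Gaussian noise and denoise it step by step."""
    rng = random.Random(seed)
    betas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in range(num_values)]  # pure noise
    for t in reversed(range(num_steps)):
        eps = toy_noise_predictor(x, alpha_bars[t], text_features)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        inv = 1.0 / math.sqrt(1.0 - betas[t])
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no noise on last step
        x = [inv * (xi - coef * ei) + sigma * rng.gauss(0.0, 1.0)
             for xi, ei in zip(x, eps)]
    return x


# Flattened latent for 4 frames of 8x8 with 4 channels (illustrative size).
latent = sample_video_latent(4 * 8 * 8 * 4, text_features=None)
print(len(latent))
```

In the real model the predictor is conditioned on the extracted text features, and the final latent is passed to the third sub-network to produce video frames.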
This model has a wide range of applications and can generate videos from arbitrary English text descriptions. Some examples follow; in each, the input text is shown above and the corresponding generated video below:
Robot dancing in times square.
the coral reef.
down the cone.
A waterfall flowing through glacier at night.
in style of van Gogh.
Tiny plant sprout coming out of the ground.
Hyper-realistic photo of an abandoned industrial site during a storm.
Balloon full of water exploding in extreme slow motion.
set on an alien planet, view of a marketplace. Pixel art.
To try the model quickly, users can refer to the Aliyun Notebook Tutorial for developing with this text-to-video model.
The model is available on ModelScope Studio and Hugging Face, where it can be tried directly.
This model currently supports inference on GPU only. It requires about 16GB of CPU RAM and 16GB of GPU memory. Under the ModelScope framework, the model can be used by calling a simple Pipeline. The input must be a dictionary whose only valid key is 'text', with a short piece of text as its value. A code example follows:
[2023.03.21 update] ModelScope released version 1.4.2, and the text-to-video-synthesis model parameter file was updated to v1.1.0.
pip install modelscope==1.4.2
pip install open_clip_torch
pip install pytorch-lightning
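from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys
# Build the text-to-video pipeline from the ModelScope model ID.
p = pipeline('text-to-video-synthesis', 'damo/text-to-video-synthesis')
# The input is a dictionary whose only valid key is 'text'.
test_text = {
    'text': 'A panda eating bamboo on a rock.',
}
# Run inference; the result holds the path of the saved video.
output_video_path = p(test_text, output_video='./output.mp4')[OutputKeys.OUTPUT_VIDEO]
print('output_video_path:', output_video_path)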
The above code prints the save path of the output mp4 video. The current encoding plays normally in VLC player; some other media players, including the system default player, may not play it correctly.
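If another player cannot open the generated file, one common workaround is to re-encode it with ffmpeg. The sketch below is an assumption and not part of the model's workflow: it presumes the problem is codec or pixel-format compatibility, that ffmpeg is installed with libx264 support, and the output file name `./output_h264.mp4` is hypothetical.

```python
import shutil
import subprocess


def build_reencode_cmd(src, dst):
    """Build an ffmpeg command that re-encodes src to H.264 / yuv420p."""
    return [
        "ffmpeg", "-y",         # overwrite dst if it already exists
        "-i", src,              # input video
        "-c:v", "libx264",      # widely supported video codec
        "-pix_fmt", "yuv420p",  # pixel format most players expect
        dst,
    ]


def reencode(src, dst):
    """Run the re-encode; raises if ffmpeg is missing or the command fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(build_reencode_cmd(src, dst), check=True)
    return dst


# Show the command that would be run for the example output file.
print(" ".join(build_reencode_cmd("./output.mp4", "./output_h264.mp4")))
```

Calling `reencode('./output.mp4', './output_h264.mp4')` after the pipeline has finished would write the re-encoded copy next to the original.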
The model is trained on public datasets such as Webvid, so the generated results may show biases related to the distribution of the training data.
The model cannot achieve perfect film and television quality generation.
The model cannot generate clear text.
The model is mainly trained on an English corpus and does not support other languages at the moment.
The model's performance on complex compositional generation tasks still needs improvement.
The model is provided for non-commercial purposes and is intended for research use only.
The model was not trained to realistically represent people or events, so using it to generate such content is beyond the model's capabilities.
It must not be used to generate content that is demeaning or harmful to people or their environment, culture, religion, etc.
It must not be used to generate pornographic, violent, or gory content.
It must not be used to generate erroneous or false information.
The training data includes LAION5B, ImageNet, Webvid, and other public datasets. Before training, images and videos are filtered by aesthetic score, watermark score, and deduplication.
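The kind of filtering described above can be sketched as follows. This is only an illustration under stated assumptions: the `Sample` fields, the thresholds `min_aesthetic` and `max_watermark`, and the exact-hash deduplication are all hypothetical, and the actual scoring and deduplication methods used for the training data are not specified here.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Sample:
    caption: str
    content: bytes          # raw image or video bytes
    aesthetic_score: float  # higher is better (hypothetical scale)
    watermark_score: float  # higher means a watermark is more likely


def filter_samples(samples, min_aesthetic=5.0, max_watermark=0.5):
    """Keep samples that pass both score thresholds and are not duplicates."""
    seen = set()
    kept = []
    for sample in samples:
        if sample.aesthetic_score < min_aesthetic:
            continue  # drop low-aesthetic samples
        if sample.watermark_score > max_watermark:
            continue  # drop likely-watermarked samples
        digest = hashlib.sha256(sample.content).hexdigest()
        if digest in seen:
            continue  # drop exact duplicates of an already kept sample
        seen.add(digest)
        kept.append(sample)
    return kept


samples = [
    Sample("a panda", b"video-1", 6.2, 0.1),
    Sample("a panda again", b"video-1", 6.0, 0.1),  # duplicate content
    Sample("blurry clip", b"video-2", 3.0, 0.1),    # low aesthetic score
    Sample("stock clip", b"video-3", 7.0, 0.9),     # likely watermarked
    Sample("a waterfall", b"video-4", 5.5, 0.2),
]
kept = filter_samples(samples)
print([s.caption for s in kept])
```

Exact hashing only catches byte-identical files; a production pipeline would more likely use near-duplicate detection.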
@inproceedings{luo2023videofusion,
title={VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation},
author={Luo, Zhengxiong and Chen, Dayou and Zhang, Yingya and Huang, Yan and Wang, Liang and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Tan, Tieniu},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2023}
}
@inproceedings{rombach2022high,
title={High-resolution image synthesis with latent diffusion models},
author={Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj{\"o}rn},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={10684--10695},
year={2022}
}
@inproceedings{Bain21,
author={Max Bain and Arsha Nagrani and G{\"u}l Varol and Andrew Zisserman},
title={Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval},
booktitle={IEEE International Conference on Computer Vision},
year={2021},
}