This model was fine-tuned from the ModelScope text-to-video model on 9,923 videos and 29,769 annotated frames. It removes the watermark artifacts present in the original model, and should be used together with zeroscope_v2_576w to generate high-resolution 1024x576 video. Using this model on its own may produce unstable results.
The text-to-video diffusion model consists of three sub-networks: a text feature extractor, a diffusion model that maps text features to the video latent space, and a decoder from the video latent space to the visual space, totaling about 1.7 billion parameters. Only English input is supported. The diffusion model uses a UNet3D architecture and generates video by iteratively denoising a sample of pure Gaussian noise.
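The iterative denoising described above can be illustrated with a toy sketch. This is not the model's actual sampler: the real model predicts noise with a text-conditioned UNet3D and follows a proper diffusion noise schedule, whereas here the noise "prediction" is a hypothetical placeholder that only shows the shape of the reverse loop.

```python
import numpy as np

def toy_reverse_diffusion(shape=(16, 3, 32, 32), steps=50, seed=0):
    """Toy illustration of reverse diffusion: start from pure Gaussian
    noise and repeatedly subtract a predicted noise component. The real
    model's eps prediction comes from a UNet3D conditioned on text."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)       # pure Gaussian noise "video"
    for t in range(steps, 0, -1):
        # Placeholder for the UNet3D noise prediction eps_theta(x, t, text).
        eps_pred = x * (t / steps)
        x = x - eps_pred / steps         # one denoising step
    return x

frames = toy_reverse_diffusion()         # frames x channels x H x W
```

Each step removes a fraction of the estimated noise, so the sample's variance shrinks as the loop runs; in the real model, the text conditioning steers this trajectory toward a video matching the prompt.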
The model has a broad range of applications: it can run inference on any English text description and generate a corresponding video. Some text-to-video examples are shown below, with the input text above and the generated video below:
A panda eating bamboo on a rock
Clown fish swimming through the coral reef
pip install modelscope==1.4.2
pip install open_clip_torch
pip install pytorch-lightning
from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys
# The number of output frames can be changed via max_frames in configuration.json
p = pipeline('text-to-video-synthesis', 'baiguan18/zeroscope_v2_xl')
test_text = {
'text': 'A panda eating bamboo on a rock',
'out_height': 576,
'out_width': 1024,
}
output_video_path = p(test_text, output_video='./output.mp4')[OutputKeys.OUTPUT_VIDEO]
print('output_video_path:', output_video_path)
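As noted in the comment above, the output frame count is controlled by `max_frames` in the model's `configuration.json`. A minimal sketch of editing that value programmatically; the nested key path (`model.model_args.max_frames`) is an assumption based on typical ModelScope config layouts, so inspect the downloaded file to confirm where `max_frames` actually lives:

```python
import json
from pathlib import Path

def set_max_frames(config_path, max_frames):
    """Read a ModelScope configuration.json, set max_frames, and write it
    back. The key path model.model_args.max_frames is an assumption --
    check your downloaded configuration.json for the real location."""
    path = Path(config_path)
    cfg = json.loads(path.read_text())
    cfg.setdefault('model', {}).setdefault('model_args', {})['max_frames'] = max_frames
    path.write_text(json.dumps(cfg, indent=2, ensure_ascii=False))
    return cfg
```

Run this before constructing the pipeline so the new value is picked up when the model loads.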
The code above prints the path where the output video is saved. With the current encoding format, the video plays correctly in the VLC player; the system default player may fail to play videos generated by this model.
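If the system default player cannot decode the output, re-encoding to H.264 with a widely supported pixel format usually helps. A sketch that builds the ffmpeg command (running it requires ffmpeg on your PATH; the file names are illustrative):

```python
import subprocess

def build_reencode_cmd(src, dst):
    """Build an ffmpeg command that re-encodes a video to H.264 with the
    yuv420p pixel format so common system players can decode it."""
    return ['ffmpeg', '-y', '-i', src,
            '-c:v', 'libx264', '-pix_fmt', 'yuv420p', dst]

# To actually run the re-encode (requires ffmpeg installed):
# subprocess.run(build_reencode_cmd('./output.mp4', './output_h264.mp4'), check=True)
```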
The training data includes public datasets such as LAION5B, ImageNet, and WebVid. Before pre-training, images and videos were filtered by aesthetic score, watermark score, deduplication, and similar criteria.
@inproceedings{luo2023videofusion,
title={VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation},
author={Luo, Zhengxiong and Chen, Dayou and Zhang, Yingya and Huang, Yan and Wang, Liang and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Tan, Tieniu},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2023}
}
@inproceedings{rombach2022high,
title={High-resolution image synthesis with latent diffusion models},
author={Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj{\"o}rn},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={10684--10695},
year={2022}
}
@inproceedings{Bain21,
author={Max Bain and Arsha Nagrani and G{\"u}l Varol and Andrew Zisserman},
title={Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval},
booktitle={IEEE International Conference on Computer Vision},
year={2021},
}