This model was fine-tuned from the ModelScope text-to-video model on 9,923 videos and 29,769 annotated frames. It removes the watermark artifacts present in the original model, and should be used together with zeroscope_v2_576w to generate high-resolution 1024x576 video. Using this model on its own may produce unstable results.
The text-to-video diffusion model consists of three sub-networks: a text feature extractor, a diffusion model that maps text features to the video latent space, and a decoder from the video latent space to the visual space, totaling about 1.7 billion parameters. Only English input is supported. The diffusion model uses a UNet3D architecture and generates video by iteratively denoising a sample of pure Gaussian noise.
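The iterative denoising described above can be illustrated with a toy sketch. This is not the model's actual sampler: the real model predicts noise with a text-conditioned UNet3D and follows a proper diffusion noise schedule, whereas here the noise "prediction" is a hypothetical placeholder that only shows the shape of the reverse loop.

```python
import numpy as np

def toy_reverse_diffusion(shape=(16, 3, 32, 32), steps=50, seed=0):
    """Toy illustration of reverse diffusion: start from pure Gaussian
    noise and repeatedly subtract a predicted noise component. The real
    model's eps prediction comes from a UNet3D conditioned on text."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)       # pure Gaussian noise "video"
    for t in range(steps, 0, -1):
        # Placeholder for the UNet3D noise prediction eps_theta(x, t, text).
        eps_pred = x * (t / steps)
        x = x - eps_pred / steps         # one denoising step
    return x

frames = toy_reverse_diffusion()         # frames x channels x H x W
```

Each step removes a fraction of the estimated noise, so the sample's variance shrinks as the loop runs; in the real model, the text conditioning steers this trajectory toward a video matching the prompt.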
The model has a broad range of applications: it can run inference on any English text description and generate a corresponding video. Some text-to-video examples are shown below, with the input text above and the generated video below:
A panda eating bamboo on a rock
Clown fish swimming through the coral reef
pip install modelscope==1.4.2
pip install open_clip_torch
pip install pytorch-lightning
from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys
# The number of output frames can be changed via max_frames in configuration.json
p = pipeline('text-to-video-synthesis', 'baiguan18/zeroscope_v2_xl')
test_text = {
'text': 'A panda eating bamboo on a rock',
'out_height': 576,
'out_width': 1024,
}
output_video_path = p(test_text, output_video='./output.mp4')[OutputKeys.OUTPUT_VIDEO]
print('output_video_path:', output_video_path)
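As noted in the comment above, the output frame count is controlled by `max_frames` in the model's `configuration.json`. A minimal sketch of editing that value programmatically; the nested key path (`model.model_args.max_frames`) is an assumption based on typical ModelScope config layouts, so inspect the downloaded file to confirm where `max_frames` actually lives:

```python
import json
from pathlib import Path

def set_max_frames(config_path, max_frames):
    """Read a ModelScope configuration.json, set max_frames, and write it
    back. The key path model.model_args.max_frames is an assumption --
    check your downloaded configuration.json for the real location."""
    path = Path(config_path)
    cfg = json.loads(path.read_text())
    cfg.setdefault('model', {}).setdefault('model_args', {})['max_frames'] = max_frames
    path.write_text(json.dumps(cfg, indent=2, ensure_ascii=False))
    return cfg
```

Run this before constructing the pipeline so the new value is picked up when the model loads.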
The code above prints the path where the output video is saved. With the current encoding format, the video plays correctly in the VLC player; the system default player may fail to play videos generated by this model.
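If the system default player cannot decode the output, re-encoding to H.264 with a widely supported pixel format usually helps. A sketch that builds the ffmpeg command (running it requires ffmpeg on your PATH; the file names are illustrative):

```python
import subprocess

def build_reencode_cmd(src, dst):
    """Build an ffmpeg command that re-encodes a video to H.264 with the
    yuv420p pixel format so common system players can decode it."""
    return ['ffmpeg', '-y', '-i', src,
            '-c:v', 'libx264', '-pix_fmt', 'yuv420p', dst]

# To actually run the re-encode (requires ffmpeg installed):
# subprocess.run(build_reencode_cmd('./output.mp4', './output_h264.mp4'), check=True)
```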
The training data includes public datasets such as LAION5B, ImageNet, and WebVid. Before pre-training, images and videos were filtered by aesthetic score, watermark score, deduplication, and similar criteria.
@inproceedings{luo2023videofusion,
title={VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation},
author={Luo, Zhengxiong and Chen, Dayou and Zhang, Yingya and Huang, Yan and Wang, Liang and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Tan, Tieniu},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2023}
}
@inproceedings{rombach2022high,
title={High-resolution image synthesis with latent diffusion models},
author={Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj{\"o}rn},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={10684--10695},
year={2022}
}
@inproceedings{Bain21,
author={Max Bain and Arsha Nagrani and G{\"u}l Varol and Andrew Zisserman},
title={Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval},
booktitle={IEEE International Conference on Computer Vision},
year={2021},
}