Stable Diffusion for 360 Panorama Image Generation: Text-to-360° Panorama Model

This is a text-to-360° panorama model: given a text description, it generates a 360° panoramic image end to end.

Large Model for Text-to-360° Panoramic Image Generation

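The model is based on a multi-stage text-to-image diffusion model. Given a text description, it returns a 360° panoramic image that matches the description. Only English input is supported.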

For example, the input "A living room." may produce an image like this:

A living room.

The input "The Mountains." may produce an image like this:

The Mountains.

The input "The Times Square." may produce an image like this:

The Times Square.

Model description

The model is built on Stable Diffusion v2.1, ControlNet v1.1, and diffusers.

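Intended use and scope

The model handles text input for a variety of scenes (indoor and outdoor). Given an input text, it generates the corresponding 360° panoramic image at a resolution of 3072*6144.
Inference places some demands on GPU memory: in FP16 mode with enable_xformers_memory_efficient_attention enabled, more than 20 GB of GPU memory is required.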
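The memory requirement above can be checked before loading the model. The following is a minimal sketch, not part of the model's own code: the helper names are hypothetical, and it assumes that "20 GB" can be read as 20 GiB and that total device memory is a reasonable proxy for what is available.

```python
# Threshold from the text: more than 20 GB in FP16 with xformers enabled.
# Assumption: the figure is interpreted as GiB (1024**3 bytes).
REQUIRED_GIB = 20


def has_enough_memory(total_bytes, required_gib=REQUIRED_GIB):
    """Return True if total_bytes is strictly greater than required_gib GiB."""
    return total_bytes > required_gib * 1024 ** 3


def gpu_total_bytes(device_index=0):
    """Total memory of a CUDA device in bytes, or 0 if PyTorch/CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return 0
    if not torch.cuda.is_available():
        return 0
    return torch.cuda.get_device_properties(device_index).total_memory
```

Calling `has_enough_memory(gpu_total_bytes())` before building the pipeline gives an early warning. Memory already used by other processes is not accounted for here.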

How to use the Pipeline

On the ModelScope framework, you can run the 360° panorama generation model with a simple Pipeline call by supplying an input text.

Installation

Create a virtual environment

conda create -n panogen python=3.8
conda activate panogen

Install the deep learning framework

pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116

Install the ModelScope library

pip install modelscope
pip install "modelscope[cv]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

Install RealESRGAN

Install from source

Official link: https://github.com/xinntao/Real-ESRGAN#installation

Install via pip

pip install realesrgan==0.3.0

Install other libraries

pip install -U diffusers==0.18.0
pip install xformers==0.0.16
pip install triton accelerate transformers
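Because several packages above are pinned to exact versions, it can help to confirm what actually ended up in the environment. The sketch below is illustrative only: `PINNED` and `check_pins` are hypothetical names, and the pins are copied from the pip commands above.

```python
from importlib import metadata

# Versions pinned by the pip commands in this section.
PINNED = {
    "diffusers": "0.18.0",
    "xformers": "0.0.16",
    "realesrgan": "0.3.0",
}


def check_pins(pins, get_version=metadata.version):
    """Map each package name to 'ok', 'missing', or 'mismatch (found X)'."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if found == wanted else f"mismatch (found {found})"
    return report
```

Running `print(check_pins(PINNED))` inside the `panogen` environment lists any package that is absent or at a different version.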

Inference code example

import cv2
from modelscope.outputs import OutputKeys
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

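# Only English prompts are supported.
prompt = "The living room."

# 'prompt' is the only required input field.
input_dict = {
    'prompt': prompt,
}
# Build the text-to-360-panorama pipeline.
txt2panoimg = pipeline(Tasks.text_to_360panorama_image,
                       model='damo/cv_diffusion_text-to-360panorama-image_generation')
# Run inference and save the generated panorama.
output = txt2panoimg(input_dict)[OutputKeys.OUTPUT_IMG]
cv2.imwrite('result.png', output)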

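Notes on the inference code

Pipeline initialization parameters
- Optional parameter torch_dtype: defaults to torch.float16; can be set to torch.float32.
- Optional parameter enable_xformers_memory_efficient_attention: defaults to True. Enabling it reduces GPU memory usage; it can be turned off.
Pipeline call parameters
- Input requirements: the input dictionary must contain the 'prompt' field. The other optional input fields and their default values are: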

"num_inference_steps": 20,
"guidance_scale": 7.5,
"add_prompt": "photorealistic, trend on artstation, ((best quality)), ((ultra high res))",
"negative_prompt": "persons, complex texture, small objects, sheltered, blur, worst quality, low quality, zombie, logo, text, watermark, username, monochrome, complex lighting",
"seed": -1,
"upscale": True,
"refinement": True
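As an illustration of how the call parameters fit together, the sketch below builds an input dictionary that contains the required 'prompt' plus any optional fields the caller wants to override. `build_input` and `OPTIONAL_FIELDS` are hypothetical helpers, not part of the pipeline. Rejecting unknown keys is a choice made here, and how the pipeline itself treats them is not stated in this document.

```python
# Optional input fields listed above; anything not passed keeps the pipeline default.
OPTIONAL_FIELDS = {
    "num_inference_steps",
    "guidance_scale",
    "add_prompt",
    "negative_prompt",
    "seed",
    "upscale",
    "refinement",
}


def build_input(prompt, **overrides):
    """Return an input dict holding 'prompt' and only the overridden fields."""
    if not prompt:
        raise ValueError("'prompt' is required")
    unknown = set(overrides) - OPTIONAL_FIELDS
    if unknown:
        raise KeyError(f"unknown input fields: {sorted(unknown)}")
    return {"prompt": prompt, **overrides}


# Example: fix the seed and use more denoising steps than the default of 20.
example = build_input("A living room.", seed=42, num_inference_steps=30)
```

The resulting dictionary can be passed to the pipeline in place of the minimal one from the inference example.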

Because of GPU memory limits, this project runs FP16 inference by default with enable_xformers_memory_efficient_attention set to True. To use FP32, pass torch_dtype=torch.float32 when constructing the pipeline; to disable xformers, pass enable_xformers_memory_efficient_attention=False.
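The two constructor options described above could be wired up as follows. This is a sketch under stated assumptions: `pipeline_kwargs` and `build_pipeline` are hypothetical wrappers, and the torch and modelscope imports are deferred so the option logic can be inspected without loading the model.

```python
def pipeline_kwargs(use_fp32=False, use_xformers=True):
    """Return the constructor options; the dtype is given by name, not as a torch object."""
    return {
        "torch_dtype": "float32" if use_fp32 else "float16",
        "enable_xformers_memory_efficient_attention": use_xformers,
    }


def build_pipeline(use_fp32=False, use_xformers=True):
    """Construct the panorama pipeline with the chosen precision and attention settings."""
    # Deferred imports: only needed when the model is actually loaded.
    import torch
    from modelscope.pipelines import pipeline
    from modelscope.utils.constant import Tasks

    kwargs = pipeline_kwargs(use_fp32, use_xformers)
    # Resolve the dtype name to torch.float16 or torch.float32.
    kwargs["torch_dtype"] = getattr(torch, kwargs["torch_dtype"])
    return pipeline(
        Tasks.text_to_360panorama_image,
        model="damo/cv_diffusion_text-to-360panorama-image_generation",
        **kwargs,
    )
```

For example, `build_pipeline(use_fp32=True, use_xformers=False)` would request FP32 without xformers, which can be expected to need more GPU memory than the default configuration.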

Training

This approach treats 360° panoramas as an image style and uses the DreamBooth method to fine-tune a style model on about 2,000 360° panoramic images for a total of 40 epochs.
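To get a feel for the scale of this fine-tuning run, the number of optimizer steps can be estimated from the figures above. The batch size is not given in this document, so the values used below are purely assumptions for illustration.

```python
import math


def total_steps(num_images, epochs, batch_size):
    """Optimizer steps for a run, counting the last partial batch of each epoch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return math.ceil(num_images / batch_size) * epochs


# About 2000 images and 40 epochs come from the text; the batch sizes are hypothetical.
steps_bs1 = total_steps(2000, 40, batch_size=1)
steps_bs8 = total_steps(2000, 40, batch_size=8)
```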

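Limitations and possible biases

When the input text is too long, a visible seam may appear where the left and right edges of the panorama join.
In some scenes and for some prompts, the generated panorama may not match the text description well; generating several times and keeping the best result can help.
Changing the image resolution is not currently supported.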
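The first two limitations suggest a simple workflow: generate several candidates with different seeds and keep the one with the least visible seam. The sketch below is one possible way to do that and not the authors' method. `seam_score` and `pick_best` are hypothetical helpers, the seam measure (mean absolute difference between the leftmost and rightmost pixel of each row) is a crude assumption, and `generate_fn` stands in for a call such as the pipeline followed by a grayscale conversion.

```python
def seam_score(image):
    """Mean absolute difference between the first and last pixel of every row.

    image is a sequence of rows of grayscale values (a nested list or a 2-D
    array). Color images should be converted to grayscale first.
    """
    rows = list(image)
    if not rows:
        raise ValueError("image has no rows")
    # float() avoids wrap-around when the pixels are unsigned 8-bit integers.
    total = sum(abs(float(row[0]) - float(row[-1])) for row in rows)
    return total / len(rows)


def pick_best(generate_fn, prompt, seeds, score_fn=seam_score):
    """Generate one image per seed and return (seed, image) with the lowest score."""
    best = None
    for seed in seeds:
        image = generate_fn({"prompt": prompt, "seed": seed})
        score = score_fn(image)
        if best is None or score < best[0]:
            best = (score, seed, image)
    if best is None:
        raise ValueError("seeds must not be empty")
    return best[1], best[2]
```

A low seam score does not mean the image matches the prompt, so a visual check of the selected result is still advisable. A different `score_fn` can be passed in if another quality measure is preferred.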

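Notes and citations

This model builds on several open-source projects:

Panorama data source

https://pixexid.com/search/360-panorama

If you find this model helpful, please consider citing the following papers: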

@article{ruiz2022dreambooth,
  title={DreamBooth: Fine Tuning Text-to-image Diffusion Models for Subject-Driven Generation},
  author={Ruiz, Nataniel and Li, Yuanzhen and Jampani, Varun and Pritch, Yael and Rubinstein, Michael and Aberman, Kfir},
  journal={arXiv preprint arXiv:2208.12242},
  year={2022}
}
@misc{von-platen-etal-2022-diffusers,
  author = {Patrick von Platen and Suraj Patil and Anton Lozhkov and Pedro Cuenca and Nathan Lambert and Kashif Rasul and Mishig Davaadorj and Thomas Wolf},
  title = {Diffusers: State-of-the-art diffusion models},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/diffusers}}
}
@misc{zhang2023adding,
  title={Adding Conditional Control to Text-to-Image Diffusion Models}, 
  author={Lvmin Zhang and Maneesh Agrawala},
  year={2023},
  eprint={2302.05543},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}