Visual instruction tuning towards large language and vision models with GPT-4 level capabilities.
[Project Page] [Paper]
Visual Instruction Tuning
Haotian Liu*, Chunyuan Li*, Qingyang Wu, Yong Jae Lee (*Equal Contribution)
Logo generated by GLIGEN via the prompt "a cute lava llama with glasses" and a box prompt
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
conda create -n llava python=3.10 -y
conda activate llava
# run the following from the directory that contains pyproject.toml
pip install --upgrade pip # enable PEP 660 support
pip install -e .
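After installation, a quick sanity check can confirm the editable install is visible from the active environment. This is a minimal sketch; it assumes the package name llava defined in the repository's pyproject.toml.

# verify the editable install is importable from the active conda environment
import llava
print("llava package imported successfully")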
The complete 13B weights required to run the model are provided; use of these weights must comply with the LLaMA model license.
The current implementation only supports a single-turn Q&A session; an interactive CLI is a work in progress (WIP).
This also serves as an example for users to build customized inference scripts.
(At runtime, copy the ms_wrapper.py file into the cloned git repository directory, then add and run the following code.)
# build a ModelScope pipeline for LLaVA visual question answering
from modelscope.pipelines import pipeline

model_id = 'xingzi/llava_visual-question-answering'
inference = pipeline('llava-task', model=model_id, model_revision='v1.1.0')

# a single-turn query about an example image
image_file = "https://llava-vl.github.io/static/images/view.jpg"
query = "What are the things I should be cautious about when I visit here?"
conv_mode = None  # use the default conversation template

inputs = {'image_file': image_file, 'query': query}
output = inference(inputs, conv_mode=conv_mode)
print(output)
# Note: loading the model may take several minutes.
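Since this snippet is meant as a starting point for customized inference scripts, the same pipeline call can be wrapped in a simple loop for batch inference. The sketch below reuses the inference object created above; the list of image/question pairs is illustrative.

# a minimal sketch of batch inference over several image/question pairs,
# reusing the 'llava-task' pipeline created above (examples are illustrative)
examples = [
    ("https://llava-vl.github.io/static/images/view.jpg", "Describe this scene in one sentence."),
    ("https://llava-vl.github.io/static/images/view.jpg", "What season does this photo appear to be taken in?"),
]
for image_file, query in examples:
    result = inference({'image_file': image_file, 'query': query}, conv_mode=None)
    print(query, '->', result)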
If you find LLaVA useful for your research and applications, please cite using this BibTeX:
@misc{liu2023llava,
      title={Visual Instruction Tuning},
      author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
      publisher={arXiv:2304.08485},
      year={2023},
}