Visual instruction tuning towards large language and vision models with GPT-4 level capabilities.
[Project Page] [Paper]
Visual Instruction Tuning
Haotian Liu*, Chunyuan Li*, Qingyang Wu, Yong Jae Lee (*Equal Contribution)
Logo generated by GLIGEN via the prompt "a cute lava llama with glasses" and a box prompt
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
conda create -n llava python=3.10 -y
conda activate llava
# run the following from the directory that contains pyproject.toml
pip install --upgrade pip # enable PEP 660 support
pip install -e .
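After installation, a quick sanity check can confirm the editable install is visible from the active environment. This is a minimal sketch; it assumes the package name llava defined in the repository's pyproject.toml.

# verify the editable install is importable from the active conda environment
import llava
print("llava package imported successfully")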
The complete 13B weights required to run the model are provided; use of these weights must comply with the LLaMA model license.
The current implementation only supports a single-turn Q&A session; an interactive CLI is a work in progress (WIP).
This also serves as an example for users to build customized inference scripts.
(At runtime, copy the ms_wrapper.py file into the cloned git repository directory, then add and run the following code.)
# build a ModelScope pipeline for LLaVA visual question answering
from modelscope.pipelines import pipeline

model_id = 'xingzi/llava_visual-question-answering'
inference = pipeline('llava-task', model=model_id, model_revision='v1.1.0')

# a single-turn query about an example image
image_file = "https://llava-vl.github.io/static/images/view.jpg"
query = "What are the things I should be cautious about when I visit here?"
conv_mode = None  # use the default conversation template

inputs = {'image_file': image_file, 'query': query}
output = inference(inputs, conv_mode=conv_mode)
print(output)
# Note: loading the model may take several minutes.
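Since this snippet is meant as a starting point for customized inference scripts, the same pipeline call can be wrapped in a simple loop for batch inference. The sketch below reuses the inference object created above; the list of image/question pairs is illustrative.

# a minimal sketch of batch inference over several image/question pairs,
# reusing the 'llava-task' pipeline created above (examples are illustrative)
examples = [
    ("https://llava-vl.github.io/static/images/view.jpg", "Describe this scene in one sentence."),
    ("https://llava-vl.github.io/static/images/view.jpg", "What season does this photo appear to be taken in?"),
]
for image_file, query in examples:
    result = inference({'image_file': image_file, 'query': query}, conv_mode=None)
    print(query, '->', result)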
If you find LLaVA useful for your research and applications, please cite using this BibTeX:
@misc{liu2023llava,
      title={Visual Instruction Tuning},
      author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
      publisher={arXiv:2304.08485},
      year={2023},
}