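This is the Large version of the mPLUG image captioning model, with about 600 million parameters. It can be fine-tuned on downstream multimodal tasks such as VQA and image captioning.
mPLUG is a multimodal foundation model that unifies understanding and generation. It proposes an efficient cross-modal fusion framework based on skip-connections. At the time the paper was released, mPLUG achieved SOTA results on VQA and MS COCO Caption. For details, see: mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
This model is mainly used to generate an answer for a given question and its corresponding image. Users can try various inputs of their own. See the code example for how to call the model.
Once MaaS-lib is installed, the image-captioning capability is ready to use.
The following is a VQA inference example based on the fine-tuned mPLUG model: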
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
model_id = 'damo/mplug_visual-question-answering_coco_large_en'
input_vqa = {
    'image': 'https://alice-open.oss-cn-zhangjiakou.aliyuncs.com/mPLUG/image_mplug_vqa.jpg',
    'question': 'What is the woman doing?',
}
pipeline_vqa = pipeline(Tasks.visual_question_answering, model=model_id)
print(pipeline_vqa(input_vqa))
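The pipeline call above takes a dict with an `image` key and a `question` key. To ask several questions about the same image, one input dict can be built per question. The following is a minimal sketch under that assumption: `build_vqa_inputs` is a hypothetical helper that is not part of modelscope, the second question is made up for illustration, and the code only prepares inputs without calling the model.

```python
from typing import Dict, List


def build_vqa_inputs(image: str, questions: List[str]) -> List[Dict[str, str]]:
    """Build one pipeline input dict per question (hypothetical helper)."""
    inputs = []
    for question in questions:
        question = question.strip()
        if not question:
            # Skip blank questions instead of sending them to the model.
            continue
        inputs.append({'image': image, 'question': question})
    return inputs


image_url = 'https://alice-open.oss-cn-zhangjiakou.aliyuncs.com/mPLUG/image_mplug_vqa.jpg'
vqa_inputs = build_vqa_inputs(
    image_url,
    ['What is the woman doing?', '   ', 'Where is she?'])
```

Each element of `vqa_inputs` has the same shape as `input_vqa` and could be passed to `pipeline_vqa` one at a time.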
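The model is trained on a dataset and may therefore carry some bias. Please evaluate it yourself before deciding how to use it.
The model is trained on the MS COCO Caption dataset, which is available for download.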
import tempfile
from modelscope.msdatasets import MsDataset
from modelscope.metainfo import Trainers
from modelscope.trainers import build_trainer
datadict = MsDataset.load('coco_captions_small_slice')
train_dataset = MsDataset(
    datadict['train'].remap_columns({
        'image:FILE': 'image',
        'answer:Value': 'answer'
    }).map(lambda _: {'question': 'what the picture describes?'}))
test_dataset = MsDataset(
    datadict['test'].remap_columns({
        'image:FILE': 'image',
        'answer:Value': 'answer'
    }).map(lambda _: {'question': 'what the picture describes?'}))
# The configuration can be modified in code
def cfg_modify_fn(cfg):
    cfg.train.hooks = [{
        'type': 'CheckpointHook',
        'interval': 2
    }, {
        'type': 'TextLoggerHook',
        'interval': 1
    }, {
        'type': 'IterTimerHook'
    }]
    return cfg
kwargs = dict(
    model='damo/mplug_image-captioning_coco_large_en',
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    max_epochs=2,
    cfg_modify_fn=cfg_modify_fn,
    work_dir=tempfile.TemporaryDirectory().name)
trainer = build_trainer(
    name=Trainers.nlp_base_trainer, default_args=kwargs)
trainer.train()
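The dataset preparation in the script above renames the raw columns `image:FILE` and `answer:Value` and attaches a fixed question to every sample. The following plain-Python sketch mimics that transformation on ordinary dicts to show the record shape the trainer is expected to receive. It does not use `MsDataset`; `rename_columns`, `add_question`, and the sample record are hypothetical stand-ins, and the assumption is that the mapped value is merged into each existing record.

```python
from typing import Any, Dict

# Column names and question text are taken from the fine-tuning script.
COLUMN_MAP = {'image:FILE': 'image', 'answer:Value': 'answer'}
QUESTION = 'what the picture describes?'


def rename_columns(record: Dict[str, Any],
                   column_map: Dict[str, str]) -> Dict[str, Any]:
    """Rename keys found in column_map and keep all other keys unchanged."""
    return {column_map.get(key, key): value for key, value in record.items()}


def add_question(record: Dict[str, Any], question: str) -> Dict[str, Any]:
    """Return a copy of the record with a constant 'question' field added."""
    return {**record, 'question': question}


# Made-up sample record, for illustration only.
raw = {'image:FILE': 'images/0001.jpg', 'answer:Value': 'a dog runs on the beach'}
prepared = add_question(rename_columns(raw, COLUMN_MAP), QUESTION)
```

After this step each sample carries the three fields `image`, `answer`, and `question`, so a captioning sample looks like a VQA sample whose question is always the same.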
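Among models of comparable size and pre-training data, mPLUG achieves SOTA on the VQA dataset and ranks near the top of the VQA leaderboard.
If our model is helpful to you, please cite our paper: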
@inproceedings{li2022mplug,
title={mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections},
author={Li, Chenliang and Xu, Haiyang and Tian, Junfeng and Wang, Wei and Yan, Ming and Bi, Bin and Ye, Jiabo and Chen, Hehong and Xu, Guohai and Cao, Zheng and Zhang, Ji and Huang, Songfang and Huang, Fei and Zhou, Jingren and Si, Luo},
year={2022},
journal={arXiv}
}