Paper   |   Demo   |   Checkpoints   |   Datasets   |   GitHub
ONE-PEACE is a general representation model spanning vision, language, and audio. It sets new SOTA results on semantic segmentation, audio-text retrieval, audio classification, and visual grounding, and achieves leading results on video classification, image classification, image-text retrieval, and classic multimodal benchmarks.
In addition, the model exhibits emergent zero-shot capabilities: it aligns modality pairs such as audio to image, or audio plus text to image, even though no such paired data appears in our pretraining datasets.
The figure below shows the architecture and pretraining tasks of ONE-PEACE. Thanks to its scaling-friendly architecture and modality-agnostic tasks, ONE-PEACE has the potential to extend to unlimited modalities.
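As a rough, hypothetical PyTorch sketch of this design (class and parameter names are ours, not the released code): the self-attention layers are shared across modalities while each modality keeps its own feed-forward branch, so supporting a new modality mainly means adding another branch.

```python
import torch.nn as nn

class OnePeaceStyleBlock(nn.Module):
    """Illustrative Transformer block: shared self-attention, per-modality FFNs."""
    def __init__(self, dim=512, num_heads=8, modalities=('vision', 'audio', 'language')):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # One attention module shared by all modalities
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # A separate FFN per modality; a new modality only adds an entry here
        self.ffn = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for m in modalities
        })

    def forward(self, x, modality):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # shared self-attention
        x = x + self.ffn[modality](self.norm2(x))          # modality-specific FFN
        return x
```

In ONE-PEACE this idea is paired with modality adapters in front of the encoder, which is what keeps the pretraining tasks modality-agnostic.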
# Installation is not needed inside a ModelScope notebook
# pip install modelscope
git clone https://github.com/OFA-Sys/ONE-PEACE
cd ONE-PEACE
pip install -r requirements.txt
from modelscope.pipelines import pipeline

# Load the ONE-PEACE multimodal embedding pipeline (CPU inference)
inference = pipeline('multimodal_embedding', model='damo/ONE-PEACE-4B', model_revision='v1.0.2', use_gpu=False)

# Extract text embeddings
text_features = inference(["bird", "dog", "panda"], data_type='text')

# Extract image embeddings
image_features = inference(
    ['https://one-peace-shanghai.oss-cn-shanghai.aliyuncs.com/modelscope_case/dog.JPEG',
     'https://one-peace-shanghai.oss-cn-shanghai.aliyuncs.com/modelscope_case/panda.JPEG'],
    data_type='image'
)

# Extract audio embeddings
audio_features = inference(
    ['https://one-peace-shanghai.oss-cn-shanghai.aliyuncs.com/modelscope_case/bird.flac',
     'https://one-peace-shanghai.oss-cn-shanghai.aliyuncs.com/modelscope_case/dog.flac'],
    data_type='audio'
)

# Compute cross-modal similarities between the embeddings
i2t_similarity = image_features @ text_features.T
a2t_similarity = audio_features @ text_features.T
print("Image-to-text similarities:", i2t_similarity)
print("Audio-to-text similarities:", a2t_similarity)
As a 4B-scale general representation model, ONE-PEACE achieves leading results across a range of vision, audio, and multimodal tasks. It also offers strong multimodal retrieval, supporting retrieval across all pairings of image, text, and audio.
| Task | Image Classification | Semantic Segmentation | Object Detection (w/o Object365) | Video Action Recognition |
| --- | --- | --- | --- | --- |
| Dataset | ImageNet-1K | ADE20K | COCO | Kinetics 400 |
| Split | val | val | val | val |
| Metric | Acc. | mIoU<sup>ss</sup> / mIoU<sup>ms</sup> | AP<sup>box</sup> / AP<sup>mask</sup> | Top-1 Acc. / Top-5 Acc. |
| ONE-PEACE | 89.8 | 62.0 / 63.0 | 60.4 / 52.9 | 88.1 / 97.8 |
| Task | Audio-Text Retrieval | | | | Audio Classification | | | Audio Question Answering |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dataset | AudioCaps | | Clotho | | ESC-50 | FSD50K | VGGSound (Audio Only) | AVQA (Audio + Question) |
| Split | test | | evaluation | | full | eval | test | val |
| Metric | T2A R@1 | A2T R@1 | T2A R@1 | A2T R@1 | Zero-shot Acc. | MAP | Acc. | Acc. |
| ONE-PEACE | 42.5 | 51.0 | 22.4 | 27.1 | 91.8 | 69.7 | 59.6 | 86.2 |
| Task | Image-Text Retrieval (w/o ranking) | | | | Visual Grounding | | | VQA | Visual Reasoning |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dataset | COCO | | Flickr30K | | RefCOCO | RefCOCO+ | RefCOCOg | VQAv2 | NLVR2 |
| Split | test | | test | | val / testA / testB | val / testA / testB | val-u / test-u | test-dev / test-std | dev / test-P |
| Metric | I2T R@1 | T2I R@1 | I2T R@1 | T2I R@1 | Acc@0.5 | Acc@0.5 | Acc@0.5 | Acc. | Acc. |
| ONE-PEACE | 84.1 | 65.4 | 97.6 | 89.6 | 92.58 / 94.18 / 89.26 | 88.77 / 92.21 / 83.23 | 89.22 / 89.27 | 82.6 / 82.5 | 87.8 / 88.3 |
As shown in the figure below, the cases demonstrate ONE-PEACE's audio-to-image, audio+image-to-image, and audio+text-to-image retrieval abilities.
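Below is a minimal sketch of how such an audio+text-to-image query could be composed with the pipeline from the quick-start example. Fusing the query by summing the two normalized embeddings, and the query text "grass", are our assumptions for illustration, not the official demo's method.

```python
import torch.nn.functional as F

# Compose an audio+text query against the image embeddings computed earlier
audio_query = inference(
    ['https://one-peace-shanghai.oss-cn-shanghai.aliyuncs.com/modelscope_case/dog.flac'],
    data_type='audio'
)
text_query = inference(["grass"], data_type='text')  # hypothetical query text
query = F.normalize(audio_query + text_query, dim=-1)  # fused audio+text query
scores = query @ image_features.T  # rank the candidate images from the example above
print("Audio+text-to-image scores:", scores)
```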
The model is trained mainly on open-source English data, so its representation quality for Chinese may be limited.
If you find ONE-PEACE useful, please cite our work:
@article{wang2023one,
  title={ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities},
  author={Wang, Peng and Wang, Shijie and Lin, Junyang and Bai, Shuai and Zhou, Xiaohuan and Zhou, Jingren and Wang, Xinggang and Zhou, Chang},
  journal={arXiv preprint arXiv:2305.11172},
  year={2023}
}