ONE-PEACE General Representation Model - English - 4B
ONE-PEACE is a 4B-parameter general representation model covering vision, language, and audio. It produces unified representations for the three modalities and supports cross-modal retrieval among them.

ONE-PEACE



Paper   |   Demo   |   Checkpoints   |   Datasets   |   GitHub



What is ONE-PEACE

ONE-PEACE is a general representation model spanning vision, language, and audio. It sets new state-of-the-art results on semantic segmentation, audio-text retrieval, audio classification, and visual grounding, and achieves highly competitive results on video classification, image classification, image-text retrieval, and classic vision-language benchmarks.
In addition, the model exhibits emergent zero-shot capabilities, i.e., it aligns modality combinations that never appear together in our pretraining data, such as audio with images, or audio plus text with images.

The figure below shows the architecture and pretraining tasks of ONE-PEACE. Thanks to its scaling-friendly architecture and modality-agnostic tasks, ONE-PEACE has the potential to extend to unlimited modalities.



Getting started with ONE-PEACE

Setup

# ModelScope notebooks come with modelscope pre-installed; otherwise install it first:
# pip install modelscope
git clone https://github.com/OFA-Sys/ONE-PEACE
cd ONE-PEACE 
pip install -r requirements.txt

Quick start

from modelscope.models import Model
from modelscope.pipelines import pipeline

# build the multimodal embedding pipeline (use_gpu=False runs on CPU)
inference = pipeline('multimodal_embedding', model='damo/ONE-PEACE-4B', model_revision='v1.0.2', use_gpu=False)
text_features = inference(["bird", "dog", "panda"], data_type='text')
image_features = inference(
    ['https://one-peace-shanghai.oss-cn-shanghai.aliyuncs.com/modelscope_case/dog.JPEG', 'https://one-peace-shanghai.oss-cn-shanghai.aliyuncs.com/modelscope_case/panda.JPEG'],
    data_type='image'
)
audio_features = inference(
    ['https://one-peace-shanghai.oss-cn-shanghai.aliyuncs.com/modelscope_case/bird.flac', 'https://one-peace-shanghai.oss-cn-shanghai.aliyuncs.com/modelscope_case/dog.flac'],
    data_type='audio'
)

# compute similarity
i2t_similarity = image_features @ text_features.T
a2t_similarity = audio_features @ text_features.T
print("Image-to-text similarities:", i2t_similarity)
print("Audio-to-text similarities:", a2t_similarity)



Why choose ONE-PEACE as your multimodal representation model?

As a 4B-parameter general representation model, ONE-PEACE achieves leading results across a range of vision, audio, and multimodal tasks.
It also offers strong multimodal retrieval capabilities, supporting cross-modal retrieval among images, text, and audio.

Downstream task results

Vision tasks

| Task | Image classification | Semantic Segmentation | Object Detection (w/o Object365) | Video Action Recognition |
| --- | --- | --- | --- | --- |
| Dataset | ImageNet-1K | ADE20K | COCO | Kinetics 400 |
| Split | val | val | val | val |
| Metric | Acc. | mIoU (ss) / mIoU (ms) | AP (box) / AP (mask) | Top-1 Acc. / Top-5 Acc. |
| ONE-PEACE | 89.8 | 62.0 / 63.0 | 60.4 / 52.9 | 88.1 / 97.8 |

Audio and audio-text tasks

| Task | Audio-Text Retrieval | | | | Audio Classification | | | Audio Question Answering |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dataset | AudioCaps | | Clotho | | ESC-50 | FSD50K | VGGSound (Audio Only) | AVQA (Audio + Question) |
| Split | test | | evaluation | | full | eval | test | val |
| Metric | T2A R@1 | A2T R@1 | T2A R@1 | A2T R@1 | Zero-shot Acc. | mAP | Acc. | Acc. |
| ONE-PEACE | 42.5 | 51.0 | 22.4 | 27.1 | 91.8 | 69.7 | 59.6 | 86.2 |

Vision-language tasks

| Task | Image-Text Retrieval (w/o ranking) | | | | Visual Grounding | | | VQA | Visual Reasoning |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dataset | COCO | | Flickr30K | | RefCOCO | RefCOCO+ | RefCOCOg | VQAv2 | NLVR2 |
| Split | test | | test | | val / testA / testB | val / testA / testB | val-u / test-u | test-dev / test-std | dev / test-P |
| Metric | I2T R@1 | T2I R@1 | I2T R@1 | T2I R@1 | Acc@0.5 | Acc@0.5 | Acc@0.5 | Acc. | Acc. |
| ONE-PEACE | 84.1 | 65.4 | 97.6 | 89.6 | 92.58 / 94.18 / 89.26 | 88.77 / 92.21 / 83.23 | 89.22 / 89.27 | 82.6 / 82.5 | 87.8 / 88.3 |

Multimodal retrieval

As illustrated by the cases below, ONE-PEACE supports audio-to-image, audio+image-to-image, and audio+text-to-image retrieval; a minimal code sketch for audio-to-image retrieval follows the figures.

Figure: a2i (audio-to-image retrieval)

Figure: at2i (audio+text-to-image retrieval)

Figure: ai2i (audio+image-to-image retrieval)
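
For reference, here is a minimal sketch of audio-to-image retrieval built on the same multimodal_embedding pipeline as the quick-start example. It reuses the demo image and audio URLs from above; any of your own files or URLs would work the same way.

from modelscope.pipelines import pipeline

# same pipeline as in the quick-start example
inference = pipeline('multimodal_embedding', model='damo/ONE-PEACE-4B', model_revision='v1.0.2', use_gpu=False)

# gallery of candidate images and a single audio query (demo assets from the quick-start)
image_urls = [
    'https://one-peace-shanghai.oss-cn-shanghai.aliyuncs.com/modelscope_case/dog.JPEG',
    'https://one-peace-shanghai.oss-cn-shanghai.aliyuncs.com/modelscope_case/panda.JPEG',
]
audio_query = ['https://one-peace-shanghai.oss-cn-shanghai.aliyuncs.com/modelscope_case/dog.flac']

image_features = inference(image_urls, data_type='image')
audio_features = inference(audio_query, data_type='audio')

# rank gallery images by similarity to the audio query (a2i retrieval)
a2i_similarity = audio_features @ image_features.T
best = a2i_similarity.argmax(-1).tolist()[0]
print("The audio query retrieves:", image_urls[best])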


Limitations and potential biases

The model is trained mainly on open-source English data, so its representation quality for Chinese may be limited.


Related papers and citation

If you find ONE-PEACE useful, please cite our work:

@article{wang2023one,
  title={ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities},
  author={Wang, Peng and Wang, Shijie and Lin, Junyang and Bai, Shuai and Zhou, Xiaohuan and Zhou, Jingren and Wang, Xinggang and Zhou, Chang},
  journal={arXiv preprint arXiv:2305.11172},
  year={2023}
}