The vision encoder uses the ViT-Large-Patch14 architecture, and the text encoder uses the BERT-Base architecture.
The model has been evaluated zero-shot on several public Chinese image-text retrieval datasets and achieves state-of-the-art results.
Model | Layers | Width | Heads | Embedding dim
---|---|---|---|---
Vision Transformer | 24 | 1024 | 16 | 768
Text Transformer | 12 | 768 | 12 | 768
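Both encoders project into a shared 768-dimensional embedding space, where the image-text similarity is computed. As a rough illustration only, here is a generic dual-encoder (CLIP-style) sketch with hypothetical projection heads; the actual TEAM model additionally aligns token-level embeddings, as described in the paper cited at the end of this card:

```python
import torch
import torch.nn.functional as F

# Hypothetical projection heads mapping each encoder's output into the
# shared 768-d embedding space (dimensions taken from the table above).
vision_proj = torch.nn.Linear(1024, 768)  # ViT-Large width -> embedding dim
text_proj = torch.nn.Linear(768, 768)     # BERT-Base width -> embedding dim

def similarity(vision_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between L2-normalized image/text embeddings."""
    img_emb = F.normalize(vision_proj(vision_feat), dim=-1)
    txt_emb = F.normalize(text_proj(text_feat), dim=-1)
    return img_emb @ txt_emb.T  # (num_images, num_texts) similarity matrix

# Random features stand in for real encoder outputs in this sketch.
sim = similarity(torch.randn(2, 1024), torch.randn(3, 768))
print(sim.shape)  # torch.Size([2, 3])
```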
How to use:

Given an image and a piece of Chinese text, the pipeline returns a similarity score for the image-text pair.

Use cases:

Chinese image-text retrieval and matching, e.g., ranking candidate Chinese captions for an image or scoring the relevance of an image to a text query.

Code example:
```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# An image and two candidate Chinese captions to score against it.
test_img = 'data/test/images/multimodal_similarity.jpg'
test_str1 = '一个上了年纪的女人在城镇中骑着自行车一个黄色出租车正要从她身边驶过'
test_str2 = '穿着蓝色连衣裙的那个女人正冲着行来的车辆伸出她的手'

# Build the multi-modal similarity pipeline (the default model for this task is used).
multi_modal_similarity_pipeline = pipeline(task=Tasks.multi_modal_similarity)

# Each input is a dict holding an image path and a text.
test_input1 = {'img': test_img, 'text': test_str1}
test_input2 = {'img': test_img, 'text': test_str2}
output1 = multi_modal_similarity_pipeline(test_input1)
output2 = multi_modal_similarity_pipeline(test_input2)

print('image: {}, text: {}, similarity: {}'.format(test_img, test_str1, output1['scores']))
print('image: {}, text: {}, similarity: {}'.format(test_img, test_str2, output2['scores']))
```
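Each call returns a dictionary; the similarity score lives under the `'scores'` key. If preferred, it can also be read through the standard output-key constants (assuming the usual `modelscope.outputs` module):

```python
from modelscope.outputs import OutputKeys

print(output1[OutputKeys.SCORES])  # same value as output1['scores']
```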
Training details:

- Image input: RandomResizedCrop to 224×224, followed by a random horizontal flip (see the sketch after this list).
- Text input: at most 30 tokens are kept.
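A minimal sketch of this preprocessing; the crop scale, normalization statistics, and tokenizer name are assumptions, not taken from this card:

```python
from torchvision import transforms
from transformers import BertTokenizer

# RandomResizedCrop to 224x224 plus random horizontal flip, as listed above.
# The normalization mean/std below are standard CLIP values (an assumption).
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711)),
])

# Text input: keep at most 30 tokens (the tokenizer checkpoint is an assumption).
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
text_inputs = tokenizer('一个上了年纪的女人在城镇中骑着自行车',
                        max_length=30, truncation=True, return_tensors='pt')
```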
Training used an initial learning rate of 0.001, reduced to 1/5 of its value every 30,000 iterations, for 90,000 iterations in total.
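This schedule is a plain step decay; a minimal PyTorch sketch (the optimizer choice and the placeholder model are assumptions):

```python
import torch

model = torch.nn.Linear(768, 768)  # placeholder standing in for the actual model
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)  # optimizer type is an assumption
# Multiply the LR by 1/5 every 30,000 iterations, matching the schedule above.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30000, gamma=0.2)

for it in range(90000):  # 90,000 iterations in total
    optimizer.step()     # real forward/backward pass would go here
    scheduler.step()
    if (it + 1) % 30000 == 0:
        print(f'iteration {it + 1}: lr = {scheduler.get_last_lr()[0]:.6f}')
# Prints: 0.000200 at 30,000; 0.000040 at 60,000; 0.000008 at 90,000.
```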
The model was evaluated zero-shot on three public Chinese image-text retrieval datasets; top-1 retrieval accuracy (%) is:
Dataset | COCO-CN | Flickr30K-CN | Flickr8K-CN
---|---|---|---
Text Retrieval | 67.7 | 88.1 | 77.7
Image Retrieval | 66.7 | 69.8 | 63.3
Citation:

```bibtex
@inproceedings{TEAM2022MM,
  title     = {Token Embeddings Alignment for Cross-Modal Retrieval},
  author    = {Xie, Chen-Wei and Wu, Jianmin and Zheng, Yun and Pan, Pan and Hua, Xian-Sheng},
  booktitle = {ACMMM},
  year      = {2022}
}
```