GridVLP Multimodal Text-Image Similarity - Chinese - General Domain - base

Model Description

This model combines a BERT text encoder with a ResNet50 image encoder, using StructBERT as the pre-trained text backbone, and is trained with a vision-language pre-training objective. See the papers below for details.
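The description above amounts to a single-stream fusion of text-token embeddings and CNN grid features. Below is a minimal sketch of that idea, not the released implementation: the backbone name ('bert-base-chinese' standing in for StructBERT), the fusion depth, the class count, and the [CLS]-based classification head are all illustrative assumptions.

# Minimal sketch of BERT + ResNet50 grid-feature fusion (illustrative, not the released code).
import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import BertModel, BertTokenizer

class GridVlpSketch(nn.Module):
    def __init__(self, hidden_size=768, num_classes=1000):
        super().__init__()
        # Text backbone: the released model uses StructBERT; a vanilla Chinese BERT stands in here.
        self.text_encoder = BertModel.from_pretrained('bert-base-chinese')
        # Image backbone: ResNet50 without pooling/classifier, so the last stage
        # yields a 7x7 grid of 2048-d features for a 224x224 input.
        cnn = resnet50(weights='IMAGENET1K_V1')
        self.image_encoder = nn.Sequential(*list(cnn.children())[:-2])
        self.image_proj = nn.Linear(2048, hidden_size)
        # Joint encoder over the concatenated sequence [text tokens; grid tokens].
        fusion_layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=12, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=2)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, input_ids, attention_mask, pixel_values):
        text_emb = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        grid = self.image_encoder(pixel_values)                       # (B, 2048, 7, 7)
        grid_emb = self.image_proj(grid.flatten(2).transpose(1, 2))   # (B, 49, hidden_size)
        fused = self.fusion(torch.cat([text_emb, grid_emb], dim=1))
        return self.classifier(fused[:, 0])                           # predict from the [CLS] position

# Toy forward pass: tokenize a product title (capped at 128 tokens) and pair it with a dummy image tensor.
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
enc = tokenizer('女装快干弹力轻型短裤', return_tensors='pt', truncation=True, max_length=128)
logits = GridVlpSketch()(enc['input_ids'], enc['attention_mask'], torch.randn(1, 3, 224, 224))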

Intended Use and Scope

The model takes a product title and a product image as input and outputs a category prediction. You can also try your own Chinese sentences. See the code example below for how to call it.

How to Use

Once ModelScope is installed, the multimodal product category prediction capability is ready to use. By default, a single input sentence should not exceed 128 tokens.

Note: if the text or the image URL is empty, the pipeline falls back from multimodal prediction to single-modality product category prediction (see the example after the code sample below).

Code Example

# Build the GridVLP classification pipeline from the ModelScope model ID.
from modelscope.pipelines.multi_modal.gridvlp_pipeline import GridVlpClassificationPipeline

pipeline = GridVlpClassificationPipeline('rgtjf1/multi-modal_gridvlp_classification_chinese-base-ecom-cate')

# Preprocess a product title together with a product image URL, then run inference.
inputs = pipeline.preprocess({
    'text': '女装快干弹力轻型短裤448575',
    'image_url': 'https://yejiabo-public.oss-cn-zhangjiakou.aliyuncs.com/alinlp/clothes.png'
})
outputs = pipeline.forward(inputs)
print(outputs)
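Per the note above, leaving one modality empty switches the pipeline to single-modality prediction. A hedged sketch, reusing the pipeline object built above; the empty-string convention is an assumption based on that note:

# Text-only prediction: pass an empty image_url so only the title is used (assumed fallback behaviour).
text_only = pipeline.preprocess({'text': '女装快干弹力轻型短裤448575', 'image_url': ''})
print(pipeline.forward(text_only))

# Image-only prediction: pass empty text so only the product image is used.
image_only = pipeline.preprocess({'text': '', 'image_url': 'https://yejiabo-public.oss-cn-zhangjiakou.aliyuncs.com/alinlp/clothes.png'})
print(pipeline.forward(image_only))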

Evaluation and Results

Model                Category Prediction      Brand Prediction        Same-Item Retrieval
                     Top1 Acc    Top5 Acc     Top1 Acc    Top5 Acc    Recall@1    Recall@100
ResNet50             58.6        79.5         37.1        57.6        -           -
StructBERT           70.2        81.2         43.3        62.3        39.2        81.3
Multimodal Fusion    73.3        86.6         45.1        63.0        -           -
GridVLP              82.7        94.4         75.4        85.4        46.7        89.5

Related Papers and Citation

If you find this model helpful, please consider citing the following papers:

@article{10.1145/3572833,
  author    = {Yan, Ming and Xu, Haiyang and Li, Chenliang and Tian, Junfeng and Bi, Bin and Wang, Wei and Xu, Xianzhe and Zhang, Ji and Huang, Songfang and Huang, Fei and Si, Luo and Jin, Rong},
  title     = {Achieving Human Parity on Visual Question Answering},
  year      = {2022},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  issn      = {1046-8188},
  url       = {https://doi.org/10.1145/3572833},
  doi       = {10.1145/3572833},
  abstract  = {The Visual Question Answering (VQA) task utilizes both visual image and language analysis to answer a textual question with respect to an image. It has been a popular research topic with an increasing number of real-world applications in the last decade. This paper introduces a novel hierarchical integration of vision and language AliceMind-MMU (ALIbaba’s Collection of Encoder-decoders from Machine IntelligeNce lab of Damo academy - MultiMedia Understanding), which leads to similar or even slightly better results than human being does on VQA. A hierarchical framework is designed to tackle the practical problems of VQA in a cascade manner including: (1) diverse visual semantics learning for comprehensive image content understanding; (2) enhanced multi-modal pre-training with modality adaptive attention; and (3) a knowledge-guided model integration with three specialized expert modules for the complex VQA task. Treating different types of visual questions with corresponding expertise needed plays an important role in boosting the performance of our VQA architecture up to the human level. An extensive set of experiments and analysis are conducted to demonstrate the effectiveness of the new research work.},
  note      = {Just Accepted},
  journal   = {ACM Trans. Inf. Syst.},
  month     = {dec},
  keywords  = {cross-modal interaction, visual reasoning, visual question answering, multi-modal pre-training, text and image content analysis}
}

@article{DBLP:journals/corr/abs-2108-09479,
  author    = {Ming Yan and
               Haiyang Xu and
               Chenliang Li and
               Bin Bi and
               Junfeng Tian and
               Min Gui and
               Wei Wang},
  title     = {Grid-VLP: Revisiting Grid Features for Vision-Language Pre-training},
  journal   = {CoRR},
  volume    = {abs/2108.09479},
  year      = {2021},
  url       = {https://arxiv.org/abs/2108.09479},
  eprinttype = {arXiv},
  eprint    = {2108.09479},
  timestamp = {Fri, 27 Aug 2021 15:02:29 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2108-09479.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}