Introduction to an Open-Vocabulary Object Detection Model Based on Vision and Language Knowledge Distillation

Model description

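ViLD (Vision and Language knowledge Distillation) is an open-vocabulary object detection method that learns by distilling knowledge from an open-vocabulary image classification model. It is the first open-vocabulary detection method to be evaluated on the challenging LVIS dataset. On LVIS it reaches 16.1 APr, outperforming other supervised models at the same inference speed.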
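The description above does not spell out how an open-vocabulary classifier scores a region, so the following is only a minimal sketch of the general idea used by CLIP-style models: compare a region embedding with the text embedding of each category name and apply a softmax. The embeddings are made-up toy vectors, and `classify_region` and the `temperature` value are hypothetical choices for illustration, not the model's actual code or settings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_region(region_embedding, text_embeddings, temperature=0.01):
    """Softmax over cosine similarities between one region and each category text.

    `temperature` is an arbitrary illustrative value (assumption).
    """
    names = list(text_embeddings)
    logits = [
        cosine_similarity(region_embedding, text_embeddings[name]) / temperature
        for name in names
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(names, exps)}


# Toy embeddings, invented for this example only.
toy_text_embeddings = {
    'koala': [0.9, 0.1, 0.0],
    'car': [0.0, 0.2, 0.9],
    'pole': [0.1, 0.9, 0.1],
}
toy_region = [0.8, 0.2, 0.1]

probs = classify_region(toy_region, toy_text_embeddings)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

Because the categories enter only through their text embeddings, changing the list of names changes what can be detected without retraining, which is what makes the vocabulary open.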

Intended use and scope

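The model can be used to detect objects of arbitrary categories; the categories to look for are supplied as text at inference time.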

Dependencies

We recommend running the model on the official ModelScope image.
On top of that image, tensorflow>=2.9 must be installed. CPU inference is not supported yet.

pip install tensorflow==2.9.2 -i https://pypi.tuna.tsinghua.edu.cn/simple
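The version requirement can be checked before loading the pipeline. The sketch below is one way to do that with the standard library only; `parse_version`, `meets_minimum`, and `installed_tensorflow_version` are hypothetical helper names introduced here, and the numeric comparison is there because a plain string comparison would wrongly rank '2.10' below '2.9'.

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn '2.9.2' or '2.10.0rc1' into a tuple of leading integers, e.g. (2, 9, 2)."""
    parts = []
    for piece in text.split('.'):
        match = re.match(r'\d+', piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum='2.9'):
    """Compare versions numerically so that '2.10' counts as newer than '2.9'."""
    return parse_version(installed) >= parse_version(minimum)


def installed_tensorflow_version():
    """Return the installed tensorflow version string, or None if it is missing."""
    try:
        return metadata.version('tensorflow')
    except metadata.PackageNotFoundError:
        return None


version = installed_tensorflow_version()
if version is None or not meets_minimum(version):
    print('tensorflow>=2.9 is required; found:', version)
else:
    print('tensorflow', version, 'satisfies the requirement')
```

This check only covers the package version; it does not verify that a GPU is visible to TensorFlow.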

Code example

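import subprocess
import sys

# Install the pinned TensorFlow build with the current interpreter's pip
subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'tensorflow==2.9.2',
                       '-i', 'https://pypi.tuna.tsinghua.edu.cn/simple'])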


from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.outputs import OutputKeys

vild_pipeline = pipeline(Tasks.open_vocabulary_detection, model='damo/cv_resnet152_open-vocabulary-detection_vild')

image_path = 'https://modelscope.oss-cn-beijing.aliyuncs.com/test/images/image_open_vocabulary_detection.jpg'

# Category names are passed as a single string; separate multiple categories with ';'
category_names = ';'.join([
    'flipflop', 'street sign', 'bracelet', 'necklace', 'shorts',
    'floral camisole', 'orange shirt', 'purple dress', 'yellow tee',
    'green umbrella', 'pink striped umbrella', 'transparent umbrella',
    'plain pink umbrella', 'blue patterned umbrella', 'koala',
    'electric box', 'car', 'pole'
])

input_dict = {'img': image_path, 'category_names': category_names}

# Run detection and print the predicted bounding boxes
result = vild_pipeline(input_dict)
print(result[OutputKeys.BOXES])
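Two small helpers can make the example above easier to reuse: one builds the ';'-separated category string from a list, and one keeps only confident detections. This is a sketch under assumptions: it treats the result as a dict with parallel 'scores', 'labels', and 'boxes' entries, which is not confirmed by the example above (it only reads the boxes), and `build_category_string`, `filter_detections`, and the 0.5 threshold are hypothetical. The toy values are invented and are not real model output.

```python
def build_category_string(names):
    """Join category names with ';', dropping blanks and duplicates (order kept)."""
    cleaned = []
    for name in names:
        name = name.strip()
        if ';' in name:
            raise ValueError(f"category name must not contain ';': {name!r}")
        if name and name not in cleaned:
            cleaned.append(name)
    return ';'.join(cleaned)


def filter_detections(result, min_score=0.5):
    """Keep detections whose score is at least min_score, best first.

    Assumes `result` maps 'scores', 'labels' and 'boxes' to parallel sequences.
    """
    kept = []
    for score, label, box in zip(result['scores'], result['labels'], result['boxes']):
        if score >= min_score:
            kept.append({'label': label, 'score': float(score), 'box': list(box)})
    # Highest-confidence detections first
    kept.sort(key=lambda det: det['score'], reverse=True)
    return kept


# Made-up values for illustration only; not real model output.
toy_result = {
    'scores': [0.91, 0.12, 0.67],
    'labels': ['koala', 'pole', 'car'],
    'boxes': [[10, 20, 110, 220], [0, 0, 5, 50], [30, 40, 300, 200]],
}

print(build_category_string([' koala ', 'car', '', 'car']))
for det in filter_detections(toy_result, min_score=0.5):
    print(det['label'], det['score'], det['box'])
```

The box coordinates are passed through untouched, since the coordinate order used by the pipeline is not stated here.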

Model limitations and possible bias

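For some unusual objects, the predicted category may be inaccurate or the confidence score may be low. This is related to the data CLIP was trained on and to the LVIS base training data.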

Training data

This model was trained on the following open-source dataset:

LVIS

Evaluation and results

Method	Backbone	Distillation weight	APr	APc	APf	AP
ViLD-ensemble	ResNet-152	2.0	19.2	24.8	30.8	26.2