Patch Shift Transformer (PST) extends the 2D Swin Transformer with temporal modeling, giving the network the ability to learn spatiotemporal video features while adding almost no extra parameters. Concretely, patches are shifted across neighboring frames, and self-attention is then computed within each frame separately, so spatiotemporal modeling of video is achieved at the computational cost of 2D self-attention. For details, please refer to the original paper.
Illustration of PatchShift:
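To make the shift operation concrete, here is a minimal PyTorch sketch of a temporal patch shift. The tensor layout `(B, T, N, C)`, the `stride` argument, and the every-4th-patch pattern are illustrative assumptions made for this sketch; the actual shift pattern designed in the paper differs.

```python
import torch

def temporal_patch_shift(x, stride=1):
    """Minimal sketch of temporal patch shift (illustrative, not the paper's exact pattern).

    x: (B, T, N, C) patch embeddings, where B is the batch size, T the number
    of frames, N the patches per frame, and C the channel dimension. A fixed
    subset of spatial patch positions is rolled along the temporal axis, so
    each frame ends up holding a mix of its own patches and patches from
    neighboring frames; per-frame 2D self-attention then attends across time.
    """
    out = x.clone()
    n = x.shape[2]
    # Hypothetical pattern chosen only for clarity: roll every 4th patch one
    # frame forward in time, and an interleaved set one frame backward.
    fwd = torch.arange(0, n, 4)
    bwd = torch.arange(2, n, 4)
    out[:, :, fwd] = torch.roll(x[:, :, fwd], shifts=stride, dims=1)
    out[:, :, bwd] = torch.roll(x[:, :, bwd], shifts=-stride, dims=1)
    return out

# Example: 8 frames of 7x7 = 49 patches with 96 channels.
x = torch.randn(2, 8, 49, 96)
y = temporal_patch_shift(x)  # same shape; patches mixed across frames
```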
How to use:
Scope of application:
Target scenarios:
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
# Create the pipeline
action_recognition_pipeline = pipeline(Tasks.action_recognition, 'damo/cv_pathshift_action-recognition')
# Run the pipeline; the input can be a local video path or a URL
result = action_recognition_pipeline('http://viapi-test.oss-cn-shanghai.aliyuncs.com/viapi-3.0domepic/facebody/RecognizeAction/RecognizeAction-video2.mp4')
print(f'action recognition result: {result}.')
Output:
{'labels': 'abseiling'}
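A local video file can be passed to the same pipeline; the path below is hypothetical and used only for illustration:

```python
# 'my_video.mp4' is a hypothetical local path used for illustration.
local_result = action_recognition_pipeline('my_video.mp4')
# The predicted class name is returned under the 'labels' key, as shown above.
print(local_result['labels'])
```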
Model performance on the Something-Something V1 & V2 and Kinetics-400 datasets:
| Dataset | Model | Top-1 (%) | Top-5 (%) |
|---|---|---|---|
| Sthv1 | PST-Tiny | 54.0 | 82.3 |
| Sthv1 | PST-Base | 58.3 | 83.9 |
| Sthv2 | PST-Tiny | 67.9 | 90.8 |
| Sthv2 | PST-Base | 69.8 | 93.0 |
| K400 | PST-Tiny | 78.6 | 93.5 |
| K400 | PST-Base | 82.5 | 95.6 |
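Top-1 and Top-5 in the table are the standard top-k accuracies. For reference, a minimal sketch of the metric; the `logits` and `labels` tensors here are hypothetical inputs, not tied to the released evaluation code:

```python
import torch

def topk_accuracy(logits, labels, k=5):
    # logits: (N, num_classes) class scores; labels: (N,) ground-truth ids.
    topk = logits.topk(k, dim=1).indices            # (N, k) predicted class ids
    correct = (topk == labels.unsqueeze(1)).any(dim=1)
    return correct.float().mean().item()

# Hypothetical example over 400 classes (as in Kinetics-400).
logits = torch.randn(4, 400)
labels = torch.tensor([3, 10, 7, 0])
print(topk_accuracy(logits, labels, k=1), topk_accuracy(logits, labels, k=5))
```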
For more details on model training and evaluation, please refer to the paper and the open-source code.
If you find this model helpful, please consider citing the following paper:
@inproceedings{xiang2022tps,
title={Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition},
author={Xiang, Wangmeng and Li, Chao and Wang, Biao and Wei, Xihan and Hua, Xian-Sheng and Zhang, Lei},
booktitle={Proceedings of the European Conference on Computer Vision},
year={2022}
}