End-to-End Semantic Video Transformer for Zero-Shot Action Recognition

Keval Doshi; Yasin Yilmaz

While video action recognition has been an active area of research for several years, zero-shot action recognition has only recently started gaining traction. In this work, we propose a novel end-to-end trained transformer model that captures long-range spatiotemporal dependencies efficiently, in contrast to existing approaches that use 3D-CNNs. Moreover, to address a common ambiguity in existing works about which classes can be considered previously unseen, we propose a new experimentation setup that satisfies the zero-shot learning premise for action recognition by avoiding overlap between the training and testing classes. The proposed approach significantly outperforms the state of the art in zero-shot action recognition in terms of top-1 accuracy on the UCF-101, HMDB-51 and ActivityNet datasets. The code and proposed experimentation setup are available on GitHub: https://github.com/Secure-and-Intelligent-Systems-Lab/SemanticVideoTransformer
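As a rough illustration of the zero-shot premise described above, the following sketch drops any training class that also appears among the test classes. It is not the authors' procedure: the abstract does not say how overlap is detected, so matching on normalized class names is an assumption made here, and the function names and class labels are hypothetical.

```python
import re


def normalize(name: str) -> str:
    """Lowercase and drop non-alphanumerics so 'Push Up' and 'pushup' compare equal."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def remove_overlap(train_classes, test_classes):
    """Keep only training classes whose normalized name is not a test class.

    Assumption: overlap means an exact match after normalization. A real
    setup might also need to catch synonyms or near-duplicate actions.
    """
    test_keys = {normalize(c) for c in test_classes}
    return [c for c in train_classes if normalize(c) not in test_keys]


# Illustrative labels only, not the actual dataset class lists.
train = ["push up", "playing guitar", "archery", "surfing water"]
test = ["PushUp", "Archery", "Biking"]

kept = remove_overlap(train, test)
print(kept)
```

Exact name matching is the simplest possible criterion; it will miss classes that describe the same action under different names, which is why it should be read only as a starting point.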

updated: Fri Dec 02 2022 14:55:09 GMT+0000 (UTC)

published: Thu Mar 10 2022 05:03:58 GMT+0000 (UTC)

arXiv
