Can An Image Classifier Suffice For Action Recognition?

Quanfu Fan; Chun-Fu Chen; Rameswar Panda

We explore a new perspective on video understanding by casting the video recognition problem as an image recognition task. Our approach rearranges input video frames into super images, which allows an image classifier to be trained directly for action recognition, in exactly the same way as for image classification. With this simple idea, we show that transformer-based image classifiers alone can suffice for action recognition. In particular, our approach demonstrates strong and promising performance against SOTA methods on several public datasets, including Kinetics400, Moments In Time, Something-Something V2 (SSV2), Jester and Diving48. We also experiment with the prevalent ResNet image classifiers in computer vision to further validate our idea. The results on both Kinetics400 and SSV2 are comparable to some of the best-performing CNN approaches based on spatio-temporal modeling. Our source code and models are available at https://github.com/IBM/sifar-pytorch.
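The rearrangement of frames into a super image can be illustrated with a minimal NumPy sketch. It is not taken from the released code: the function name `make_super_image`, the row-major grid layout, and the zero-padding of unused grid cells are all assumptions made here for illustration, since the abstract does not specify how the frames are arranged.

```python
import numpy as np


def make_super_image(frames: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Tile T frames of shape (H, W, C) into one (rows*H, cols*W, C) image.

    Hypothetical helper: frames are placed in row-major order, and any
    unused grid cells are filled with zeros (an assumed padding choice).
    """
    t, h, w, c = frames.shape
    if t > rows * cols:
        raise ValueError("grid is too small for the number of frames")

    # Pad with blank frames so the clip fills the grid exactly.
    pad = rows * cols - t
    if pad:
        blank = np.zeros((pad, h, w, c), dtype=frames.dtype)
        frames = np.concatenate([frames, blank], axis=0)

    # (rows, cols, H, W, C) -> (rows, H, cols, W, C) -> (rows*H, cols*W, C)
    grid = frames.reshape(rows, cols, h, w, c)
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * h, cols * w, c)
```

Under these assumptions, an 8-frame clip placed on a 3 x 3 grid yields a single image three frames high and three frames wide, with the last cell left blank. That image can then be passed to any standard image classifier in place of an ordinary photograph.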

updated: Mon Apr 25 2022 18:34:03 GMT+0000 (UTC)

published: Sat Jun 26 2021 22:28:30 GMT+0000 (UTC)

arXiv
