An Image Classifier Can Suffice For Video Understanding

Quanfu Fan; Chun-Fu Chen; Rameswar Panda

We propose a new perspective on video understanding by casting the video recognition problem as an image recognition task. We show that an image classifier alone can suffice for video understanding without temporal modeling. Our approach is simple and universal. It composes input frames into a super image and trains an image classifier on it to perform action recognition, in exactly the same way as classifying an image. We demonstrate the viability of this idea by showing strong and promising performance on four public datasets, Kinetics400, Something-Something (V2), MiT and Jester, using a recently developed vision transformer. We also experiment with the prevalent ResNet image classifiers in computer vision to further validate our idea. The results on Kinetics400 are comparable to some of the best-performing CNN approaches based on spatio-temporal modeling. Our code and models will be made available at https://github.com/IBM/sifar-pytorch.
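The idea of composing input frames into a super image can be illustrated with a minimal sketch. The abstract does not specify the layout, so the near-square grid, the row-major frame order, and the zero-padding of empty cells below are assumptions made for illustration, and the function name `make_super_image` is hypothetical rather than taken from the authors' repository.

```python
import math

import numpy as np


def make_super_image(frames: np.ndarray) -> np.ndarray:
    """Tile T frames of shape (T, H, W, C) into one (rows*H, cols*W, C) image.

    Assumptions (not from the abstract): near-square grid, row-major
    frame order, unused grid cells filled with zeros.
    """
    t, h, w, c = frames.shape
    cols = math.ceil(math.sqrt(t))
    rows = math.ceil(t / cols)

    # Pad with blank frames so the clip fills the whole grid.
    pad = rows * cols - t
    if pad:
        blank = np.zeros((pad, h, w, c), dtype=frames.dtype)
        frames = np.concatenate([frames, blank], axis=0)

    # (rows, cols, H, W, C) -> (rows, H, cols, W, C) -> (rows*H, cols*W, C)
    grid = frames.reshape(rows, cols, h, w, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * h, cols * w, c)


# Example: an 8-frame clip of 224x224 RGB frames becomes a 3x3 grid.
clip = np.random.rand(8, 224, 224, 3).astype(np.float32)
super_image = make_super_image(clip)
print(super_image.shape)  # (672, 672, 3)
```

The resulting array is an ordinary image, so under these assumptions it could be passed to any image classifier (a vision transformer or a ResNet) whose input size matches the grid.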

updated: Sat Jun 26 2021 22:28:30 GMT+0000 (UTC)

published: Sat Jun 26 2021 22:28:30 GMT+0000 (UTC)

arXiv
