An Image is Worth 16x16 Words, What is a Video Worth?

Gilad Sharir; Asaf Noy; Lihi Zelnik-Manor

Leading methods in the domain of action recognition try to distill information from both the spatial and temporal dimensions of an input video. Methods that reach State of the Art (SotA) accuracy usually use 3D convolution layers to abstract the temporal information from video frames. The use of such convolutions requires sampling short clips from the input video, where each clip is a collection of closely sampled frames. Since each short clip covers only a small fraction of an input video, multiple clips are sampled at inference in order to cover the whole temporal length of the video. This increases the computational load and is impractical for real-world applications. We address the computational bottleneck by significantly reducing the number of frames required for inference. Our approach relies on a temporal transformer that applies global attention over video frames, and thus better exploits the salient information in each frame. Therefore, our approach is highly input-efficient, and can achieve SotA results (on the Kinetics dataset) with a fraction of the data (frames per video), computation, and latency. Specifically, on Kinetics-400 we reach 78.8% top-1 accuracy with 30× fewer frames per video and 40× faster inference than the current leading method. Code is available at: https://github.com/Alibaba-MIIL/STAM
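
The two ideas in the abstract, sampling a few frames spread over the whole video and letting every sampled frame attend to every other one, can be illustrated with a minimal sketch. This is not the code from the STAM repository: the function names, the frame count of 16, the embedding size, and the random vectors standing in for per-frame features are all assumptions made for illustration, and the attention omits the learned projections a real transformer layer would have.

```python
import math
import random


def sample_frame_indices(num_video_frames, num_samples):
    """Pick one frame from the centre of each of num_samples equal segments."""
    segment = num_video_frames / num_samples
    return [
        min(int((i + 0.5) * segment), num_video_frames - 1)
        for i in range(num_samples)
    ]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def global_temporal_attention(frames):
    """Scaled dot-product self-attention across frames.

    Assumption: queries, keys and values are the raw frame embeddings;
    a trained model would apply learned linear projections first.
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    # Row i holds how strongly frame i attends to each frame in the video.
    weights = [softmax([dot(q, k) / scale for k in frames]) for q in frames]
    attended = [
        [sum(w * frame[d] for w, frame in zip(row, frames)) for d in range(dim)]
        for row in weights
    ]
    return attended, weights


# Hypothetical sizes: a 300-frame video, 16 sampled frames, 8-dim embeddings.
random.seed(0)
indices = sample_frame_indices(num_video_frames=300, num_samples=16)
# Stand-in for the per-frame features a spatial model would produce.
frames = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in indices]
attended, weights = global_temporal_attention(frames)
```

Each row of `weights` spans all sampled frames, so a single forward pass sees the full temporal extent of the video instead of many short clips.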

updated: Thu Mar 25 2021 15:25:17 GMT+0000 (UTC)

published: Thu Mar 25 2021 15:25:17 GMT+0000 (UTC)

arXiv
