SVT: Supertoken Video Transformer for Efficient Video Understanding

Chenbin Pan; Rui Hou; Hanchao Yu; Qifan Wang; Senem Velipasalar; Madian Khabsa

Whether by processing videos at a fixed resolution from start to end or by incorporating pooling and down-scaling strategies, existing video transformers process the whole video content throughout the network without specially handling the large portions of redundant information. In this paper, we present a Supertoken Video Transformer (SVT) that incorporates a Semantic Pooling Module (SPM) to aggregate latent representations along the depth of the visual transformer based on their semantics, and thus reduces the redundancy inherent in video inputs. Qualitative results show that our method can effectively reduce redundancy by merging latent representations with similar semantics and thus increase the proportion of salient information for downstream tasks. Quantitatively, our method improves the performance of both ViT and MViT while requiring significantly less computation on the Kinetics and Something-Something-V2 benchmarks. More specifically, with our SPM, we improve the accuracy of MAE-pretrained ViT-B and ViT-L by 1.5% with 33% fewer GFLOPs and by 0.2% with 55% fewer FLOPs, respectively, on the Kinetics-400 benchmark, and improve the accuracy of MViTv2-B by 0.2% and 0.3% with 22% fewer GFLOPs on Kinetics-400 and Something-Something-V2, respectively.
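The abstract does not spell out how the SPM merges tokens, so the following is only an illustrative sketch of the general idea of pooling latent tokens with similar semantics into fewer "supertokens". It assumes a simple k-means-style clustering on cosine similarity; the function name `semantic_pool`, its parameters, and the clustering rule are hypothetical and are not taken from the paper.

```python
import numpy as np

def semantic_pool(tokens, num_supertokens, iters=5):
    """Merge N tokens of shape (N, D) into K supertokens of shape (K, D).

    Hypothetical stand-in for semantic pooling: tokens are grouped by
    cosine similarity and each group is replaced by its mean.
    """
    tokens = np.asarray(tokens, dtype=float)
    n, _ = tokens.shape
    # Deterministic initialization: evenly spaced tokens act as initial centers.
    idx = np.linspace(0, n - 1, num_supertokens).astype(int)
    centers = tokens[idx].copy()
    for _ in range(iters):
        # Cosine similarity between every token and every center.
        t = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
        c = centers / (np.linalg.norm(centers, axis=1, keepdims=True) + 1e-8)
        assign = (t @ c.T).argmax(axis=1)
        # Each center becomes the mean of the tokens assigned to it;
        # a center with no members keeps its previous value.
        for k in range(num_supertokens):
            members = tokens[assign == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return centers, assign

# Toy input: four 2-D tokens forming two obvious semantic groups.
demo_tokens = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
pooled, assign = semantic_pool(demo_tokens, num_supertokens=2)
print(pooled.shape, assign)
```

In this toy setting, four tokens are reduced to two, so any later layer would operate on half as many tokens. That is the sense in which merging semantically similar representations can cut computation, although the actual savings reported in the abstract depend on the authors' own design.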

updated: Sun Apr 23 2023 21:42:25 GMT+0000 (UTC)

published: Sat Apr 01 2023 14:31:56 GMT+0000 (UTC)

arXiv
