UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning

Kunchang Li; Yali Wang; Peng Gao; Guanglu Song; Yu Liu; Hongsheng Li; Yu Qiao

Learning rich, multi-scale spatiotemporal semantics from high-dimensional videos is challenging because of the large local redundancy and complex global dependency between video frames. Recent advances in this area have been driven mainly by 3D convolutional neural networks and vision transformers. 3D convolution can efficiently aggregate local context and suppress local redundancy within a small 3D neighborhood, but its limited receptive field prevents it from capturing global dependency. Vision transformers, in contrast, can effectively capture long-range dependency through the self-attention mechanism, but they are limited in reducing local redundancy because they blindly compare the similarity of all tokens in each layer. Based on these observations, we propose a novel Unified transFormer (UniFormer) that seamlessly integrates the merits of 3D convolution and spatiotemporal self-attention in a concise transformer format and achieves a preferable balance between computation and accuracy. Unlike traditional transformers, our relation aggregator tackles both spatiotemporal redundancy and dependency by learning local token affinity in shallow layers and global token affinity in deep layers. We conduct extensive experiments on popular video benchmarks, e.g., Kinetics-400, Kinetics-600, and Something-Something V1&V2. With only ImageNet-1K pretraining, our UniFormer achieves 82.9%/84.8% top-1 accuracy on Kinetics-400/Kinetics-600 while requiring 10x fewer GFLOPs than other state-of-the-art methods. On Something-Something V1 and V2, our UniFormer achieves new state-of-the-art performance of 60.9% and 71.2% top-1 accuracy, respectively. Code is available at https://github.com/Sense-X/UniFormer.
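The contrast between local and global token affinity described in the abstract can be illustrated with a minimal sketch. This is not the UniFormer implementation: it assumes a 1D token sequence instead of 3D video tokens, uses a fixed list `window_weights` as a stand-in for a learnable local affinity, and omits projections, normalization, feed-forward layers, and residual connections. The function names `local_aggregate`, `global_aggregate`, and `hybrid_forward` are hypothetical and chosen only for this illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def local_aggregate(tokens, window_weights):
    """Local affinity (assumed form): each token mixes only the neighbours
    inside a small window. The weights depend on relative position alone,
    much like a convolution kernel; out-of-range neighbours are skipped."""
    radius = len(window_weights) // 2
    n, dim = len(tokens), len(tokens[0])
    out = []
    for i in range(n):
        acc = [0.0] * dim
        for offset, w in enumerate(window_weights):
            j = i + offset - radius
            if 0 <= j < n:
                for d in range(dim):
                    acc[d] += w * tokens[j][d]
        out.append(acc)
    return out


def global_aggregate(tokens):
    """Global affinity (assumed form): content-dependent softmax similarity
    between every pair of tokens, i.e. self-attention with identity
    query/key/value projections."""
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens))
                    for d in range(dim)])
    return out


def hybrid_forward(tokens, window_weights, n_local=2, n_global=2):
    """Apply local aggregation in the shallow layers and global aggregation
    in the deep layers. The layer counts are arbitrary illustration values."""
    for _ in range(n_local):
        tokens = local_aggregate(tokens, window_weights)
    for _ in range(n_global):
        tokens = global_aggregate(tokens)
    return tokens


if __name__ == "__main__":
    sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    print(hybrid_forward(sequence, [0.25, 0.5, 0.25]))
```

For a sequence of N tokens and a window of size k, the local step in this sketch touches on the order of N·k token pairs, whereas the global step compares all N² pairs, which is why restricting the early layers to local aggregation reduces computation.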

updated: Mon Jan 24 2022 04:40:46 GMT+0000 (UTC)

published: Wed Jan 12 2022 20:02:32 GMT+0000 (UTC)

arXiv

