MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning

David Junhao Zhang; Kunchang Li; Yali Wang; Yunpeng Chen; Shashwat Chandra; Yu Qiao; Luoqi Liu; Mike Zheng Shou

MLP-like networks have recently been revived for image recognition. However, whether a generic MLP-like architecture can be built for the video domain has not been explored, because spatial-temporal modeling is complex and computationally heavy. To fill this gap, we present MorphMLP, an efficient self-attention-free backbone that flexibly leverages the concise Fully-Connected (FC) layer for video representation learning. Specifically, a MorphMLP block consists of two key layers in sequence, MorphFC_s and MorphFC_t, for spatial and temporal modeling respectively. MorphFC_s effectively captures the core semantics in each frame through progressive token interaction along both the height and width dimensions. MorphFC_t, in turn, adaptively learns long-term dependencies across frames through temporal token aggregation at each spatial location. With this multi-dimension and multi-scale factorization, the MorphMLP block achieves a favorable accuracy-computation balance. We evaluate MorphMLP on a number of popular video benchmarks. Compared with recent state-of-the-art models, MorphMLP significantly reduces computation while improving accuracy. For example, MorphMLP-S uses only 50% of the GFLOPs of VideoSwin-T yet achieves a 0.9% top-1 improvement on Kinetics400 under ImageNet1K pretraining. MorphMLP-B uses only 43% of the GFLOPs of MViT-B yet achieves a 2.4% top-1 improvement on SSV2, even though MorphMLP-B is pretrained on ImageNet1K while MViT-B is pretrained on Kinetics400. Moreover, our method, adapted to the image domain, outperforms previous SOTA MLP-like architectures. Code is available at https://github.com/MTLab/MorphMLP.
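To make the spatial/temporal factorization described in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation (see the linked repository for that). It assumes that "token interaction" along width can be illustrated by one shared FC layer applied to fixed-length chunks of tokens in each row, and that "temporal token aggregation on each spatial location" can be illustrated by one FC layer applied to the features of all frames at a single (h, w) position. The function names `chunk_fc_width` and `temporal_fc`, the parameter `chunk_len`, and the tensor layouts are hypothetical choices made for this illustration only.

```python
import numpy as np


def chunk_fc_width(frame, weight, chunk_len):
    """Illustrative spatial mixing along the width axis (hypothetical helper).

    frame:  (H, W, C) features of a single frame.
    weight: (chunk_len * C, chunk_len * C) FC weight shared by all chunks.
    Each row is split into chunks of `chunk_len` tokens; the tokens in a
    chunk are flattened into one vector and mixed by the FC layer.
    """
    H, W, C = frame.shape
    assert W % chunk_len == 0, "sketch assumes W is divisible by chunk_len"
    chunks = frame.reshape(H, W // chunk_len, chunk_len * C)
    mixed = chunks @ weight
    return mixed.reshape(H, W, C)


def temporal_fc(video, weight):
    """Illustrative temporal mixing at each spatial location (hypothetical helper).

    video:  (T, H, W, C) features of T frames.
    weight: (T * C, T * C) FC weight shared by all spatial locations.
    For every (h, w), the features of all T frames are concatenated into
    one vector and mixed by the FC layer; no spatial mixing happens here.
    """
    T, H, W, C = video.shape
    tokens = video.transpose(1, 2, 0, 3).reshape(H * W, T * C)
    mixed = tokens @ weight
    return mixed.reshape(H, W, T, C).transpose(2, 0, 1, 3)


def morph_block_sketch(video, w_spatial, w_temporal, chunk_len):
    """Apply the spatial step to every frame, then the temporal step."""
    spatial = np.stack([chunk_fc_width(f, w_spatial, chunk_len) for f in video])
    return temporal_fc(spatial, w_temporal)
```

In this sketch a larger `chunk_len` lets each FC layer see a longer span of tokens along the width, which is one way to read "progressive" interaction; how chunk sizes, channel grouping, the height-direction counterpart, residual connections, and normalization are actually configured is not stated in the abstract and is left out here.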

updated: Tue Aug 23 2022 12:05:19 GMT+0000 (UTC)

published: Wed Nov 24 2021 14:52:20 GMT+0000 (UTC)

arXiv
