MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video

David Junhao Zhang; Kunchang Li; Yunpeng Chen; Yali Wang; Shashwat Chandra; Yu Qiao; Luoqi Liu; Mike Zheng Shou

Self-attention has become an integral component of recent network architectures, e.g., the Transformer, that dominate major image and video benchmarks. This is because self-attention can flexibly model long-range information. For the same reason, researchers have recently attempted to revive the Multi-Layer Perceptron (MLP) and have proposed a few MLP-Like architectures that show great potential. However, current MLP-Like architectures are not good at capturing local details and lack a progressive understanding of core details in images and videos. To overcome this issue, we propose a novel MorphMLP architecture that focuses on capturing local details in the low-level layers and gradually shifts its focus to long-term modeling in the high-level layers. Specifically, we design a Fully-Connected-Like layer, dubbed MorphFC, consisting of two morphable filters that gradually grow their receptive fields along the height and width dimensions. More interestingly, we propose to flexibly adapt our MorphFC layer to the video domain. To the best of our knowledge, we are the first to create an MLP-Like backbone for learning video representations. Finally, we conduct extensive experiments on image classification, semantic segmentation, and video classification. Our MorphMLP, a self-attention-free backbone, can be as powerful as, and even outperform, self-attention-based models.
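The abstract does not give the exact formulation of MorphFC, so the following is only a minimal sketch of the general idea it describes: a fully-connected map applied within non-overlapping chunks along the width and along the height, with the chunk length controlling the receptive field. The single-channel simplification, the summation of the two directions, and the names `chunk_fc_1d`, `morph_fc_sketch`, and `receptive_field` are all assumptions made for illustration and are not taken from the paper.

```python
def chunk_fc_1d(row, weight):
    """Apply one shared L x L fully connected map to each non-overlapping chunk of `row`."""
    chunk_len = len(weight)
    assert len(row) % chunk_len == 0, "row length must be divisible by the chunk length"
    out = []
    for start in range(0, len(row), chunk_len):
        chunk = row[start:start + chunk_len]
        # Every output in a chunk is a weighted sum of all inputs in that same chunk.
        out.extend(sum(w * x for w, x in zip(w_row, chunk)) for w_row in weight)
    return out


def morph_fc_sketch(image, w_horizontal, w_vertical):
    """Hypothetical single-channel layer: width-wise plus height-wise chunked FC, no bias."""
    horiz = [chunk_fc_1d(r, w_horizontal) for r in image]
    # Transpose so columns become rows, apply the vertical filter, then transpose back.
    cols = [list(c) for c in zip(*image)]
    vert_t = [chunk_fc_1d(c, w_vertical) for c in cols]
    vert = [list(r) for r in zip(*vert_t)]
    return [[h + v for h, v in zip(hr, vr)] for hr, vr in zip(horiz, vert)]


def receptive_field(size, chunk_len, pos=(0, 0)):
    """Count the input pixels that influence output pixel `pos`, found by probing."""
    w = [[1.0] * chunk_len for _ in range(chunk_len)]  # all-ones weights (assumption)
    r, c = pos
    count = 0
    for i in range(size):
        for j in range(size):
            probe = [[0.0] * size for _ in range(size)]
            probe[i][j] = 1.0
            if morph_fc_sketch(probe, w, w)[r][c] != 0.0:
                count += 1
    return count


if __name__ == "__main__":
    # Larger chunk lengths stand in for deeper layers (illustrative values only).
    for chunk_len in (2, 4, 8):
        print(chunk_len, receptive_field(8, chunk_len))
```

In this toy setting, an output pixel sees one horizontal chunk and one vertical chunk, so its receptive field covers 2L - 1 input pixels for chunk length L. Increasing L from early to late layers is one simple way to move from local detail to long-range context without self-attention.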

updated: Wed Nov 24 2021 14:52:20 GMT+0000 (UTC)

published: Wed Nov 24 2021 14:52:20 GMT+0000 (UTC)

arXiv
