MAXIM: Multi-Axis MLP for Image Processing

Zhengzhong Tu; Hossein Talebi; Han Zhang; Feng Yang; Peyman Milanfar; Alan Bovik; Yinxiao Li

Recent progress on Transformers and multi-layer perceptron (MLP) models provides new network architectural designs for computer vision tasks. Although these models have proved effective in many vision tasks such as image recognition, challenges remain in adapting them to low-level vision. The inflexibility to support high-resolution images and the limitations of local attention are perhaps the main bottlenecks for using Transformers and MLPs in image restoration. In this work we present a multi-axis MLP-based architecture, called MAXIM, that can serve as an efficient and flexible general-purpose vision backbone for image processing tasks. MAXIM uses a UNet-shaped hierarchical structure and supports long-range interactions enabled by spatially-gated MLPs. Specifically, MAXIM contains two MLP-based building blocks: a multi-axis gated MLP that allows for efficient and scalable spatial mixing of local and global visual cues, and a cross-gating block, an alternative to cross-attention, which accounts for cross-feature mutual conditioning. Both of these modules are based exclusively on MLPs, yet they are also both global and 'fully-convolutional', two properties that are desirable for image processing. Our extensive experimental results show that the proposed MAXIM model achieves state-of-the-art performance on more than ten benchmarks across a range of image processing tasks, including denoising, deblurring, deraining, dehazing, and enhancement, while requiring fewer or comparable numbers of parameters and FLOPs compared with competitive models.
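
The abstract does not spell out how mixing along multiple axes can be both global and 'fully-convolutional', so the following is only a rough sketch under stated assumptions, not the authors' implementation. It assumes a single-channel image stored as nested lists, a local axis made of non-overlapping b x b windows, a global axis made of pixels taken at a fixed stride across a g x g grid, and a simplified gate that multiplies each pixel by its mixed value. The function names `block_groups`, `grid_groups`, and `gated_mix` are hypothetical.

```python
def block_groups(h, w, b):
    """Local axis (assumed): pixel coordinates of each non-overlapping b x b window.
    Assumes h and w are divisible by b."""
    return [
        [(r0 + dr, c0 + dc) for dr in range(b) for dc in range(b)]
        for r0 in range(0, h, b)
        for c0 in range(0, w, b)
    ]


def grid_groups(h, w, g):
    """Global axis (assumed): for each offset inside a grid cell, collect the
    g x g pixels found at that offset in every cell. Assumes h and w are
    divisible by g."""
    sh, sw = h // g, w // g  # stride between members of one group
    return [
        [(u + i * sh, v + j * sw) for i in range(g) for j in range(g)]
        for u in range(sh)
        for v in range(sw)
    ]


def gated_mix(image, groups, weights):
    """Mix the n pixels of every group with one shared n x n weight matrix,
    then gate: each pixel is multiplied by its mixed value (a single-channel
    simplification of a spatial gating unit)."""
    out = [row[:] for row in image]
    for group in groups:
        values = [image[r][c] for r, c in group]
        for k, (r, c) in enumerate(group):
            mixed = sum(wk * x for wk, x in zip(weights[k], values))
            out[r][c] = image[r][c] * mixed
    return out


# One 4 x 4 weight matrix (n = 2 * 2 pixels per group), reused at two resolutions.
shared_weights = [[0.25] * 4 for _ in range(4)]

small = [[1.0] * 4 for _ in range(4)]
large = [[1.0] * 8 for _ in range(8)]

small_local = gated_mix(small, block_groups(4, 4, 2), shared_weights)
large_global = gated_mix(large, grid_groups(8, 8, 2), shared_weights)
```

In this sketch the weight matrix depends only on the window or grid size, never on the image size, which is one way a purely MLP-based mixer can run on inputs of different resolutions. The grid groups span the whole image, so each pixel is mixed with distant pixels even though every group has a fixed size.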

updated: Sun Jan 09 2022 09:59:32 GMT+0000 (UTC)

published: Sun Jan 09 2022 09:59:32 GMT+0000 (UTC)

arXiv
