Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution

Jinsu Yoo; Taehoon Kim; Sihaeng Lee; Seung Hwan Kim; Honglak Lee; Tae Hyun Kim

Recent vision transformers built on self-attention have achieved promising results on various computer vision tasks. In particular, a pure transformer-based image restoration architecture, relying on multi-task pre-training and a large number of trainable parameters, surpasses existing CNN-based methods. In this paper, we introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers to further improve SR results. Specifically, our architecture comprises transformer and convolution branches, and we substantially improve performance by mutually fusing the two branches so that each representation complements the other. Furthermore, we propose a cross-scale token attention module, which allows the transformer to efficiently exploit informative relationships among tokens across different scales. Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
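The abstract does not spell out how the cross-scale token attention module is computed, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that coarser-scale tokens are built by averaging adjacent fine tokens, that fine tokens act as queries while coarse tokens act as keys and values, and that no learned projections are used. The names `pool_tokens` and `cross_scale_attention` are hypothetical and chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def pool_tokens(tokens, group):
    """Assumption: build coarser-scale tokens by averaging each run of
    `group` consecutive fine tokens (the paper may do this differently)."""
    pooled = []
    for start in range(0, len(tokens), group):
        chunk = tokens[start:start + group]
        dim = len(chunk[0])
        pooled.append([sum(t[d] for t in chunk) / len(chunk) for d in range(dim)])
    return pooled


def cross_scale_attention(fine, coarse):
    """Each fine token (query) attends over all coarse tokens (keys/values).
    Identity projections are used here purely to keep the sketch short."""
    scale = 1.0 / math.sqrt(len(fine[0]))
    fused = []
    for query in fine:
        weights = softmax([dot(query, key) * scale for key in coarse])
        attended = [
            sum(w * value[d] for w, value in zip(weights, coarse))
            for d in range(len(query))
        ]
        # Residual connection: keep the fine token and add cross-scale context.
        fused.append([q + a for q, a in zip(query, attended)])
    return fused


# Toy example: four 2-dimensional tokens, pooled in pairs.
fine_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
coarse_tokens = pool_tokens(fine_tokens, group=2)
fused_tokens = cross_scale_attention(fine_tokens, coarse_tokens)
```

In this sketch the number of keys shrinks with the pooling factor, which illustrates one way attention across scales can stay cheaper than full attention among all fine tokens. Whether the proposed module obtains its efficiency this way is not stated in the abstract.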

updated: Wed Mar 16 2022 11:52:47 GMT+0000 (UTC)

published: Tue Mar 15 2022 06:52:25 GMT+0000 (UTC)

arXiv
