TransFuse: A Unified Transformer-based Image Fusion Framework using Self-supervised Learning

Linhao Qu; Shaolei Liu; Manning Wang; Shiman Li; Siqi Yin; Qin Qiao; Zhijian Song

Image fusion is a technique that integrates complementary information from multiple source images to improve the richness of a single image. Because task-specific training data and the corresponding ground truth are scarce, most existing end-to-end image fusion methods are prone to overfitting or require tedious parameter optimization. Two-stage methods avoid the need for large amounts of task-specific training data by training an encoder-decoder network on large natural image datasets and using the extracted features for fusion, but the domain gap between natural images and the different fusion tasks limits their performance. In this study, we design a novel encoder-decoder based image fusion framework and propose a destruction-reconstruction based self-supervised training scheme that encourages the network to learn task-specific features. Specifically, we propose three destruction-reconstruction self-supervised auxiliary tasks for multi-modal image fusion, multi-exposure image fusion and multi-focus image fusion, based on pixel intensity non-linear transformation, brightness transformation and noise transformation, respectively. To encourage the different fusion tasks to promote each other and to increase the generalizability of the trained network, we integrate the three self-supervised auxiliary tasks by randomly choosing one of them to destroy a natural image during model training. In addition, we design a new encoder that combines a CNN and a Transformer for feature extraction, so that the trained model can exploit both local and global information. Extensive experiments on multi-modal image fusion, multi-exposure image fusion and multi-focus image fusion tasks demonstrate that the proposed method achieves state-of-the-art performance in both subjective and objective evaluations. The code will be publicly available soon.
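The destruction step described in the abstract, in which one of three transformations is chosen at random to corrupt a natural image that the network then has to reconstruct, can be illustrated with a minimal NumPy sketch. The abstract does not specify the transformations, so everything concrete here is an assumption made for illustration: the gamma curve standing in for the pixel intensity non-linear transformation, the global scale factor for the brightness transformation, the additive Gaussian noise, all parameter ranges, the mean-squared-error loss, and the function names themselves. It is not the authors' implementation.

```python
import numpy as np

# All transforms below are hypothetical stand-ins; the abstract names the
# three transformation families but not their exact form or parameters.

def nonlinear_intensity(img, rng):
    """Assumed stand-in: a random gamma curve on intensities in [0, 1]."""
    gamma = rng.uniform(0.5, 2.0)
    return img ** gamma

def brightness_shift(img, rng):
    """Assumed stand-in: a random global brightness scale, clipped to [0, 1]."""
    scale = rng.uniform(0.3, 1.7)
    return np.clip(img * scale, 0.0, 1.0)

def add_noise(img, rng):
    """Assumed stand-in: additive Gaussian noise, clipped to [0, 1]."""
    noise = rng.normal(0.0, 0.1, size=img.shape)
    return np.clip(img + noise, 0.0, 1.0)

TRANSFORMS = (nonlinear_intensity, brightness_shift, add_noise)

def destroy(img, rng):
    """Pick one of the three transforms at random and apply it.

    Returns the corrupted image and the index of the chosen transform.
    """
    idx = int(rng.integers(len(TRANSFORMS)))
    return TRANSFORMS[idx](img, rng), idx

def reconstruction_loss(pred, target):
    """Mean squared error between a reconstruction and the clean image
    (an assumed loss; the abstract does not name one)."""
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
clean = rng.random((8, 8))            # toy grayscale image in [0, 1)
corrupted, task_id = destroy(clean, rng)
# A real encoder-decoder would map `corrupted` back towards `clean`;
# here the corrupted image itself is scored as a trivial baseline.
baseline = reconstruction_loss(corrupted, clean)
```

In a training loop, `destroy` would be called once per sample so that a single network sees all three corruption types, which is the integration of auxiliary tasks the abstract describes.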

updated: Wed Jan 19 2022 07:30:44 GMT+0000 (UTC)

published: Wed Jan 19 2022 07:30:44 GMT+0000 (UTC)

arXiv

