Reduce Information Loss in Transformers for Pluralistic Image Inpainting

Qiankun Liu; Zhentao Tan; Dongdong Chen; Qi Chu; Xiyang Dai; Yinpeng Chen; Mengchen Liu; Lu Yuan; Nenghai Yu

Transformers have recently achieved great success in pluralistic image inpainting. However, we find that existing transformer-based solutions regard each pixel as a token and thus suffer from information loss in two respects: 1) They downsample the input image to a much lower resolution for efficiency, incurring information loss and extra misalignment at the boundaries of masked regions. 2) They quantize the 256^3 possible RGB pixel values to a small number (such as 512) of quantized pixels, and the indices of the quantized pixels are used as tokens for both the inputs and the prediction targets of the transformer. Although an extra CNN is used to upsample and refine the low-resolution results, it is difficult to recover the lost information. To retain as much input information as possible, we propose a new transformer-based framework, "PUT". Specifically, to avoid input downsampling while maintaining computational efficiency, we design a patch-based auto-encoder, P-VQVAE, in which the encoder converts the masked image into non-overlapping patch tokens and the decoder recovers the masked regions from the inpainted tokens while keeping the unmasked regions unchanged. To eliminate the information loss caused by quantization, we apply an Un-Quantized Transformer (UQ-Transformer), which directly takes the features from the P-VQVAE encoder as input without quantization and regards the quantized tokens only as prediction targets. Extensive experiments show that PUT greatly outperforms state-of-the-art methods on image fidelity, especially for large masked regions and complex large-scale datasets.
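
The two ideas in the abstract, non-overlapping patch tokens and using quantized indices only as prediction targets, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `patchify` and `quantize`, the random codebook, the 32x32 image, and the patch size of 8 are all assumptions made for illustration, and flattened raw pixels stand in for the features that a learned P-VQVAE encoder would produce.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into non-overlapping, flattened patch vectors."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


def quantize(features, codebook):
    """Return the nearest codebook index for each feature and the quantized features."""
    # Squared Euclidean distance between every feature and every codebook entry.
    dist = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dist.argmin(axis=1)
    return indices, codebook[indices]


rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))          # hypothetical input image
features = patchify(image, patch=8)      # stand-in for encoder features
codebook = rng.random((512, features.shape[1]))  # hypothetical codebook with 512 entries

indices, quantized = quantize(features, codebook)

# Feeding `quantized` to the transformer would discard this much of the input.
quantization_error = float(np.mean((features - quantized) ** 2))

# UQ-Transformer idea as described in the abstract: inputs stay un-quantized,
# while the discrete indices serve only as prediction targets.
model_inputs = features
prediction_targets = indices
```

In this sketch the image yields 16 patch tokens rather than 1,024 pixel tokens, so the sequence stays short without downsampling, and `quantization_error` measures the information that would be lost if the quantized features were used as inputs instead of the un-quantized ones.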

updated: Tue May 10 2022 17:59:58 GMT+0000 (UTC)

published: Tue May 10 2022 17:59:58 GMT+0000 (UTC)

arXiv
