Diverse Image Inpainting with Bidirectional and Autoregressive Transformers

Yingchen Yu; Fangneng Zhan; Rongliang Wu; Jianxiong Pan; Kaiwen Cui; Shijian Lu; Feiying Ma; Xuansong Xie; Chunyan Miao

Image inpainting is an underdetermined inverse problem: many different contents can fill the missing or corrupted regions reasonably and realistically. Prevalent approaches based on convolutional neural networks (CNNs) can synthesize visually pleasing content, but CNNs have limited receptive fields for capturing global features. With image-level attention, transformers can model long-range dependencies and generate diverse content through autoregressive modeling of pixel-sequence distributions. However, the unidirectional attention in transformers is suboptimal because corrupted regions can have arbitrary shapes, with context coming from arbitrary directions. We propose BAT-Fill, an image inpainting framework with a novel bidirectional autoregressive transformer (BAT) that models deep bidirectional contexts for autoregressive generation of diverse inpainting contents. BAT-Fill inherits the merits of transformers and CNNs in a two-stage manner, which allows it to generate high-resolution content without being constrained by the quadratic complexity of attention in transformers. Specifically, it first generates pluralistic low-resolution image structures by adapting transformers, and then synthesizes realistic high-resolution texture details with a CNN-based up-sampling network. Extensive experiments over multiple datasets show that BAT-Fill achieves superior diversity and fidelity in image inpainting, both qualitatively and quantitatively.
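Two claims in the abstract can be made concrete with a little arithmetic: that attention cost grows quadratically with the number of tokens, and that unidirectional attention can leave a missing pixel with little or no known context. The sketch below is only an illustration under stated assumptions, not the BAT-Fill implementation: it assumes one token per pixel, a raster-scan ordering, and hypothetical resolutions of 32x32 and 256x256 that are not taken from the paper. The function names `attention_pairs` and `visible_context` are made up for this example.

```python
def attention_pairs(height, width):
    """Number of query-key pairs in full self-attention, assuming one token per pixel."""
    tokens = height * width
    return tokens * tokens


def visible_context(known, position, bidirectional):
    """Count the known (uncorrupted) tokens that a query at `position` can attend to.

    `known` is a flat, raster-scan list of booleans (True means the pixel is known).
    Unidirectional attention sees only earlier positions; bidirectional sees all positions.
    """
    candidates = range(len(known)) if bidirectional else range(position)
    return sum(1 for i in candidates if i != position and known[i])


# Hypothetical resolutions, chosen only to show the scaling.
low = attention_pairs(32, 32)
high = attention_pairs(256, 256)
print(low, high, high // low)

# Hypothetical 4x4 image in raster-scan order with the top-left 2x2 block missing.
known = [not (row < 2 and col < 2) for row in range(4) for col in range(4)]
uni = visible_context(known, 0, bidirectional=False)
bi = visible_context(known, 0, bidirectional=True)
print(uni, bi)
```

Under these assumptions, going from 32x32 to 256x256 multiplies the token count by 64 and the number of attention pairs by 4096, which is the kind of growth a low-resolution transformer stage followed by a CNN up-sampler avoids. In the toy mask, the first missing pixel has no known context under unidirectional attention, while bidirectional attention exposes all 12 known pixels.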

updated: Mon Apr 26 2021 03:52:27 GMT+0000 (UTC)

published: Mon Apr 26 2021 03:52:27 GMT+0000 (UTC)

arXiv
