TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up

Yifan Jiang; Shiyu Chang; Zhangyang Wang

The recent explosive interest in transformers has suggested their potential to become powerful "universal" models for computer vision tasks such as classification, detection, and segmentation. While those attempts mainly study discriminative models, we explore transformers on a notoriously more difficult class of vision tasks: generative adversarial networks (GANs). Our goal is to conduct the first pilot study in building a GAN completely free of convolutions, using only pure transformer-based architectures. Our vanilla GAN architecture, dubbed TransGAN, consists of a memory-friendly transformer-based generator that progressively increases feature resolution and a corresponding multi-scale discriminator that simultaneously captures semantic context and low-level textures. On top of these, we introduce a new grid self-attention module that further alleviates the memory bottleneck, in order to scale TransGAN up to high-resolution generation. We also develop a unique training recipe comprising a series of techniques that mitigate the training instability of TransGAN, such as data augmentation, modified normalization, and relative position encoding. Our best architecture achieves highly competitive performance compared to current state-of-the-art GANs that use convolutional backbones. Specifically, TransGAN sets a new state-of-the-art Inception Score of 10.43 and FID of 18.28 on STL-10, outperforming StyleGAN-V2. On higher-resolution (e.g., 256 x 256) generation tasks, such as CelebA-HQ and LSUN-Church, TransGAN continues to produce diverse visual examples with high fidelity and impressive texture details. In addition, we dive deep into transformer-based generation models to understand how their behavior differs from that of convolutional ones, by visualizing training dynamics. The code is available at https://github.com/VITA-Group/TransGAN.
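To give a rough sense of why restricting self-attention to local grids eases the memory bottleneck, the sketch below counts attention-matrix entries for a feature map with one token per pixel. It assumes that grid self-attention partitions the feature map into non-overlapping square grids and computes attention only among tokens within the same grid; the abstract does not spell this out. The function names and the grid size of 16 are hypothetical, chosen only for illustration, and this is not the authors' implementation.

```python
def full_attention_entries(height, width):
    """Entries in one attention matrix when every token attends to every token."""
    tokens = height * width
    return tokens * tokens


def grid_attention_entries(height, width, grid):
    """Total attention entries when tokens attend only within their own grid.

    Assumption: the map is split into non-overlapping grid x grid cells.
    """
    if height % grid or width % grid:
        raise ValueError("grid size must divide both height and width")
    cells = (height // grid) * (width // grid)
    tokens_per_cell = grid * grid
    return cells * tokens_per_cell * tokens_per_cell


def grid_token_groups(height, width, grid):
    """Group flat (row-major) token indices by the grid cell they fall in."""
    if height % grid or width % grid:
        raise ValueError("grid size must divide both height and width")
    groups = []
    for top in range(0, height, grid):
        for left in range(0, width, grid):
            groups.append([
                (top + dy) * width + (left + dx)
                for dy in range(grid)
                for dx in range(grid)
            ])
    return groups


if __name__ == "__main__":
    full = full_attention_entries(256, 256)
    local = grid_attention_entries(256, 256, 16)  # grid size 16 is hypothetical
    print(full, local, full // local)
    print(grid_token_groups(4, 4, 2))
```

Under these assumptions, a 256 x 256 token map needs 4,294,967,296 attention entries with full self-attention and 16,777,216 with 16 x 16 grids, a 256-fold reduction per head and layer. The actual savings in TransGAN depend on the grid sizes and the stages at which the module is applied, neither of which is stated in the abstract.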

updated: Thu Dec 09 2021 04:30:20 GMT+0000 (UTC)

published: Sun Feb 14 2021 05:24:48 GMT+0000 (UTC)

arXiv
