TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up

Yifan Jiang; Shiyu Chang; Zhangyang Wang

The recent explosive interest in transformers suggests their potential to become powerful "universal" models for computer vision tasks such as classification, detection, and segmentation. While those attempts mainly study discriminative models, we explore transformers on some notoriously more difficult vision tasks, e.g., generative adversarial networks (GANs). Our goal is to conduct the first pilot study in building a GAN completely free of convolutions, using only pure transformer-based architectures. Our vanilla GAN architecture, dubbed TransGAN, consists of a memory-friendly transformer-based generator that progressively increases feature resolution and, correspondingly, a multi-scale discriminator that simultaneously captures semantic contexts and low-level textures. On top of these, we introduce a new grid self-attention module that further alleviates the memory bottleneck, so that TransGAN can scale up to high-resolution generation. We also develop a unique training recipe comprising a series of techniques that mitigate the training instability of TransGAN, such as data augmentation, modified normalization, and relative position encoding. Our best architecture achieves highly competitive performance compared to current state-of-the-art GANs using convolutional backbones. Specifically, TransGAN sets a new state-of-the-art Inception Score of 10.43 and FID of 18.28 on STL-10, outperforming StyleGAN-V2. On higher-resolution (e.g., 256 × 256) generation tasks, such as CelebA-HQ and LSUN-Church, TransGAN continues to produce diverse visual examples with high fidelity and impressive texture details. In addition, we dive deep into transformer-based generation models to understand how their behaviors differ from convolutional ones, by visualizing training dynamics. The code is available at https://github.com/VITA-Group/TransGAN.
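The abstract names grid self-attention as the memory-saving module but does not spell out its mechanics. The sketch below is one plausible reading, not the authors' implementation: it assumes the feature map is split into non-overlapping g × g grids and that attention is computed only among tokens inside the same grid. For brevity it uses a single head and identity query/key/value projections (no learned weights); all function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grid_partition(x, g):
    """Split an (H, W, C) map into non-overlapping g x g grids.

    Returns an array of shape (num_grids, g*g, C).
    """
    H, W, C = x.shape
    assert H % g == 0 and W % g == 0, "grid size must divide H and W"
    x = x.reshape(H // g, g, W // g, g, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/g, W/g, g, g, C)
    return x.reshape(-1, g * g, C)

def grid_merge(tokens, H, W, g):
    """Inverse of grid_partition: (num_grids, g*g, C) -> (H, W, C)."""
    C = tokens.shape[-1]
    x = tokens.reshape(H // g, W // g, g, g, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/g, g, W/g, g, C)
    return x.reshape(H, W, C)

def grid_self_attention(x, g):
    """Scaled dot-product attention restricted to each g x g grid.

    Assumption: identity Q/K/V projections and one head, for illustration.
    """
    H, W, C = x.shape
    t = grid_partition(x, g)                         # (N, g*g, C)
    scores = t @ t.transpose(0, 2, 1) / np.sqrt(C)   # (N, g*g, g*g)
    out = softmax(scores, axis=-1) @ t               # (N, g*g, C)
    return grid_merge(out, H, W, g)

def attention_entries(H, W, g=None):
    """Number of attention-matrix entries: full (g=None) vs. per-grid."""
    n = H * W
    if g is None:
        return n * n
    return (n // (g * g)) * (g * g) ** 2
```

Under these assumptions the attention matrix shrinks from (H·W)² entries to (H·W)·g² entries, so the saving grows with resolution for a fixed grid size: `attention_entries(H, W) / attention_entries(H, W, g)` equals `H * W / g**2`.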

updated: Mon Jun 14 2021 03:19:23 GMT+0000 (UTC)

published: Sun Feb 14 2021 05:24:48 GMT+0000 (UTC)

arXiv
