NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion

Chenfei Wu; Jian Liang; Lei Ji; Fan Yang; Yuejian Fang; Daxin Jiang; Nan Duan

This paper presents NÜWA, a unified multimodal pre-trained model that can generate new visual data or manipulate existing visual data (i.e., images and videos) for various visual synthesis tasks. To cover language, image, and video at the same time for different scenarios, a 3D transformer encoder-decoder framework is designed, which can not only deal with videos as 3D data but also adapt to texts and images as 1D and 2D data, respectively. A 3D Nearby Attention (3DNA) mechanism is also proposed to account for the nature of visual data and to reduce computational complexity. We evaluate NÜWA on 8 downstream tasks. Compared to several strong baselines, NÜWA achieves state-of-the-art results on text-to-image generation, text-to-video generation, video prediction, and other tasks. Furthermore, it shows surprisingly good zero-shot capabilities on text-guided image and video manipulation tasks. The project repository is at https://github.com/microsoft/NUWA.
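
To give a rough sense of how attention restricted to nearby positions can reduce cost on 3D data, here is a minimal NumPy sketch. It is an illustration under assumptions, not the authors' 3DNA implementation: the function name `nearby_attention_3d`, the `extent` parameter, the border clipping, and the single-head self-attention setup are all hypothetical choices made for this example.

```python
import numpy as np

def nearby_attention_3d(q, k, v, extent=(1, 1, 1)):
    """Local attention over a (T, H, W, D) grid (hypothetical sketch).

    Each query at (t, h, w) attends only to keys inside a window that
    reaches `extent` steps in each direction, clipped at the borders.
    """
    T, H, W, D = q.shape
    et, eh, ew = extent
    out = np.zeros(v.shape, dtype=float)
    for t in range(T):
        for h in range(H):
            for w in range(W):
                # Window bounds, clipped to the grid.
                t0, t1 = max(0, t - et), min(T, t + et + 1)
                h0, h1 = max(0, h - eh), min(H, h + eh + 1)
                w0, w1 = max(0, w - ew), min(W, w + ew + 1)
                keys = k[t0:t1, h0:h1, w0:w1].reshape(-1, D)
                vals = v[t0:t1, h0:h1, w0:w1].reshape(-1, D)
                # Scaled dot-product scores with a stable softmax.
                scores = keys @ q[t, h, w] / np.sqrt(D)
                scores = scores - scores.max()
                weights = np.exp(scores)
                weights = weights / weights.sum()
                out[t, h, w] = weights @ vals
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 3, 3, 4))
k = rng.normal(size=(2, 3, 3, 4))
v = rng.normal(size=(2, 3, 3, 4))
local = nearby_attention_3d(q, k, v, extent=(1, 1, 1))
```

With a fixed window, each query scores at most `(2*et+1) * (2*eh+1) * (2*ew+1)` keys, so the total work grows linearly with the number of grid positions instead of quadratically as in full attention.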

updated: Wed Nov 24 2021 11:02:12 GMT+0000 (UTC)

published: Wed Nov 24 2021 11:02:12 GMT+0000 (UTC)

arXiv
