ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process

Changyao Tian; Chenxin Tao; Jifeng Dai; Hao Li; Ziheng Li; Lewei Lu; Xiaogang Wang; Hongsheng Li; Gao Huang; Xizhou Zhu

ADDP: 交互ノイズ除去拡散プロセスを使用した画像認識および生成のための一般的な表現の学習

画像認識と生成は長い間、互いに独立して開発されてきました。最近の汎用表現学習の傾向に伴い、認識タスクと生成タスクの両方に対する一般表現の開発も促進されています。ただし、予備的な試みは主に生成パフォーマンスに焦点を当てていますが、認識タスクに関してはまだ劣っています。これらの方法はベクトル量子化 (VQ) 空間でモデル化されますが、主要な認識方法は入力としてピクセルを使用します。私たちの重要な洞察は 2 つあります。(1) 入力としてのピクセルは認識タスクにとって重要です。 (2) 再構成対象としての VQ トークンは、生成タスクに有益です。これらの観察は、単一の表現学習フレームワーク内でこれら 2 つの空間を統合する交互ノイズ除去拡散プロセス (ADDP) を提案する動機となっています。各ノイズ除去ステップでは、私たちの方法はまず以前の VQ トークンからピクセルをデコードし、次にデコードされたピクセルから新しい VQ トークンを生成します。拡散プロセスでは、VQ トークンの一部が徐々にマスクされてトレーニングサンプルが構築されます。学習された表現は、さまざまな高忠実度画像の生成に使用でき、認識タスクで優れた転送パフォーマンスを発揮することもできます。広範な実験により、私たちの手法が無条件生成、ImageNet 分類、COCO 検出、および ADE20k セグメンテーションにおいて競争力のあるパフォーマンスを達成することが示されています。重要なのは、私たちの方法は、生成タスクと高密度認識タスクの両方に適用できる一般表現の開発に初めて成功したことを表しています。コードは公開されます。

Image recognition and generation have long been developed independently of each other. With the recent trend towards general-purpose representation learning, the development of general representations for both recognition and generation tasks is also promoted. However, preliminary attempts mainly focus on generation performance, but are still inferior on recognition tasks. These methods are modeled in the vector-quantized (VQ) space, whereas leading recognition methods use pixels as inputs. Our key insights are twofold: (1) pixels as inputs are crucial for recognition tasks; (2) VQ tokens as reconstruction targets are beneficial for generation tasks. These observations motivate us to propose an Alternating Denoising Diffusion Process (ADDP) that integrates these two spaces within a single representation learning framework. In each denoising step, our method first decodes pixels from previous VQ tokens, then generates new VQ tokens from the decoded pixels. The diffusion process gradually masks out a portion of VQ tokens to construct the training samples. The learned representations can be used to generate diverse high-fidelity images and also demonstrate excellent transfer performance on recognition tasks. Extensive experiments show that our method achieves competitive performance on unconditional generation, ImageNet classification, COCO detection, and ADE20k segmentation. Importantly, our method represents the first successful development of general representations applicable to both generation and dense recognition tasks. Code shall be released.

updated: Thu Jun 08 2023 17:59:32 GMT+0000 (UTC)

published: Thu Jun 08 2023 17:59:32 GMT+0000 (UTC)

arXiv

参考文献 (このサイトで利用可能なもの) / References (only if available on this site)

被参照文献 (このサイトで利用可能なものを新しい順に) / Citations (only if available on this site, in order of most recent)

Amazon.co.jpアソシエイト