Training Diffusion Models with Reinforcement Learning

Kevin Black; Michael Janner; Yilun Du; Ilya Kostrikov; Sergey Levine

Diffusion models are a class of flexible generative models trained with an approximation to the log-likelihood objective. However, most use cases of diffusion models are not concerned with likelihoods, but instead with downstream objectives such as human-perceived image quality or drug effectiveness. In this paper, we investigate reinforcement learning methods for directly optimizing diffusion models for such objectives. We describe how posing denoising as a multi-step decision-making problem enables a class of policy gradient algorithms, which we refer to as denoising diffusion policy optimization (DDPO), that are more effective than alternative reward-weighted likelihood approaches. Empirically, DDPO is able to adapt text-to-image diffusion models to objectives that are difficult to express via prompting, such as image compressibility, and those derived from human feedback, such as aesthetic quality. Finally, we show that DDPO can improve prompt-image alignment using feedback from a vision-language model without the need for additional data collection or human annotation.
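The idea of treating denoising as a multi-step decision-making problem can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a scalar "sample", a hypothetical one-parameter denoising policy p(x_{t-1} | x_t) = N(x_t + theta, sigma^2), a made-up reward that prefers final samples near a target value, and a generic REINFORCE-style (score-function) gradient estimator. All function names and hyperparameters are illustrative.

```python
import numpy as np

def sample_trajectories(theta, rng, batch=256, T=5, sigma=0.5):
    """Run T denoising steps; each step is one 'action' of the policy.

    Assumed toy policy: x_{t-1} ~ N(x_t + theta, sigma^2).
    Returns the final samples x_0 and, per trajectory, the sum over steps
    of d/dtheta log p(x_{t-1} | x_t).
    """
    x = rng.standard_normal(batch)          # x_T: pure noise
    score_sum = np.zeros(batch)
    for _ in range(T):
        x_next = x + theta + sigma * rng.standard_normal(batch)
        # Gradient of the Gaussian log-density with respect to theta.
        score_sum += (x_next - x - theta) / sigma**2
        x = x_next
    return x, score_sum

def reward(x0, target=2.0):
    """Hypothetical downstream objective, evaluated only on the final sample."""
    return -(x0 - target) ** 2

def train(steps=300, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    theta = 0.0
    for _ in range(steps):
        x0, score_sum = sample_trajectories(theta, rng)
        r = reward(x0)
        # Normalize rewards within the batch (a simple baseline).
        adv = (r - r.mean()) / (r.std() + 1e-8)
        # Score-function estimate of the gradient of expected reward.
        grad = np.mean(score_sum * adv)
        theta += lr * grad                   # gradient ascent on reward
    return theta

if __name__ == "__main__":
    theta = train()
    print("learned per-step drift:", theta)
```

In this toy setting the reward depends only on the final sample, yet every denoising step receives a learning signal through its log-probability term, which is the structure a policy gradient method exploits. A real text-to-image model would replace the scalar drift with a neural network and the quadratic reward with, for example, an aesthetic or compressibility score.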

updated: Mon May 22 2023 17:57:41 GMT+0000 (UTC)

published: Mon May 22 2023 17:57:41 GMT+0000 (UTC)

arXiv
