Exploring Discrete Diffusion Models for Image Captioning

Zixin Zhu; Yixuan Wei; Jianfeng Wang; Zhe Gan; Zheng Zhang; Le Wang; Gang Hua; Lijuan Wang; Zicheng Liu; Han Hu

Image captioning is typically performed by an auto-regressive method that decodes text tokens one by one. We present a diffusion-based captioning model, dubbed DDCap, to allow more decoding flexibility. Unlike image generation, where the output is continuous and redundant with a fixed length, the text in image captions is categorical and short, with varied lengths. Therefore, naively applying a discrete diffusion model to text decoding does not work well, as shown in our experiments. To address the performance gap, we propose several key techniques, including best-first inference, a concentrated attention mask, text length prediction, and image-free training. On COCO without additional caption pre-training, DDCap achieves a CIDEr score of 117.8, which is +5.0 higher than the auto-regressive baseline with the same architecture in the controlled setting. On a caption infilling task, it also achieves a CIDEr score +26.8 higher than the auto-regressive baseline (230.3 vs. 203.5). With 4M vision-language pre-training images and the base-sized model, we reach a CIDEr score of 125.1 on COCO, which is competitive with the best well-developed auto-regressive frameworks. The code is available at https://github.com/buxiangzhiren/DDCap.

updated: Mon Nov 21 2022 18:12:53 GMT+0000 (UTC)

published: Mon Nov 21 2022 18:12:53 GMT+0000 (UTC)

arXiv
