Text-Only Image Captioning with Multi-Context Data Generation

Feipeng Ma; Yizhou Zhou; Fengyun Rao; Yueyi Zhang; Xiaoyan Sun

Text-only Image Captioning (TIC) aims to build a model that can accurately describe images while being trained solely on text. Recently, diffusion models have demonstrated remarkable capabilities in generating high-quality images that are semantically coherent with given texts, which presents an opportunity to generate synthetic training images for TIC. However, we identify a challenge: images generated from simple descriptions typically exhibit a single perspective with one or only a few contexts, which does not match the complexity of real-world scenes. In this paper, we propose a novel framework that addresses this issue by introducing multi-context data generation. Starting with an initial text corpus, our framework employs a large language model to select multiple sentences that describe the same scene from various perspectives. These sentences are then summarized into a single sentence with multiple contexts. Using diffusion models, we generate simple images from the original, straightforward sentences and complex images from the summarized sentences. Finally, we train the model exclusively on the synthetic image-text pairs obtained from this process. Experimental results demonstrate that our proposed framework effectively tackles the central challenge we have identified, achieving state-of-the-art performance on popular datasets such as MSCOCO, Flickr30k, and SS1M.
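The data-generation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the prompts, the language model, or the diffusion model, so the selection step is replaced here by a simple word-overlap heuristic and the summarization step by plain concatenation. All names (`GenerationRequest`, `select_same_scene`, `summarize`, `build_requests`) are hypothetical, and the output is only the list of prompts that would be sent to a text-to-image model.

```python
from dataclasses import dataclass
from typing import Callable, List, Set

# Tiny stopword list for the overlap heuristic (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "on", "in", "of"}


@dataclass
class GenerationRequest:
    prompt: str  # text that would be passed to a text-to-image model
    kind: str    # "simple" (original sentence) or "complex" (merged sentence)


def tokenize(sentence: str) -> Set[str]:
    """Lowercase content words with trailing punctuation stripped."""
    words = {w.strip(".,").lower() for w in sentence.split()}
    return words - STOPWORDS


def select_same_scene(anchor: str, corpus: List[str], k: int = 2) -> List[str]:
    """Stand-in for the LLM selection step: pick up to k sentences that
    share the most content words with the anchor sentence."""
    anchor_words = tokenize(anchor)
    scored = [(len(anchor_words & tokenize(s)), s) for s in corpus if s != anchor]
    scored = [(score, s) for score, s in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [anchor] + [s for _, s in scored[:k]]


def summarize(sentences: List[str]) -> str:
    """Stand-in for the LLM summarization step: concatenate the sentences."""
    return " ".join(sentences)


def build_requests(
    corpus: List[str],
    k: int = 2,
    select: Callable[[str, List[str], int], List[str]] = select_same_scene,
    merge: Callable[[List[str]], str] = summarize,
) -> List[GenerationRequest]:
    """Produce simple prompts from single sentences and complex prompts
    from merged multi-context sentences."""
    requests = [GenerationRequest(s, "simple") for s in corpus]
    for anchor in corpus:
        group = select(anchor, corpus, k)
        if len(group) > 1:  # only merge when related sentences were found
            requests.append(GenerationRequest(merge(group), "complex"))
    return requests


corpus = [
    "A dog runs on the beach.",
    "A dog chases a ball on the sand.",
    "A man reads a newspaper.",
]
requests = build_requests(corpus)
for r in requests:
    print(f"{r.kind}: {r.prompt}")
```

In a fuller pipeline, `select` and `merge` would be replaced by calls to a large language model, each prompt would be rendered by a diffusion model, and the resulting image-text pairs would form the training set for the captioning model.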

updated: Mon May 29 2023 13:18:59 GMT+0000 (UTC)

published: Mon May 29 2023 13:18:59 GMT+0000 (UTC)

arXiv
