FuseCap: Leveraging Large Language Models to Fuse Visual Data into Enriched Image Captions

Noam Rotstein; David Bensaid; Shaked Brody; Roy Ganz; Ron Kimmel

Image captioning is a central task in computer vision that has seen substantial progress since the advent of vision-language pre-training techniques. In this paper, we highlight a frequently overlooked limitation of captioning models: they often fail to capture semantically significant elements. This drawback can be traced back to the text-image datasets; while their captions typically offer a general depiction of image content, they frequently omit salient details. To mitigate this limitation, we propose FuseCap, a novel method for enriching captions with additional visual information obtained from vision experts such as object detectors, attribute recognizers, and optical character recognizers (OCR). Our approach fuses the outputs of such vision experts with the original caption using a large language model (LLM), yielding enriched captions that present a comprehensive image description. We validate the effectiveness of the proposed caption enrichment method through both quantitative and qualitative analysis. Our method is then used to curate the training set of a BLIP-based captioning model, which surpasses current state-of-the-art approaches in generating accurate and detailed captions while using significantly fewer parameters and training data. As additional contributions, we provide a dataset comprising 12M pairs of images and enriched captions, and show that the proposed method substantially improves image-text retrieval.
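The fusion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the data classes, the prompt wording, and the function names `build_fusion_prompt` and `enrich_caption` are all hypothetical, and the LLM is represented by an arbitrary callable so that no particular model or API is assumed. The sketch only shows the general idea of serializing vision-expert outputs next to the original caption and handing the result to a language model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DetectedObject:
    """One object from a (hypothetical) detector, with recognized attributes."""
    label: str
    attributes: List[str] = field(default_factory=list)


@dataclass
class ExpertOutputs:
    """Vision-expert outputs for a single image (assumed structure)."""
    objects: List[DetectedObject] = field(default_factory=list)
    ocr_texts: List[str] = field(default_factory=list)


def build_fusion_prompt(caption: str, experts: ExpertOutputs) -> str:
    """Serialize the original caption and the expert outputs into one prompt.

    The prompt wording is an assumption made for illustration only.
    """
    # Put attributes before the label, e.g. "red double-decker bus".
    object_phrases = [" ".join(obj.attributes + [obj.label]) for obj in experts.objects]
    objects_line = "; ".join(object_phrases) if object_phrases else "none"
    ocr_line = "; ".join(experts.ocr_texts) if experts.ocr_texts else "none"
    return "\n".join([
        "Original caption: " + caption,
        "Objects: " + objects_line,
        "Text in image: " + ocr_line,
        "Rewrite the caption so that it also mentions the objects and text above.",
    ])


def enrich_caption(caption: str, experts: ExpertOutputs,
                   llm: Callable[[str], str]) -> str:
    """Ask the supplied LLM callable to fuse everything into one caption."""
    return llm(build_fusion_prompt(caption, experts)).strip()


# Usage with a stand-in "LLM" that simply echoes the prompt it receives.
example = ExpertOutputs(
    objects=[DetectedObject("bus", ["red", "double-decker"])],
    ocr_texts=["LONDON"],
)
print(enrich_caption("A bus on a street.", example, llm=lambda prompt: prompt))
```

Passing the LLM as a callable keeps the sketch independent of any specific model; in practice the callable would wrap whatever language model is used for fusion.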

updated: Sun May 28 2023 13:16:03 GMT+0000 (UTC)

published: Sun May 28 2023 13:16:03 GMT+0000 (UTC)

arXiv
