Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment

Junyang Wang; Yi Zhang; Ming Yan; Ji Zhang; Jitao Sang

CLIP (Contrastive Language-Image Pre-Training) has shown remarkable zero-shot transfer capabilities in cross-modal correlation tasks such as visual classification and image retrieval. However, its performance in cross-modal generation tasks such as zero-shot image captioning remains unsatisfactory. In this work, we argue that directly employing CLIP for zero-shot image captioning relies more on the textual modality in context and largely ignores the visual information, a phenomenon we call the contextual language prior. To address this, we propose Cross-modal Language Models (CLMs) to facilitate unsupervised cross-modal learning. We further propose Anchor Augment to guide the generative model's attention to the fine-grained information in the representation of CLIP. Experiments on MS COCO and Flickr 30K validate the promising performance of the proposed approach in both captioning quality and computational efficiency.

updated: Mon Nov 14 2022 11:12:19 GMT+0000 (UTC)

published: Mon Nov 14 2022 11:12:19 GMT+0000 (UTC)

arXiv

