Paraphrasing Is All You Need for Novel Object Captioning

Cheng-Fu Yang; Yao-Hung Hubert Tsai; Wan-Cyuan Fan; Ruslan Salakhutdinov; Louis-Philippe Morency; Yu-Chiang Frank Wang

Novel object captioning (NOC) aims to describe images containing objects whose ground-truth captions are never observed during training. Because caption annotations are absent, captioning models cannot be directly optimized via sequence-to-sequence training or CIDEr optimization. We therefore present Paraphrasing-to-Captioning (P2C), a two-stage learning framework for NOC that heuristically optimizes the output captions via paraphrasing. With P2C, the captioning model first learns paraphrasing from a language model pre-trained on a text-only corpus, which expands its word bank and improves linguistic fluency. To further ensure that the output caption sufficiently describes the visual content of the input image, we perform self-paraphrasing for the captioning model with fidelity and adequacy objectives. Since no ground-truth captions are available for novel object images during training, P2C leverages cross-modality (image-text) association modules to ensure that these caption characteristics are properly preserved. In the experiments, we not only show that P2C achieves state-of-the-art performance on the nocaps and COCO Caption datasets, but also verify the effectiveness and flexibility of our learning framework by replacing the language and cross-modality association models for NOC. Implementation details and code are available in the supplementary materials.

updated: 2022-09-25 22:56:04 UTC

published: 2022-09-25 22:56:04 UTC

arXiv
