Understanding Guided Image Captioning Performance across Domains

Edwin G. Ng; Bo Pang; Piyush Sharma; Radu Soricut

Image captioning models generally lack the capability to take into account user interest, and usually default to global descriptions that try to balance readability, informativeness, and information overload. On the other hand, VQA models generally lack the ability to provide long descriptive answers, while expecting the textual question to be quite precise. We present a method to control the concepts that an image caption should focus on, using an additional input called the guiding text that refers to either groundable or ungroundable concepts in the image. Our model consists of a Transformer-based multimodal encoder that uses the guiding text together with global and object-level image features to derive early-fusion representations used to generate the guided caption. While models trained on Visual Genome data have an in-domain advantage of fitting well when guided with automatic object labels, we find that guided captioning models trained on Conceptual Captions generalize better on out-of-domain images and guiding texts. Our human-evaluation results indicate that attempting in-the-wild guided image captioning requires access to large, unrestricted-domain training datasets, and that increased style diversity (even without increasing the number of unique tokens) is a key factor for improved performance.
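
The early-fusion idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give layer sizes, projections, or the attention configuration, so everything below is an assumption made for illustration (the function names, the feature dimensions, the random projection weights, and the use of a single self-attention step with identity query/key/value projections in place of a full Transformer encoder). The sketch only shows the core mechanism: guiding-text embeddings, one global image feature, and object-level features are projected to a common size, concatenated into one sequence, and mixed by self-attention so that every position can attend across modalities.

```python
import math
import random


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned projection weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def project(vec, weight):
    """Linear projection of one feature vector into the shared model dimension."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Single-head scaled dot-product self-attention.

    Simplification (assumption): queries, keys, and values are the inputs
    themselves, with no learned projections, residuals, or feed-forward layer.
    """
    dim = len(seq[0])
    mixed = []
    for query in seq:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in seq]
        weights = softmax(scores)
        mixed.append([sum(w * value[i] for w, value in zip(weights, seq)) for i in range(dim)])
    return mixed


def early_fusion(text_emb, global_feat, object_feats, d_model=8, seed=0):
    """Fuse guiding text, global image feature, and object features in one sequence."""
    rng = random.Random(seed)
    w_text = random_matrix(d_model, len(text_emb[0]), rng)
    w_global = random_matrix(d_model, len(global_feat), rng)
    w_object = random_matrix(d_model, len(object_feats[0]), rng)

    # Early fusion: all modalities share one input sequence before encoding.
    fused = [project(token, w_text) for token in text_emb]
    fused.append(project(global_feat, w_global))
    fused += [project(obj, w_object) for obj in object_feats]
    return self_attention(fused)


# Toy inputs with made-up sizes: 3 guiding-text tokens, 1 global feature, 2 objects.
rng = random.Random(1)
text = [[rng.random() for _ in range(4)] for _ in range(3)]
global_feature = [rng.random() for _ in range(6)]
objects = [[rng.random() for _ in range(5)] for _ in range(2)]

encoded = early_fusion(text, global_feature, objects)
print(len(encoded), len(encoded[0]))  # 6 8
```

The output has one vector per input position (3 text tokens + 1 global feature + 2 objects = 6), each of size `d_model`. In a captioning model of the kind the abstract describes, a decoder would attend over such fused representations to generate the guided caption; that decoder is omitted here.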

updated: Sat Sep 11 2021 18:57:22 GMT+0000 (UTC)

published: Fri Dec 04 2020 00:05:02 GMT+0000 (UTC)

arXiv
