Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning

Chia-Wen Kuo; Zsolt Kira

Significant progress has been made in visual captioning, largely by relying on pre-trained features and, later, on fixed object detectors that serve as rich inputs to auto-regressive models. A key limitation of such methods, however, is that the output of the model is conditioned only on the object detector's outputs. The assumption that such outputs can represent all necessary information is unrealistic, especially when the detector is transferred across datasets. In this work, we reason about the graphical model induced by this assumption, and propose to add an auxiliary input to represent missing information such as object relationships. We specifically propose to mine attributes and relationships from the Visual Genome dataset and condition the captioning model on them. Crucially, we propose (and show to be important) the use of a multi-modal pre-trained model (CLIP) to retrieve such contextual descriptions. Further, object detector models are frozen and do not have sufficient richness to allow the captioning model to properly ground them. As a result, we propose to condition both the detector and description outputs on the image, and show qualitatively and quantitatively that this can improve grounding. We validate our method on image captioning, perform thorough analyses of each component and of the importance of the pre-trained multi-modal model, and demonstrate significant improvements over the current state of the art, specifically +7.5% in CIDEr and +1.3% in BLEU-4.
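
The retrieval step mentioned in the abstract (using a multi-modal model to fetch contextual descriptions for an image) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that image and text embeddings have already been computed in a shared space by some model such as CLIP, and it ranks candidate descriptions by cosine similarity. The function names, the toy 3-dimensional vectors, and the example phrases are all hypothetical stand-ins, not data from Visual Genome.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_descriptions(image_emb, text_bank, k=2):
    """Return the k descriptions whose embeddings best match image_emb.

    text_bank is a list of (description, embedding) pairs. In a real
    system the embeddings would come from a multi-modal encoder; here
    they are hand-written toy vectors (an assumption for illustration).
    """
    scored = [(cosine(image_emb, emb), text) for text, emb in text_bank]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hypothetical embeddings in a shared image-text space.
image_emb = [0.9, 0.1, 0.0]
text_bank = [
    ("dog on grass", [1.0, 0.0, 0.0]),
    ("red car", [0.0, 1.0, 0.0]),
    ("person holding umbrella", [0.0, 0.0, 1.0]),
    ("brown dog", [0.8, 0.3, 0.0]),
]

top = retrieve_descriptions(image_emb, text_bank, k=2)
print(top)
```

In this sketch the retrieved descriptions would then be passed to the captioning model as an auxiliary input alongside the detector outputs; how that conditioning is done is specified in the paper itself and is not reproduced here.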

updated: Mon May 09 2022 15:05:24 GMT+0000 (UTC)

published: Mon May 09 2022 15:05:24 GMT+0000 (UTC)

arXiv
