Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations

Ziyan Yang; Kushal Kafle; Franck Dernoncourt; Vicente Ordonez

We propose a margin-based loss for vision-language model pretraining that encourages gradient-based explanations consistent with region-level annotations. We refer to this objective as Attention Mask Consistency (AMC) and demonstrate that it produces better visual grounding performance than models that instead use region-level annotations to explicitly train an object detector such as Faster R-CNN. AMC works by encouraging gradient-based explanation masks to concentrate their attention scores mostly within the annotated regions of interest, for images that contain such annotations. In particular, a model trained with AMC on top of standard vision-language modeling objectives obtains a state-of-the-art accuracy of 86.59% on the Flickr30k visual grounding benchmark, an absolute improvement of 5.48 percentage points over the best previous model. Our approach also performs exceedingly well on established benchmarks for referring expression comprehension and, by design, offers the added benefit of gradient-based explanations that better align with human annotations.
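The abstract describes AMC only at a high level: a margin-based loss that pushes explanation scores to lie mostly inside annotated regions. It does not give the formula. As a rough illustration of that idea, and not the authors' actual formulation, the sketch below penalizes a heatmap whenever its strongest score outside the annotated region comes within a margin of its strongest score inside. The function name `amc_style_margin_loss`, the hinge form, and the default margin of 0.5 are all assumptions made for this example. In practice the heatmap would come from a gradient-based explanation method applied to the model (for instance a GradCAM-style map), which is outside the scope of this sketch.

```python
def amc_style_margin_loss(heatmap, mask, margin=0.5):
    """Hinge-style penalty on an explanation heatmap (illustrative assumption).

    heatmap: 2D list of non-negative explanation scores.
    mask:    2D list of 0/1 values, 1 marking the annotated region.
    margin:  hypothetical margin by which the peak inside the region
             should exceed the peak outside it.
    """
    inside, outside = [], []
    for score_row, mask_row in zip(heatmap, mask):
        for score, flag in zip(score_row, mask_row):
            (inside if flag == 1 else outside).append(score)

    # With no annotated pixels (or no background pixels) there is nothing
    # to compare, so this sketch contributes no loss.
    if not inside or not outside:
        return 0.0

    # Zero loss once the inside peak beats the outside peak by the margin.
    return max(0.0, max(outside) - max(inside) + margin)


# Toy 2x2 heatmap with its peak (0.9) at the bottom-left cell.
heatmap = [[0.1, 0.2],
           [0.9, 0.3]]
mask_on_peak = [[0, 0], [1, 0]]   # annotation covers the peak
mask_off_peak = [[1, 0], [0, 0]]  # annotation covers a weak cell

loss_aligned = amc_style_margin_loss(heatmap, mask_on_peak)
loss_misaligned = amc_style_margin_loss(heatmap, mask_off_peak)
```

With these toy inputs the aligned case incurs no penalty, while the misaligned case is penalized because the heatmap peaks outside the annotated cell. In a training setup, a term like this would be added to the standard vision-language objectives only for images that carry region annotations, as the abstract describes.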

updated: Tue Jul 05 2022 17:28:52 GMT+0000 (UTC)

published: Thu Jun 30 2022 17:55:12 GMT+0000 (UTC)

arXiv
