A Better Loss for Visual-Textual Grounding

Davide Rigoni; Luciano Serafini; Alessandro Sperduti

Given a textual phrase and an image, the visual grounding problem is the task of locating the content of the image referenced by the phrase. It is a challenging task with several real-world applications in human-computer interaction, image-text reference resolution, and video-text reference resolution. In recent years, several works have addressed this problem by proposing increasingly large and complex models that try to capture visual-textual dependencies better than their predecessors. These models typically consist of two main components, which focus respectively on learning multi-modal features useful for grounding and on improving the predicted bounding box of the visual mention. Finding the right learning balance between these two sub-tasks is not easy, and current models are not necessarily optimal in this respect. In this work, we propose a loss function based on bounding box class probabilities that (i) improves bounding box selection and (ii) improves bounding box coordinate prediction. Our model, although it uses a simple multi-modal feature fusion component, achieves higher accuracy than state-of-the-art models on two widely adopted datasets and reaches a better learning balance between the two sub-tasks mentioned above.
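To make the two sub-tasks concrete, the following is a minimal sketch of a generic two-term grounding loss: a selection term (cross-entropy over candidate boxes) plus a weighted coordinate regression term (smooth L1). It is not the loss proposed in the paper, whose exact form the abstract does not give. The function names, the `weight` parameter, the IoU-based choice of target proposal, and the `(x1, y1, x2, y2)` box format are all assumptions made for illustration.

```python
import math

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def smooth_l1(pred, target):
    """Mean smooth L1 distance between two coordinate tuples."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total / len(pred)

def grounding_loss(logits, pred_boxes, proposals, gt_box, weight=1.0):
    """Hypothetical two-term loss: selection + weight * regression.

    logits:     one score per candidate proposal
    pred_boxes: one refined box per proposal
    proposals:  the candidate boxes themselves
    gt_box:     the ground-truth box for the phrase
    """
    # Assumption: the target proposal is the one overlapping the ground truth most.
    target = max(range(len(proposals)), key=lambda i: iou(proposals[i], gt_box))
    # Selection term: negative log-softmax of the target, computed stably.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(s - m) for s in logits))
    selection = log_z - logits[target]
    # Regression term: distance between the refined target box and the ground truth.
    regression = smooth_l1(pred_boxes[target], gt_box)
    return selection + weight * regression, selection, regression
```

In this sketch, `weight` is the knob that trades off the two sub-tasks: a larger value pushes training toward accurate coordinates, a smaller one toward picking the right candidate box.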

updated: Wed Feb 02 2022 10:57:29 GMT+0000 (UTC)

published: Wed Aug 11 2021 16:26:54 GMT+0000 (UTC)

arXiv
