More Than Just Attention: Improving Cross-Modal Attentions with Contrastive Constraints for Image-Text Matching

Yuxiao Chen; Jianbo Yuan; Long Zhao; Tianlang Chen; Rui Luo; Larry Davis; Dimitris N. Metaxas


Cross-modal attention mechanisms have been widely applied to the image-text matching task and have achieved remarkable improvements thanks to their capability of learning fine-grained relevance across different modalities. However, the cross-modal attention models of existing methods can be sub-optimal and inaccurate because no direct supervision is provided for the attention during training. In this work, we propose two novel training strategies, the Contrastive Content Re-sourcing (CCR) and Contrastive Content Swapping (CCS) constraints, to address this limitation. These constraints supervise the training of cross-modal attention models in a contrastive learning manner without requiring explicit attention annotations. They are plug-in training strategies and can be easily integrated into existing cross-modal attention models. Additionally, we introduce three metrics, Attention Precision, Attention Recall, and Attention F1-Score, to quantitatively measure the quality of learned attention models. We evaluate the proposed constraints by incorporating them into four state-of-the-art cross-modal attention-based image-text matching models. Experimental results on both the Flickr30k and MS-COCO datasets demonstrate that integrating these constraints improves the models in terms of both retrieval performance and attention metrics.
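The abstract names the three attention metrics but does not define them. As a rough illustration only, the sketch below computes precision, recall, and F1 in the standard set-based sense, under the assumption that a region counts as "attended" when its attention weight reaches a threshold and that a set of truly relevant region indices is available. The function name `attention_prf`, the `threshold` parameter, and the thresholding rule itself are hypothetical and may differ from the paper's actual definitions.

```python
from typing import Sequence, Set, Tuple


def attention_prf(
    weights: Sequence[float],
    relevant: Set[int],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 for one attention distribution.

    weights:  attention weight per image region (for one query word).
    relevant: indices of regions assumed to be truly relevant.
    """
    # Regions whose weight reaches the threshold are treated as attended.
    attended = {i for i, w in enumerate(weights) if w >= threshold}
    true_positives = len(attended & relevant)

    precision = true_positives / len(attended) if attended else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Regions 0 and 2 are attended; regions 0 and 1 are relevant.
    print(attention_prf([0.6, 0.1, 0.7, 0.2], relevant={0, 1}))
```

In this toy call, one of the two attended regions is relevant and one of the two relevant regions is attended, so all three values come out to 0.5.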

updated: Mon Oct 03 2022 21:48:05 GMT+0000 (UTC)

published: Thu May 20 2021 08:48:10 GMT+0000 (UTC)

arXiv

