Beyond the Deep Metric Learning: Enhance the Cross-Modal Matching with Adversarial Discriminative Domain Regularization

Li Ren; Kai Li; LiQiang Wang; Kien Hua

Matching information across image and text modalities is a fundamental challenge for many applications that involve both vision and natural language processing. The objective is to find efficient similarity metrics for comparing visual and textual information. Existing approaches mainly use attention mechanisms to match local visual objects with the words of a sentence in a shared space. Matching performance remains limited because the similarity computation relies on simple comparisons of the matched features and ignores how those features are distributed in the data. In this paper, we address this limitation with an efficient learning objective that considers the discriminative feature distributions between the visual objects and sentence words. Specifically, we propose a novel Adversarial Discriminative Domain Regularization (ADDR) learning framework that goes beyond the conventional metric learning objective by constructing a set of discriminative data domains within each image-text pair. Our approach can generally improve the learning efficiency and performance of existing metric learning frameworks by regulating the distribution of the hidden space between the matching pairs. The experimental results show that this new approach significantly improves the overall performance of several popular cross-modal matching techniques (SCAN, VSRN, BFAN) on the MS-COCO and Flickr30K benchmarks.
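For readers unfamiliar with the baseline that the abstract refers to as a metric learning objective, the sketch below shows one common form: cosine similarity in a shared embedding space combined with a bidirectional hinge-based ranking loss. This is an illustrative assumption, not the ADDR method and not necessarily the exact loss used by SCAN, VSRN, or BFAN; the function names, the toy embeddings, and the margin value of 0.2 are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(image_embs, text_embs):
    """sim[i][j] is the similarity between image i and text j."""
    return [[cosine(img, txt) for txt in text_embs] for img in image_embs]


def ranking_loss(sim, margin=0.2):
    """Bidirectional sum-of-hinges ranking loss (assumed baseline form).

    The diagonal sim[i][i] holds the matching image-text pairs; every
    off-diagonal entry is treated as a negative pair.
    """
    n = len(sim)
    loss = 0.0
    for i in range(n):
        positive = sim[i][i]
        for j in range(n):
            if j == i:
                continue
            # Image i should score higher with its own text than with text j.
            loss += max(0.0, margin - positive + sim[i][j])
            # Text i should score higher with its own image than with image j.
            loss += max(0.0, margin - positive + sim[j][i])
    return loss


# Toy two-dimensional embeddings (hypothetical data).
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = ranking_loss(similarity_matrix(images, aligned_texts))
swapped_loss = ranking_loss(similarity_matrix(images, swapped_texts))
```

A loss of this kind compares pairs only through their similarity scores, which is the limitation the abstract points to: it does not look at how the matched features are distributed.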

updated: Tue Oct 27 2020 23:42:21 GMT+0000 (UTC)

published: Fri Oct 23 2020 01:48:37 GMT+0000 (UTC)

arXiv
