USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval

Yan Zhang; Zhong Ji; Di Wang; Yanwei Pang; Xuelong Li

As a fundamental and challenging task in bridging the language and vision domains, Image-Text Retrieval (ITR) aims to retrieve target instances from the other modality that are semantically relevant to a given query; its key challenge is measuring semantic similarity across modalities. Although significant progress has been achieved, existing approaches typically suffer from two major limitations: (1) They hurt representation accuracy by directly using bottom-up attention based region-level features, in which every region is treated equally. (2) They limit the scale of negative sample pairs by employing a mini-batch based end-to-end training mechanism. To address these limitations, we propose a Unified Semantic Enhancement Momentum Contrastive Learning (USER) method for ITR. Specifically, we carefully design two simple but effective Global representation based Semantic Enhancement (GSE) modules. One, denoted the Self-Guided Enhancement (SGE) module, learns the global representation via a self-attention algorithm. The other, denoted the CLIP-Guided Enhancement (CGE) module, benefits from the pre-trained CLIP model and provides a novel scheme to exploit and transfer knowledge from an off-the-shelf model. Moreover, we incorporate the training mechanism of MoCo into ITR, employing two dynamic queues to enrich and enlarge the scale of negative sample pairs. Meanwhile, a Unified Training Objective (UTO) is developed to learn from both mini-batch based and dynamic queue based samples. Extensive experiments on the benchmark MSCOCO and Flickr30K datasets demonstrate the superiority of our method in both retrieval accuracy and inference efficiency. Our source code will be released at https://github.com/zhangy0822/USER.
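
The queue-based negative sampling mentioned in the abstract can be illustrated with a minimal sketch. It shows a generic MoCo-style setup: a momentum (exponential moving average) update for a key encoder, two fixed-size FIFO queues of past keys (one per modality), and an InfoNCE loss that uses the queued keys as extra negatives. This is not the paper's UTO or its GSE modules. All function names, the queue size, the momentum value, and the temperature are illustrative assumptions, and plain Python lists stand in for encoder outputs.

```python
import math
from collections import deque

def momentum_update(key_params, query_params, m=0.999):
    """EMA update of key-encoder parameters: key <- m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """Cross-entropy over [positive, negatives], with the positive at index 0."""
    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, k) / temperature for k in negative_keys]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[0]

# Two FIFO queues of past key embeddings, one per modality (assumed size).
QUEUE_SIZE = 4
text_queue = deque(maxlen=QUEUE_SIZE)
image_queue = deque(maxlen=QUEUE_SIZE)

def training_step(image_q, text_q, image_k, text_k):
    """Loss for one image-text pair in both directions, then enqueue the keys."""
    loss_i2t = info_nce(image_q, text_k, list(text_queue))
    loss_t2i = info_nce(text_q, image_k, list(image_queue))
    text_queue.append(text_k)    # the oldest key is dropped once the queue is full
    image_queue.append(image_k)
    return loss_i2t + loss_t2i

# Toy run: the same unit vector plays query and key for both modalities.
# In a real setup, queries and keys would come from separate encoders.
pairs = [l2_normalize([math.cos(t), math.sin(t)]) for t in range(6)]
losses = [training_step(p, p, p, p) for p in pairs]

# Toy parameter vectors showing the slow drift of the key encoder.
key_params = momentum_update([1.0, 2.0], [0.0, 0.0], m=0.5)
```

Because the queues are bounded with `maxlen`, the number of negatives per step is decoupled from the mini-batch size, which is the property the abstract attributes to the dynamic queues. A full implementation would also include in-batch negatives and compute gradients only through the query encoders.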

updated: Tue Jan 17 2023 12:42:58 GMT+0000 (UTC)

published: Tue Jan 17 2023 12:42:58 GMT+0000 (UTC)

arXiv
