BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid Counterfactual Training for Robust Content-based Image Retrieval

Wenqiao Zhang; Jiannan Guo; Mengze Li; Haochen Shi; Shengyu Zhang; Juncheng Li; Siliang Tang; Yueting Zhuang

Content-Based Image Retrieval (CIR) aims to retrieve a target image by jointly comprehending the composition of an example image and a complementary text, which could benefit a wide variety of real-world applications, such as internet search and fashion retrieval. In this scenario, the input image serves as an intuitive context and background for the search, while the accompanying text explicitly specifies how particular characteristics of the query image should be modified to obtain the intended target image. This task is challenging because it requires learning and understanding a composite image-text representation that incorporates cross-granular semantic updates. In this paper, we tackle this task with a novel Bottom-up crOss-modal Semantic compoSition (BOSS) with Hybrid Counterfactual Training framework, which sheds new light on the CIR task by studying it from two previously overlooked perspectives: implicit bottom-up composition of the visiolinguistic representation and explicit fine-grained correspondence in query-target construction. On the one hand, we leverage the implicit interaction and composition of cross-modal embeddings from the bottom local characteristics to the top global semantics, preserving and transforming the visual representation conditioned on language semantics in several continuous steps for effective target image search. On the other hand, we devise a hybrid counterfactual training strategy that can reduce the model's ambiguity for similar queries.

updated: Sat Jul 09 2022 07:14:44 GMT+0000 (UTC)

published: Sat Jul 09 2022 07:14:44 GMT+0000 (UTC)

arXiv
