The style transformer with common knowledge optimization for image-text retrieval

Wenrui Li; Zhengyu Ma; Jinqiao Shi; Xiaopeng Fan

Image-text retrieval, which associates different modalities, has drawn broad attention due to its research value and wide range of real-world applications. However, most existing methods do not fully consider high-level semantic relationships ("style embedding") or common knowledge across modalities. To this end, we introduce a novel style transformer network with common knowledge optimization (CKSTN) for image-text retrieval. Its main module is the common knowledge adaptor (CKA), which comprises the style embedding extractor (SEE) and the common knowledge optimization (CKO) module. Specifically, the SEE uses a sequential update strategy to effectively connect the features of its different stages. The CKO module is introduced to dynamically capture the latent concepts of common knowledge from different modalities. In addition, to obtain generalized temporal common knowledge, we propose a sequential update strategy that integrates the features of different SEE layers with previous common feature units. On the MSCOCO and Flickr30K datasets, CKSTN outperforms state-of-the-art image-text retrieval methods. Moreover, CKSTN is built on a lightweight transformer; with better performance and fewer parameters, it is more convenient and practical for real-world applications.
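For readers unfamiliar with the task, the following is a minimal sketch of what image-text retrieval amounts to once embeddings exist: candidates from one modality are ranked by similarity to a query from the other. It is not the CKSTN model. The cosine similarity measure, the function names `cosine` and `rank_candidates`, and the toy vectors are all assumptions made for illustration and are not taken from the paper.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(query: Sequence[float],
                    candidates: Sequence[Sequence[float]]) -> List[int]:
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy embeddings (illustrative only, not from the paper):
# three caption vectors and one image vector in a shared 3-dimensional space.
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
image_embedding = [0.9, 0.1, 0.0]

# Image-to-text retrieval: rank captions for the image query.
ranking = rank_candidates(image_embedding, text_embeddings)
print(ranking)
```

In this toy setup the first caption is ranked highest because its vector points in nearly the same direction as the image vector. A model such as the one described above would be responsible for producing the embeddings; the ranking step itself stays this simple.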

updated: Mon Apr 03 2023 11:17:11 GMT+0000 (UTC)

published: Wed Mar 01 2023 12:17:33 GMT+0000 (UTC)

arXiv
