The style transformer with common knowledge optimization for image-text retrieval

Wenrui Li; Zhengyu Ma; Xiaopeng Fan

Image-text retrieval, which associates different modalities, has drawn broad attention because of its research value and wide range of real-world applications. Although algorithms continue to improve, most do not fully consider high-level semantic relationships ("style embedding") or the common knowledge shared across modalities. To this end, we propose a novel style transformer network with common knowledge optimization (CKSTN) for image-text retrieval. Its main module is the common knowledge adaptor (CKA), which comprises the style embedding extractor (SEE) and the common knowledge optimization (CKO) module. Specifically, the SEE is designed to effectively extract high-level features, and the CKO module is introduced to dynamically capture the latent concepts of common knowledge from different modalities. Together, they assist in forming item representations in lightweight transformers. In addition, to obtain generalized temporal common knowledge, we propose a sequential update strategy that effectively integrates the features of different layers in the SEE with previous common feature units. CKSTN outperforms state-of-the-art methods in image-text retrieval on the MSCOCO and Flickr30K datasets. Moreover, CKSTN is more convenient and practical for real-world applications because it achieves better performance with fewer parameters.
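For readers unfamiliar with the task, the retrieval step itself can be illustrated with a minimal sketch. It assumes that images and captions have already been encoded into vectors of the same dimension and that candidates are ranked by cosine similarity, a common choice that the abstract does not confirm for CKSTN. The function names and toy embeddings below are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def rank_candidates(query, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy embeddings (assumption: any encoder could have produced them).
image_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
text_embedding = [0.9, 0.1, 0.0]

# Text-to-image retrieval: rank all images for one caption.
ranking = rank_candidates(text_embedding, image_embeddings)
```

Image-to-text retrieval works the same way with the roles swapped: an image embedding is the query and the caption embeddings are the candidates.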

updated: Wed Mar 01 2023 12:17:33 GMT+0000 (UTC)

published: Wed Mar 01 2023 12:17:33 GMT+0000 (UTC)

arXiv

