Conditional Cross Attention Network for Multi-Space Embedding without Entanglement in Only a SINGLE Network

Chull Hwan Song; Taebaek Hwang; Jooyoung Yoon; Shunghyun Choi; Yeong Hyeon Gu

Many studies in vision tasks have aimed to create effective embedding spaces for single-label object prediction within an image. In reality, however, most objects possess multiple specific attributes, such as shape, color, and length, and each attribute is composed of various classes. To apply models in real-world scenarios, it is essential to distinguish between the granular components of an object. Conventional approaches that embed multiple specific attributes into a single network often result in entanglement, where the fine-grained features of each attribute cannot be identified separately. To address this problem, we propose a Conditional Cross-Attention Network that induces disentangled multi-space embeddings for various specific attributes with only a single backbone. First, we employ a cross-attention mechanism to fuse and switch the information of conditions (specific attributes), and we demonstrate its effectiveness through diverse visualization examples. Second, we apply the vision transformer to a fine-grained image retrieval task for the first time and present a framework that is simple yet effective compared to existing methods. Unlike previous studies, where performance varied depending on the benchmark dataset, our proposed method achieves consistent state-of-the-art performance on the FashionAI, DARN, DeepFashion, and Zappos50K benchmark datasets.
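
The idea of switching between attribute-specific embeddings with a single backbone can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give architectural details, so the sketch assumes that a condition (attribute) embedding acts as the query and the backbone's patch tokens act as keys and values. The names `cross_attention`, `patch_tokens`, and `condition_queries` are hypothetical, learned projection matrices are omitted, and random vectors stand in for real features.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(query, tokens):
    """Single-head cross-attention: one query vector attends over tokens.

    Assumption: query/key/value projections are identity to keep this short.
    Returns the attention-pooled vector and the attention weights.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, t) / scale for t in tokens])
    dim = len(tokens[0])
    pooled = [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(dim)]
    return pooled, weights


rng = random.Random(0)
DIM, NUM_PATCHES = 8, 6

# Stand-in for the patch tokens a single shared backbone would produce.
patch_tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]

# Stand-in for learned condition embeddings, one per attribute (hypothetical).
condition_queries = {
    "color": [rng.gauss(0, 1) for _ in range(DIM)],
    "shape": [rng.gauss(0, 1) for _ in range(DIM)],
}

# Same image tokens, different condition: each condition yields its own embedding.
embeddings, attn = {}, {}
for name, query in condition_queries.items():
    embeddings[name], attn[name] = cross_attention(query, patch_tokens)
```

Because only the query changes between conditions, the backbone output is computed once and each attribute reads out its own embedding from it, which is one plausible way to picture the "fuse and switch" behavior described above.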

updated: Tue Jul 25 2023 04:48:03 GMT+0000 (UTC)

published: Tue Jul 25 2023 04:48:03 GMT+0000 (UTC)

arXiv
