Zero-Shot Everything Sketch-Based Image Retrieval, and in Explainable Style

Fengyin Lin; Mingkang Li; Da Li; Timothy Hospedales; Yi-Zhe Song; Yonggang Qi

This paper studies the problem of zero-shot sketch-based image retrieval (ZS-SBIR), with two significant differentiators from prior art: (i) we tackle all variants of ZS-SBIR (inter-category, intra-category, and cross-dataset) with just one network ("everything"), and (ii) we aim to understand how this sketch-photo matching operates ("explainable"). Our key innovation lies in the realization that such a cross-modal matching problem can be reduced to comparisons of groups of key local patches, akin to the seasoned "bag-of-words" paradigm. With this change alone, we are able to achieve both of the aforementioned goals, with the added benefit of no longer requiring external semantic knowledge. Technically, ours is a transformer-based cross-modal network with three novel components: (i) a self-attention module with a learnable tokenizer that produces visual tokens corresponding to the most informative local regions, (ii) a cross-attention module that computes local correspondences between the visual tokens across the two modalities, and (iii) a kernel-based relation network that assembles local putative matches and produces an overall similarity metric for a sketch-photo pair. Experiments show that ours indeed delivers superior performance across all ZS-SBIR settings. The all-important explainable goal is elegantly achieved by visualizing cross-modal token correspondences and, for the first time, via sketch-to-photo synthesis by universal replacement of all matched photo patches. Code and model are available at https://github.com/buptLinfy/ZSE-SBIR.
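
The cross-attention and kernel-based scoring steps named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see their repository for that): it assumes the visual tokens are already given as plain feature vectors, omits all learned projections and the learned relation network, and substitutes fixed Gaussian kernels with hand-picked weights. The function names, `centers`, `weights`, and `sigma` are all hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Scaled dot-product attention weights: one row per query token.

    Row i says how strongly sketch token i attends to each photo token,
    which is the kind of local correspondence one could visualize.
    """
    scale = math.sqrt(len(keys[0]))
    return [softmax([dot(q, k) / scale for k in keys]) for q in queries]


def cosine(a, b):
    norm = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / norm if norm > 0 else 0.0


def kernel_pooled_score(sketch_tokens, photo_tokens,
                        centers=(1.0, 0.5, 0.0),
                        weights=(1.0, 0.5, 0.0),
                        sigma=0.25):
    """Pool all pairwise token similarities with Gaussian kernels.

    Assumption: a trained model would learn how to combine the pooled
    kernel responses; the fixed `weights` here are only a placeholder.
    """
    sims = [cosine(s, p) for s in sketch_tokens for p in photo_tokens]
    score = 0.0
    for mu, w in zip(centers, weights):
        pooled = sum(math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
                     for x in sims) / len(sims)
        score += w * pooled
    return score


# Toy tokens: two 2-D "patch" features per image.
sketch = [[1.0, 0.0], [0.0, 1.0]]
photo_match = [[0.9, 0.1], [0.1, 0.9]]
photo_other = [[-1.0, 0.0], [0.0, -1.0]]

attn = attention_weights(sketch, photo_match)
best_match = [row.index(max(row)) for row in attn]
score_match = kernel_pooled_score(sketch, photo_match)
score_other = kernel_pooled_score(sketch, photo_other)
```

In this toy setup, `best_match` gives the index of the photo token each sketch token attends to most, and retrieval would rank photos by their pooled score against the query sketch.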

updated: Sat Mar 25 2023 03:52:32 GMT+0000 (UTC)

published: Sat Mar 25 2023 03:52:32 GMT+0000 (UTC)

arXiv
