CLIP for All Things Zero-Shot Sketch-Based Image Retrieval, Fine-Grained or Not

Aneeshan Sain; Ayan Kumar Bhunia; Pinaki Nath Chowdhury; Subhadeep Koley; Tao Xiang; Yi-Zhe Song

In this paper, we leverage CLIP for zero-shot sketch-based image retrieval (ZS-SBIR). We are largely inspired by recent advances in foundation models and the unparalleled generalisation ability they seem to offer, and for the first time tailor them to benefit the sketch community. We put forward novel designs on how best to achieve this synergy, for both the category-level setting and the fine-grained setting ("all"). At the core of our solution is a prompt learning setup. First, we show that simply by factoring in sketch-specific prompts, we already obtain a category-level ZS-SBIR system that outperforms all prior art by a large margin (24.8%), a strong testimony to the value of studying the CLIP and ZS-SBIR synergy. Moving on to the fine-grained setup is, however, trickier and requires a deeper dive into this synergy. To that end, we propose two specific designs to tackle the fine-grained matching nature of the problem: (i) an additional regularisation loss to ensure that the relative separation between sketches and photos is uniform across categories, which is not the case for the gold-standard standalone triplet loss, and (ii) a clever patch shuffling technique to help establish instance-level structural correspondences between sketch-photo pairs. With these designs, we again observe significant performance gains, in the region of 26.9%, over the previous state of the art. The take-home message, if any, is that the proposed CLIP and prompt learning paradigm carries great promise for tackling other sketch-related tasks (not limited to ZS-SBIR) where data scarcity remains a great challenge. Code and models will be made available.
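
As a rough illustration of two ingredients the abstract names, the sketch below shows a standard margin-based triplet loss and a simple patch shuffling routine in plain Python. It is not the authors' implementation: the function names, the margin value, the squared Euclidean distance, and the idea of applying one shared permutation to a sketch and its photo are all assumptions made for illustration, since the abstract does not specify how shuffled patches enter the training objective. The regularisation loss is not sketched because the abstract gives no formula for it.

```python
def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss on squared Euclidean distances between embeddings.

    The margin value and the distance choice are illustrative assumptions.
    """
    d_pos = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_neg = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    return max(0.0, margin + d_pos - d_neg)


def split_patches(image, n):
    """Split a 2D grid (list of rows) into n*n patches, in row-major order.

    Assumes the height and width of the grid are divisible by n.
    """
    ph, pw = len(image) // n, len(image[0]) // n
    return [
        [row[c * pw:(c + 1) * pw] for row in image[r * ph:(r + 1) * ph]]
        for r in range(n)
        for c in range(n)
    ]


def merge_patches(patches, n):
    """Reassemble n*n row-major patches into a single 2D grid."""
    rows = []
    for r in range(n):
        band = patches[r * n:(r + 1) * n]  # the n patches in one patch-row
        for i in range(len(band[0])):
            rows.append([value for patch in band for value in patch[i]])
    return rows


def shuffle_patches(image, n, perm):
    """Rearrange the n*n patches of a grid according to the permutation perm."""
    patches = split_patches(image, n)
    return merge_patches([patches[i] for i in perm], n)


# Hypothetical usage: apply the same permutation to a sketch and its photo,
# so that corresponding regions stay aligned after shuffling.
sketch = [[1, 2], [3, 4]]
photo = [[10, 20], [30, 40]]
perm = [3, 2, 1, 0]
shuffled_sketch = shuffle_patches(sketch, 2, perm)
shuffled_photo = shuffle_patches(photo, 2, perm)
```

Using one shared permutation for a sketch-photo pair keeps their patch-level correspondence intact, which is one plausible way to encourage a model to attend to local structure rather than global layout.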

updated: Thu Mar 23 2023 17:02:00 GMT+0000 (UTC)

published: Thu Mar 23 2023 17:02:00 GMT+0000 (UTC)

arXiv
