A Recipe for Efficient SBIR Models: Combining Relative Triplet Loss with Batch Normalization and Knowledge Distillation

Omar Seddati; Nathan Hubens; Stéphane Dupont; Thierry Dutoit

Sketch-Based Image Retrieval (SBIR) is a crucial task in multimedia retrieval, where the goal is to retrieve a set of images that match a given sketch query. Researchers have already proposed several well-performing solutions for this task, but most focus on enhancing the embeddings through approaches such as triplet loss, quadruplet loss, additional data augmentation, and edge extraction. In this work, we tackle the problem from several angles. We start by examining the quality of the training data and show some of its limitations. Then, we introduce the Relative Triplet Loss (RTL), an adapted triplet loss that overcomes those limitations by weighting the loss based on anchor similarity. Through a series of experiments, we demonstrate that replacing a triplet loss with RTL outperforms the previous state of the art without the need for any data augmentation. In addition, we demonstrate why batch normalization is better suited to SBIR embeddings than l2-normalization and show that it significantly improves the performance of our models. We further investigate the model capacity required for the photo and sketch domains and demonstrate that the photo encoder requires a higher capacity than the sketch encoder, which validates the hypothesis formulated in [34]. Then, we propose a straightforward approach to train small models, such as ShuffleNetv2 [22], efficiently and with only a marginal loss of accuracy through knowledge distillation. The same approach used with larger models enabled us to outperform previous state-of-the-art results and achieve a recall of 62.38% at k = 1 on The Sketchy Database [30].
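
For readers unfamiliar with the two quantities the abstract relies on, the following is a minimal sketch of the standard triplet loss (the baseline that RTL adapts) and of recall at k, the reported metric. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the RTL weighting formula, so it is not reproduced here, and the function names, the Euclidean distance, the margin value, and the toy embeddings are all hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    This is the plain baseline, not RTL; the margin of 1.0 is an assumption.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def recall_at_k(sketch_embs, photo_embs, true_photo_idx, k=1):
    """Fraction of sketch queries whose matching photo is among the k nearest photos."""
    hits = 0
    for query, target in zip(sketch_embs, true_photo_idx):
        # Rank all photo indices by distance to the sketch query.
        ranked = sorted(
            range(len(photo_embs)),
            key=lambda i: euclidean(query, photo_embs[i]),
        )
        if target in ranked[:k]:
            hits += 1
    return hits / len(sketch_embs)


# Hypothetical 2-D embeddings: three photos and one sketch query per photo.
photos = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
sketches = [[0.1, 0.0], [0.9, 0.1], [0.6, 0.3]]
targets = [0, 1, 2]  # index of the photo each sketch was drawn from

recall_1 = recall_at_k(sketches, photos, targets, k=1)
recall_3 = recall_at_k(sketches, photos, targets, k=3)
```

In this toy setup the third sketch lies closer to the wrong photos, so only two of the three queries retrieve their photo at rank 1, while all three succeed at k = 3.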

updated: Tue May 30 2023 12:41:04 GMT+0000 (UTC)

published: Tue May 30 2023 12:41:04 GMT+0000 (UTC)

arXiv
