Improving Deep Metric Learning by Divide and Conquer

Artsiom Sanakoyeu; Pingchuan Ma; Vadim Tschernezki; Björn Ommer

Deep metric learning (DML) is a cornerstone of many computer vision applications. It aims to learn a mapping from the input domain to an embedding space in which semantically similar objects lie close together and dissimilar objects lie far from one another. The target similarity on the training data is defined by the user in the form of ground-truth class labels. However, while the embedding space learns to mimic the user-provided similarity on the training data, it should also generalize to novel categories not seen during training. Besides the user-provided ground-truth training labels, many additional visual factors (such as viewpoint changes or shape peculiarities) exist and imply different notions of similarity between objects, which affects generalization to images unseen during training. Existing approaches, however, usually learn a single embedding space directly on all available training data; they struggle to encode all the different types of relationships and do not generalize well. We propose to build a more expressive representation by jointly splitting the embedding space and the data hierarchically into smaller sub-parts. We successively focus on smaller subsets of the training data, reducing their variance and learning a different embedding subspace for each data subset. Moreover, the subspaces are learned jointly to cover not only the intricacies but also the breadth of the data. Only after that do we build the final embedding from the subspaces in the conquering stage. The proposed algorithm acts as a transparent wrapper that can be placed around arbitrary existing DML methods. Our approach significantly improves upon the state of the art on image retrieval, clustering, and re-identification tasks, evaluated on the CUB200-2011, CARS196, Stanford Online Products, In-shop Clothes, and PKU VehicleID datasets.
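The division step described above can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not state: k-means (with a deterministic farthest-first initialization) stands in for the unspecified procedure that splits the data, the embedding is cut into contiguous and near-equal slices, and a binary mask restricts each sample to the slice belonging to its cluster. The names `split_dims`, `kmeans_labels`, and `subspace_mask` are hypothetical. The hierarchical refinement and the conquering stage are not shown.

```python
import numpy as np

def split_dims(dim, k):
    """Cut `dim` embedding dimensions into k contiguous, near-equal slices (assumed layout)."""
    bounds = np.linspace(0, dim, k + 1).astype(int)
    return [slice(int(bounds[i]), int(bounds[i + 1])) for i in range(k)]

def farthest_first_init(x, k):
    """Deterministic initialization: start at x[0], then repeatedly add the farthest point."""
    centers = [x[0]]
    for _ in range(1, k):
        # Squared distance of every point to its nearest already-chosen center.
        d = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[int(d.argmax())])
    return np.stack(centers).astype(float)

def kmeans_labels(x, k, iters=20):
    """Plain Lloyd iterations; k-means is an assumption, not taken from the abstract."""
    centers = farthest_first_init(x, k)
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave an empty cluster's center unchanged
                centers[j] = x[labels == j].mean(axis=0)
    return labels

def subspace_mask(labels, dim, k):
    """One row per sample: 1.0 on the slice assigned to the sample's cluster, 0.0 elsewhere."""
    slices = split_dims(dim, k)
    mask = np.zeros((len(labels), dim))
    for i, c in enumerate(labels):
        mask[i, slices[int(c)]] = 1.0
    return mask
```

Under these assumptions, multiplying a batch of embeddings by `subspace_mask(labels, dim, k)` before computing any existing DML loss would train only the slice tied to each sample's cluster, while the full, unmasked vector would serve as the combined embedding.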

updated: Thu Sep 09 2021 02:57:34 GMT+0000 (UTC)

published: Thu Sep 09 2021 02:57:34 GMT+0000 (UTC)

arXiv
