Benchmarking Omni-Vision Representation through the Lens of Visual Realms

Yuanhan Zhang; Zhenfei Yin; Jing Shao; Ziwei Liu

Though impressive performance has been achieved in specific visual realms (e.g., faces, dogs, and places), an omni-vision representation that generalizes to many natural visual domains is highly desirable. However, existing benchmarks are biased and inefficient for evaluating omni-vision representations: they either include only a few specific realms, or cover most realms at the expense of subsuming numerous datasets with extensive realm overlap. In this paper, we propose the Omni-Realm Benchmark (OmniBenchmark), which includes 21 realm-wise datasets with 7,372 concepts and 1,074,346 images. Free of semantic overlap, these datasets cover most visual realms comprehensively yet efficiently. In addition, we propose a new supervised contrastive learning framework, Relational Contrastive learning (ReCo), for a better omni-vision representation. Beyond pulling two instances of the same concept closer, as typical supervised contrastive learning frameworks do, ReCo also pulls two instances from the same semantic realm closer, encoding the semantic relation between concepts and facilitating omni-vision representation learning. We benchmark ReCo and other advances in omni-vision representation studies, which differ in architecture (from CNNs to transformers) and learning paradigm (from supervised to self-supervised learning), on OmniBenchmark. We demonstrate the superiority of ReCo over other supervised contrastive learning methods and reveal multiple practical observations to facilitate future research.
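
The idea of treating both same-concept and same-realm instances as positives can be illustrated with a small sketch. This is not the paper's actual ReCo loss, which the abstract does not specify. It is a generic supervised-contrastive-style objective written under stated assumptions: same-realm pairs count as positives with a reduced weight, and the function name, the `realm_weight` and `temperature` parameters, and the toy labels are all hypothetical.

```python
import math


def _normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def relational_contrastive_loss(embeddings, concepts, realms,
                                temperature=0.1, realm_weight=0.5):
    """Toy contrastive loss with concept-level and realm-level positives.

    Assumption (not taken from the paper): a pair sharing a concept gets
    weight 1.0, a pair sharing only a realm gets `realm_weight`, and every
    other pair acts purely as a negative.
    """
    z = [_normalize(v) for v in embeddings]
    n = len(z)
    if n < 2:
        return 0.0
    total, counted = 0.0, 0
    for i in range(n):
        # Temperature-scaled cosine similarity to every other instance.
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n) if j != i}
        # Numerically stable log-sum-exp over all non-anchor instances.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        weighted, weight_sum = 0.0, 0.0
        for j, s in sims.items():
            if concepts[j] == concepts[i]:
                w = 1.0
            elif realms[j] == realms[i]:
                w = realm_weight
            else:
                w = 0.0
            weighted += w * (s - log_denom)
            weight_sum += w
        if weight_sum > 0:  # anchors without any positive are skipped
            total -= weighted / weight_sum
            counted += 1
    return total / counted if counted else 0.0


# Hypothetical toy data: three "mammal" instances lie close together,
# and one "plant" instance points the opposite way.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.7, 0.3], [-1.0, 0.0]]
concepts = ["husky", "husky", "beagle", "oak"]
realms = ["mammal", "mammal", "mammal", "plant"]
loss = relational_contrastive_loss(embeddings, concepts, realms)
```

Setting `realm_weight=0.0` in this sketch reduces it to a plain concept-level supervised contrastive loss, which makes the effect of the realm-level term easy to isolate.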

updated: Thu Jul 14 2022 17:58:02 GMT+0000 (UTC)

published: Thu Jul 14 2022 17:58:02 GMT+0000 (UTC)

arXiv
