ViCE: Self-Supervised Visual Concept Embeddings as Contextual and Pixel Appearance Invariant Semantic Representations

Robin Karlsson; Tomoki Hayashi; Keisuke Fujii; Alexander Carballo; Kento Ohtani; Kazuya Takeda

This work presents a self-supervised method for learning dense, semantically rich visual concept embeddings for images, inspired by methods for learning word embeddings in NLP. Our method improves on prior work by generating more expressive embeddings and by being applicable to high-resolution images. Viewing the generation of natural images as a stochastic process in which a set of latent visual concepts gives rise to observable pixel appearances, our method is formulated to learn the inverse mapping from pixels to concepts. Our method greatly improves the effectiveness of self-supervised learning for dense embedding maps by introducing superpixelization as a natural hierarchical step up from pixels to a small set of visually coherent regions. Additional contributions are regional contextual masking with nonuniform shapes matching visually coherent patches, and complexity-based view sampling inspired by masked language models. The enhanced expressiveness of our dense embeddings is demonstrated by significant improvements over the state of the art on representation quality benchmarks for COCO (+12.94 mIoU, +87.6%) and Cityscapes (+16.52 mIoU, +134.2%). Results show favorable scaling and domain generalization properties not demonstrated by prior work.
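
To make the superpixelization step more concrete, here is a minimal sketch of how per-pixel embeddings could be reduced to one vector per visually coherent region by averaging. This is an illustration only, not the authors' implementation: the function name `pool_by_superpixel`, the use of mean pooling, and the assumption that a superpixel label map has already been computed are all hypothetical choices made for this example.

```python
from collections import defaultdict


def pool_by_superpixel(embeddings, labels):
    """Average per-pixel embedding vectors within each superpixel.

    embeddings: H x W nested list of D-dimensional vectors (lists of floats).
    labels:     H x W nested list of integer superpixel ids (assumed to be
                precomputed by some superpixel algorithm; hypothetical input).
    Returns a dict mapping superpixel id -> mean embedding vector.
    """
    sums = {}
    counts = defaultdict(int)
    for emb_row, label_row in zip(embeddings, labels):
        for vec, sp in zip(emb_row, label_row):
            if sp not in sums:
                sums[sp] = [0.0] * len(vec)
            acc = sums[sp]
            for d, value in enumerate(vec):
                acc[d] += value
            counts[sp] += 1
    # Divide each accumulated sum by the number of pixels in that region.
    return {sp: [s / counts[sp] for s in acc] for sp, acc in sums.items()}


# Toy 2x2 "image" with 2-D embeddings and two superpixels
# (left column is region 0, right column is region 1).
embeddings = [[[1.0, 0.0], [0.0, 2.0]],
              [[3.0, 0.0], [0.0, 4.0]]]
labels = [[0, 1],
          [0, 1]]
pooled = pool_by_superpixel(embeddings, labels)
```

In this toy case the four pixel embeddings collapse to two region embeddings, which shows why working at the superpixel level shrinks the number of elements a self-supervised objective has to handle on high-resolution images.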

updated: Wed Nov 24 2021 12:27:30 GMT+0000 (UTC)

published: Wed Nov 24 2021 12:27:30 GMT+0000 (UTC)

arXiv
