VirTex: Learning Visual Representations from Textual Annotations

Karan Desai; Justin Johnson

The de facto approach to many vision tasks is to start from pretrained visual representations, typically learned via supervised training on ImageNet. Recent methods have explored unsupervised pretraining to scale to vast quantities of unlabeled images. In contrast, we aim to learn high-quality visual representations from fewer images. To this end, we revisit supervised pretraining and seek data-efficient alternatives to classification-based pretraining. We propose VirTex -- a pretraining approach using semantically dense captions to learn visual representations. We train convolutional networks from scratch on COCO Captions, and transfer them to downstream recognition tasks including image classification, object detection, and instance segmentation. On all tasks, VirTex yields features that match or exceed those learned on ImageNet -- supervised or unsupervised -- despite using up to ten times fewer images.

updated: Tue Mar 02 2021 12:03:24 GMT+0000 (UTC)

published: Thu Jun 11 2020 17:58:48 GMT+0000 (UTC)

arXiv
