Caption supervision enables robust learners

Benjamin Feuer; Ameya Joshi; Chinmay Hegde

Vision language (VL) models like CLIP are robust to natural distribution shifts, in part because CLIP learns on unstructured data using a technique called caption supervision: the model interprets image-linked texts as ground-truth labels. In a carefully controlled comparison study, we show that CNNs trained with a standard cross-entropy loss can also benefit from caption supervision on the same data, in some cases even more than VL models. To facilitate future experiments with high-accuracy caption-supervised models, we introduce CaptionNet (https://github.com/penfever/CaptionNet/), which includes a class-balanced, fully supervised dataset of over 50,000 new human-labeled, ImageNet-compliant samples together with web-scraped captions. In a series of experiments on CaptionNet, we show how the choice of loss function, data filtration, and supervision strategy enable robust computer vision. We also provide the codebase necessary to reproduce our experiments at https://github.com/penfever/vlhub/.
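As a rough illustration of what caption supervision can look like for a cross-entropy classifier, the sketch below turns image captions into class labels by simple whole-word matching and drops captions that match no class. This is a minimal sketch under stated assumptions, not the procedure used in the paper: the class list, the matching rule, and the helper names `caption_to_label` and `cross_entropy` are all hypothetical.

```python
import math

# Hypothetical class vocabulary; a real setup would use the ImageNet class names.
CLASS_NAMES = ["dog", "cat", "car"]


def caption_to_label(caption, class_names=CLASS_NAMES):
    """Return the index of the first class (in vocabulary order) whose name
    appears as a whole word in the caption, or None if no class matches."""
    tokens = [t.strip(".,!?;:\"'()") for t in caption.lower().split()]
    for idx, name in enumerate(class_names):
        if name in tokens:
            return idx
    return None


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, computed with log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


captions = [
    "A dog running on the beach.",
    "My new car!",
    "Sunset over the hills",
]
labels = [caption_to_label(c) for c in captions]

# Captions that match no class are dropped, a very simple form of data filtration.
kept = [(c, y) for c, y in zip(captions, labels) if y is not None]
```

The caption-derived labels in `kept` can then be fed to an ordinary classifier trained with `cross_entropy`, which is the sense in which a CNN can be "caption supervised" without a contrastive VL objective.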

updated: Thu Oct 13 2022 22:29:10 GMT+0000 (UTC)

published: Thu Oct 13 2022 22:29:10 GMT+0000 (UTC)

arXiv
