I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision

Sophia Gu; Christopher Clark; Aniruddha Kembhavi

Many high-level skills that are required for computer vision tasks, such as parsing questions, comparing and contrasting semantics, and writing descriptions, are also required in other domains such as natural language processing. In this paper, we ask whether it is possible to learn those skills from text data and then transfer them to vision tasks without ever training on visual training data. Key to our approach is exploiting the joint embedding space of contrastively trained vision and language encoders. In practice, there can be systematic differences between embedding spaces for different modalities in contrastive models, and we analyze how these differences affect our approach and study strategies to mitigate this concern. We produce models using only text training data on four representative tasks: image captioning, visual entailment, visual question answering, and visual news captioning, and evaluate them on standard benchmarks using images. We find these models perform close to models trained on images, while surpassing prior work for captioning and visual entailment in this text-only setting by over 9 points, and outperforming all prior work on visual news by over 30 points. We also showcase a variety of stylistic image captioning models that are trained using no image data and no human-curated language data, but instead using readily available text data from books, the web, or language models.
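The core idea in the abstract (fit a task model on text embeddings, then feed it image embeddings from the same joint space at test time) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the embeddings are synthetic stand-ins rather than outputs of a real contrastive encoder, the modality gap is simulated as a constant offset, the task is a simple classification problem, and adding Gaussian noise to the text embeddings during training is one plausible mitigation that the abstract itself does not name. The helper names (`make_toy_embeddings`, `train_on_text`, `predict`) are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def make_toy_embeddings(rng, prototypes, n_per_class, offset, jitter=0.05):
    """Hypothetical stand-in for a contrastive encoder:
    class prototype + modality-specific offset + small jitter."""
    labels = np.repeat(np.arange(len(prototypes)), n_per_class)
    noise = jitter * rng.standard_normal((len(labels), prototypes.shape[1]))
    return l2_normalize(prototypes[labels] + offset + noise), labels

def train_on_text(text_emb, labels, n_classes, rng,
                  noise_std=0.1, steps=200, lr=1.0):
    """Softmax regression fit on text embeddings only.
    Fresh Gaussian noise is added at every step (assumed mitigation)."""
    w = np.zeros((text_emb.shape[1], n_classes))
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        x = l2_normalize(text_emb + noise_std * rng.standard_normal(text_emb.shape))
        logits = x @ w
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        w -= lr * x.T @ (probs - onehot) / len(x)    # cross-entropy gradient step
    return w

def predict(w, emb):
    """Pick the highest-scoring class for each embedding."""
    return (emb @ w).argmax(axis=1)

rng = np.random.default_rng(0)
dim, n_classes = 64, 4
prototypes = l2_normalize(rng.standard_normal((n_classes, dim)))
gap = 0.5 * l2_normalize(rng.standard_normal(dim))   # simulated modality gap

# Training data: "text" embeddings only. Test data: "image" embeddings only.
text_emb, text_labels = make_toy_embeddings(rng, prototypes, 50, offset=0.0)
image_emb, image_labels = make_toy_embeddings(rng, prototypes, 50, offset=gap)

w = train_on_text(text_emb, text_labels, n_classes, rng)
accuracy = float((predict(w, image_emb) == image_labels).mean())
print(f"accuracy on image embeddings after text-only training: {accuracy:.2f}")
```

In this toy setting the classifier never sees an image-side vector during training, yet it can still be applied to the shifted image-side vectors because both share the same prototype structure. Real contrastive embedding spaces are far less clean than this simulation, which is why the abstract treats the modality differences as something to analyze and mitigate.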

updated: Fri Aug 18 2023 23:43:42 GMT+0000 (UTC)

published: Thu Nov 17 2022 18:52:19 GMT+0000 (UTC)

arXiv
