I Can't Believe There's No Images! Learning Visual Tasks Using only Language Data

Sophia Gu; Christopher Clark; Aniruddha Kembhavi

Many high-level skills that are required for computer vision tasks, such as parsing questions, comparing and contrasting semantics, and writing descriptions, are also required in other domains such as natural language processing. In this paper, we ask whether it is possible to learn those skills from textual data and then transfer them to vision tasks without ever training on visual training data. Key to our approach is exploiting the joint embedding space of contrastively trained vision and language encoders. In practice, there can be systematic differences between embedding spaces for different modalities in contrastive models, and we analyze how these differences affect our approach and study strategies to mitigate this concern. We produce models using only text training data on four representative tasks: image captioning, visual entailment, visual question answering, and visual news, and evaluate them on standard benchmarks using images. We find these models generally perform close to models trained on images, while surpassing prior work for captioning and visual entailment in this text-only setting by over 9 points, and outperforming all prior work on visual news by over 30 points. We also showcase a variety of stylistic image captioning models that are trained using no image data and no human-curated language data, but instead using readily available text data from books, the web, or language models.
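
The core idea in the abstract, training a task model on text embeddings and then feeding it image embeddings from the same joint space at test time, can be illustrated with a toy sketch. This is not the authors' implementation. Everything here is an assumption made for illustration: the embeddings are synthetic stand-ins rather than outputs of real contrastive encoders, the modality gap is modeled as a hypothetical constant offset, and adding Gaussian noise to the text vectors during training is just one plausible mitigation strategy, not necessarily the one the paper studies.

```python
import numpy as np

def normalize(x):
    """Scale each row to unit L2 norm, as contrastive encoders typically do."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def make_synthetic_embeddings(rng, prototypes, n_per_class, gap=None):
    """Stand-in for encoder outputs: class prototype + noise (+ modality offset)."""
    k, d = prototypes.shape
    labels = np.repeat(np.arange(k), n_per_class)
    x = prototypes[labels] + 0.1 * rng.standard_normal((labels.size, d))
    if gap is not None:
        x = x + gap  # hypothetical constant shift between modalities
    return normalize(x), labels

def train_softmax(x, y, num_classes, rng, noise_std=0.0, steps=300, lr=1.0):
    """Fit a linear softmax classifier with full-batch gradient descent."""
    w = np.zeros((x.shape[1], num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[y]
    for _ in range(steps):
        xin = x
        if noise_std > 0:
            # Assumed mitigation: perturb text vectors so the classifier
            # tolerates inputs that sit slightly off the text distribution.
            xin = normalize(x + noise_std * rng.standard_normal(x.shape))
        logits = xin @ w + b
        logits = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / len(y)
        w -= lr * (xin.T @ grad)
        b -= lr * grad.sum(axis=0)
    return w, b

def accuracy(w, b, x, y):
    """Fraction of rows whose highest-scoring class matches the label."""
    return float(np.mean((x @ w + b).argmax(axis=1) == y))

rng = np.random.default_rng(0)
num_classes, dim = 4, 32
prototypes = normalize(rng.standard_normal((num_classes, dim)))
gap = 0.5 * normalize(rng.standard_normal((1, dim)))

# "Text" vectors are used for training; "image" vectors are only seen at test time.
text_x, text_y = make_synthetic_embeddings(rng, prototypes, 200)
image_x, image_y = make_synthetic_embeddings(rng, prototypes, 200, gap=gap)

w, b = train_softmax(text_x, text_y, num_classes, rng, noise_std=0.1)
text_acc = accuracy(w, b, text_x, text_y)
image_acc = accuracy(w, b, image_x, image_y)
print(f"text accuracy: {text_acc:.3f}, image accuracy: {image_acc:.3f}")
```

In a real setting, the two `make_synthetic_embeddings` calls would be replaced by the text and image encoders of a contrastively trained model, and the linear classifier by whatever task head (captioning decoder, entailment classifier, and so on) the task requires.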

updated: Tue Mar 21 2023 04:54:55 GMT+0000 (UTC)

published: Thu Nov 17 2022 18:52:19 GMT+0000 (UTC)

arXiv
